runtime: Deadlock in SocketAsyncContext.Unix.cs

While running XUnit tests on CI (Azure Pipelines) they would almost always hang on macos (we tried both 10.13 and 10.14 on azure pipelines) but never on linux and windows. Moreover the tests would only hang in RELEASE builds of macos but never (as far as I remember) on DEBUG builds. We’ve tested on both .net core 3.0 and .net core 3.1-preview 3.

After attaching lldb to the process and getting a thread dump, it appears that there’s a deadlock in SocketAsyncContext.Unix.cs (full logs attached for 2 instances):

000070000EF32550 0000000111b78efe (MethodDesc 0000000111cf2ee0 + 0x6e System.Net.Sockets.SocketAsyncContext+AsyncOperation.TryCancel()), calling (MethodDesc 000000011191bd68 + 0 System.Threading.SpinWait.SpinOnce())
000070000EF32580 0000000111b7a7bd (MethodDesc 0000000111cf3128 + 0x3d System.Net.Sockets.SocketAsyncContext+OperationQueue`1[[System.__Canon, System.Private.CoreLib]].StopAndAbort(System.Net.Sockets.SocketAsyncContext)), calling (MethodDesc 0000000111cf2ee0 + 0 System.Net.Sockets.SocketAsyncContext+AsyncOperation.TryCancel())
000070000EF325B0 0000000111b76eec (MethodDesc 0000000111cf2410 + 0x3c System.Net.Sockets.SocketAsyncContext.Close()), calling (MethodDesc 0000000111cf3128 + 0 System.Net.Sockets.SocketAsyncContext+OperationQueue`1[[System.__Canon, System.Private.CoreLib]].StopAndAbort(System.Net.Sockets.SocketAsyncContext))
000070000EF325E0 0000000111b4f0ea (MethodDesc 0000000111bc5ec0 + 0xca System.Net.Sockets.SafeSocketHandle.ReleaseHandle()), calling (MethodDesc 0000000111cf2410 + 0 System.Net.Sockets.SocketAsyncContext.Close())
000070000EF32600 000000011b4ea729 (MethodDesc 00000001113d1dd8 + 0xa9 System.Runtime.InteropServices.SafeHandle.InternalRelease(Boolean))
000070000EF32640 000000011b4ed8ca (MethodDesc 000000011140faa0 + 0x1a System.StubHelpers.StubHelpers.SafeHandleRelease(System.Runtime.InteropServices.SafeHandle)), calling (MethodDesc 00000001113d1dd8 + 0 System.Runtime.InteropServices.SafeHandle.InternalRelease(Boolean))
000070000EF32660 000000011137249d (MethodDesc 0000000111d93640 + 0xcd ILStubClass.IL_STUB_PInvoke(System.Runtime.InteropServices.SafeHandle, MessageHeader*, System.Net.Sockets.SocketFlags, Int64*)), calling (MethodDesc 000000011140faa0 + 0 System.StubHelpers.StubHelpers.SafeHandleRelease(System.Runtime.InteropServices.SafeHandle))
000070000EF326A0 000000011137246b (MethodDesc 0000000111d93640 + 0x9b ILStubClass.IL_STUB_PInvoke(System.Runtime.InteropServices.SafeHandle, MessageHeader*, System.Net.Sockets.SocketFlags, Int64*))
000070000EF32700 0000000111b50d7d (MethodDesc 0000000111bc6428 + 0x10d System.Net.Sockets.SocketPal.Receive(System.Net.Sockets.SafeSocketHandle, System.Net.Sockets.SocketFlags, System.Span`1<Byte>, Byte[], Int32 ByRef, System.Net.Sockets.SocketFlags ByRef, Error ByRef)), calling 0000000111367e68 (stub for Interop+Sys.ReceiveMessage(System.Runtime.InteropServices.SafeHandle, MessageHeader*, System.Net.Sockets.SocketFlags, Int64*))
000070000EF327D0 0000000111b50b25 (MethodDesc 0000000111bc6548 + 0x65 System.Net.Sockets.SocketPal.TryCompleteReceiveFrom(System.Net.Sockets.SafeSocketHandle, System.Span`1<Byte>, System.Collections.Generic.IList`1<System.ArraySegment`1<Byte>>, System.Net.Sockets.SocketFlags, Byte[], Int32 ByRef, Int32 ByRef, System.Net.Sockets.SocketFlags ByRef, System.Net.Sockets.SocketError ByRef)), calling (MethodDesc 0000000111bc6428 + 0 System.Net.Sockets.SocketPal.Receive(System.Net.Sockets.SafeSocketHandle, System.Net.Sockets.SocketFlags, System.Span`1<Byte>, Byte[], Int32 ByRef, System.Net.Sockets.SocketFlags ByRef, Error ByRef))
000070000EF327F0 000000010255e36d libcoreclr.dylib!ReleaseSemaphore + 0x3d, calling libcoreclr.dylib!CorUnix::InternalReleaseSemaphore(CorUnix::CPalThread*, void*, int, int*)
000070000EF32850 0000000111b5116f (MethodDesc 0000000111d93870 + 0xaf System.Net.Sockets.SocketAsyncContext+BufferMemoryReceiveOperation.DoTryComplete(System.Net.Sockets.SocketAsyncContext)), calling (MethodDesc 0000000111bc6548 + 0 System.Net.Sockets.SocketPal.TryCompleteReceiveFrom(System.Net.Sockets.SafeSocketHandle, System.Span`1<Byte>, System.Collections.Generic.IList`1<System.ArraySegment`1<Byte>>, System.Net.Sockets.SocketFlags, Byte[], Int32 ByRef, Int32 ByRef, System.Net.Sockets.SocketFlags ByRef, System.Net.Sockets.SocketError ByRef))
000070000EF328A0 0000000111b7a44e (MethodDesc 0000000111cf3108 + 0xae System.Net.Sockets.SocketAsyncContext+OperationQueue`1[[System.__Canon, System.Private.CoreLib]].ProcessQueuedOperation(System.__Canon)), calling 00000001109eb0a0
000070000EF32920 0000000111b7a35f (MethodDesc 0000000111cf30f8 + 0x2f System.Net.Sockets.SocketAsyncContext+OperationQueue`1[[System.__Canon, System.Private.CoreLib]].ProcessAsyncOperation(System.__Canon)), calling 000000011136b0b0 (stub for System.Net.Sockets.SocketAsyncContext+OperationQueue`1[[System.__Canon, System.Private.CoreLib]].ProcessQueuedOperation(System.__Canon))
000070000EF32950 0000000111b510ab (MethodDesc 0000000111cf82d0 + 0x2b System.Net.Sockets.SocketAsyncContext+ReadOperation.System.Threading.IThreadPoolWorkItem.Execute()), calling 000000011136b0a8 (stub for System.Net.Sockets.SocketAsyncContext+OperationQueue`1[[System.__Canon, System.Private.CoreLib]].ProcessAsyncOperation(System.__Canon))
000070000EF32960 000000011b4ef096 (MethodDesc 000000011198f378 + 0x1f6 System.Threading.ThreadPoolWorkQueue.Dispatch())

The deadlock seems to happen in this way:

  1. an AsyncOperation is queued, for instance BufferMemoryReceiveOperation as seen in the above stack trace
  2. the operation starts being executed on the thread pool
  3. after a series of calls, SocketPal.Receive makes a call to Interop.Sys.ReceiveMessage which calls some unmanaged OS-specific socket code. Eventually the socket handle is released in the unmanaged code which triggers SocketAsyncContext.Close()
  4. SocketAsyncContext.Close() attempts to call TryCancel() which waits indefinitely for all async operations to complete: https://github.com/dotnet/corefx/blob/master/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs#L212
  5. BufferMemoryReceiveOperation is itself an async operation, so it’s in effect waiting for itself to complete. Deadlock.

It appears that the reason it would not hang in DEBUG builds is because of this line: https://github.com/dotnet/corefx/blob/master/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs#L659 which is called here: https://github.com/dotnet/corefx/blob/master/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs#L1096

I’ve attached two instances of the problem occurring but we can reproduce the problem consistently if any additional information is required: macos_deadlock.zip

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 7
  • Comments: 15 (10 by maintainers)

Commits related to this issue

Most upvoted comments

I tried to reproduce this using the code in #381. Did run the tests on several OSX versions (10.13.4, 10.14.2 and 10.15.6) with both 3.1 and current master, and had no success.

I think the test code in the PR is not sufficient to trigger the behavior reported in the opening comment, since it has only one receive call which will be executed synchronously and no BufferMemoryReceiveOperation will be queued. The exception sometimes experienced in that test seems to come from CreateConnectedSocketPair() and is unrelated to the suspected deadlock behavior.

I tried a more advanced approach in order to enforce actual async work in the underlying OS stack, but got no deadlock on 10.14.2 (the only machine available for me for SSH debugging ATM).

I’m not sure if it’s possible to reproduce the issue by disposing the socket manually, but if yes, the problem should not be specific to macOS and should also repro on all Unixes.

I’m afraid we are missing other important circumstances that are also necessary for the deadlock to occur. The issue might be rooted in some unknown native bahivor occuring only on mac. From the opening comment:

  1. after a series of calls, SocketPal.Receive makes a call to Interop.Sys.ReceiveMessage which calls some unmanaged OS-specific socket code. Eventually the socket handle is released in the unmanaged code which triggers SocketAsyncContext.Close()

Do we know how is it possible for a socket close to be triggered from (the same?) unmanaged code while a recv is ongoing?

I don’t think it wouldn’t be wise to PR any changes without making sure we have a repro. @geoffkizer @tmds do you have any ideas what else could I try in my code? Or what can actually go wrong in the quoted point 3?

If no, I suggest to push this to Future.

Thank you I have created new issue about that case.