runtime: Unknown socket error

Description

We have an application based on the dotnet new webapi project template that just adds an IHostedService to do background processing. Kestrel is used so the service can answer health checks from Kubernetes, as well as feed new requests into the service to be queued. Our background processing logic involves pretty heavy interaction with AWS resources (SQS, S3, STS), and we are encountering SocketExceptions that have been very difficult to pin down.

The canonical guidance to use a static HttpClient instance application-wide does not work well when using the AWS clients in Amazon's SDK packages. The various AWS client constructors, and the config objects you can hand them, do not accept an instance of HttpClient. Instead, you must either rely on their built-in caching of HttpClient instance(s) created internally, or provide an implementation that derives from their HttpClientFactory. This is fairly straightforward if you use IHttpClientFactory in .NET: the CreateHttpClient() override you implement simply calls CreateClient() on an instance of IHttpClientFactory provided via DI.
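
Roughly, the adapter looks like the following minimal sketch (the class name and the "aws" client name are our own, not SDK conventions):

```csharp
using System.Net.Http;
using Amazon.Runtime;

// Illustrative adapter: the AWS SDK's HttpClientFactory delegating to the
// IHttpClientFactory provided via DI. The "aws" client name is our choice.
public sealed class DelegatingAwsHttpClientFactory : Amazon.Runtime.HttpClientFactory
{
    private readonly IHttpClientFactory _factory;

    public DelegatingAwsHttpClientFactory(IHttpClientFactory factory)
        => _factory = factory;

    public override HttpClient CreateHttpClient(IClientConfig clientConfig)
        => _factory.CreateClient("aws");
}
```

The factory is then assigned to the HttpClientFactory property on each client's config object (e.g. AmazonSQSConfig) before constructing the client.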

However, even when providing this HttpClientFactory to every AWS client instantiated anywhere in this service, the SocketException problem is not solved. I will provide the stack trace below, but the properties of the actual SocketException object are extremely unhelpful:

Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074

The twist here is that these errors have only been encountered on Linux, in containers. Extended test runs on Windows under load cannot reproduce the problem.

The other twist is that our Kestrel calls are producing System.InvalidOperationException: Handle is already used by another Socket. errors (logged at Warning level) in Kestrel's handling pipeline. It would appear that Kestrel and the connection pool used underneath IHttpClientFactory are stomping on each other. If I am mischaracterizing the issue and there is a different cause, please let me know.

The fact that the runtime is throwing a SocketException with an error code that isn't even in the spec for TCP errors is the reason I am bringing it to this group. The runtime is burping up what would appear to be the equivalent of the default case in a switch statement.

Any help or guidance here is appreciated.

Reproduction Steps

The basic structure is a Web API project created with dotnet new webapi, with an IHostedService doing background processing that interacts heavily with AWS resources; a skeleton is sketched below. I can provide application code privately, as needed, so as not to share proprietary logic and access keys publicly.
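
Skeleton only (names, routes, and the work loop are placeholders; this assumes the .NET 6 webapi template's implicit usings):

```csharp
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHttpClient();
builder.Services.AddHostedService<QueueProcessor>();

var app = builder.Build();
app.MapGet("/healthz", () => Results.Ok());
app.Run();

public sealed class QueueProcessor : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // The real service polls SQS, reads/writes S3, and calls STS here.
            await Task.Delay(TimeSpan.FromSeconds(1), stoppingToken);
        }
    }
}
```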

Expected behavior

Interaction with AWS resources from a web application should not cause SocketExceptions.

Actual behavior

We are encountering SocketExceptions that have error codes that aren’t even part of the TCP specs:

Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074

Full stack trace:

System.Net.Http.HttpRequestException: Unknown socket error (sqs.us-west-2.amazonaws.com:443)
 ---> System.Net.Sockets.SocketException (0xFFFDFFFE): Unknown socket error
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|277_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at Microsoft.Extensions.Http.Logging.LoggingHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Microsoft.Extensions.Http.PolicyHttpMessageHandler.SendCoreAsync(HttpRequestMessage request, Context context, CancellationToken cancellationToken)
   at Polly.Retry.AsyncRetryEngine.ImplementationAsync[TResult](Func`3 action, Context context, CancellationToken cancellationToken, ExceptionPredicates shouldRetryExceptionPredicates, ResultPredicates`1 shouldRetryResultPredicates, Func`5 onRetryAsync, Int32 permittedRetryCount, IEnumerable`1 sleepDurationsEnumerable, Func`4 sleepDurationProvider, Boolean continueOnCapturedContext)
   at Polly.AsyncPolicy`1.ExecuteAsync(Func`3 action, Context context, CancellationToken cancellationToken, Boolean continueOnCapturedContext)
   at Microsoft.Extensions.Http.PolicyHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Microsoft.Extensions.Http.Logging.LoggingScopeHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at Amazon.Runtime.HttpWebRequestMessage.GetResponseAsync(CancellationToken cancellationToken)
   at Amazon.Runtime.Internal.HttpHandler`1.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.Unmarshaller.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.SQS.Internal.ValidationResponseHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CredentialsRetriever.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.ErrorCallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.MetricsHandler.InvokeAsync[T](IExecutionContext executionContext)
   at <our code starts here>

Regression?

The application in its current form is a .NET 6 solution. In a prior version it was .NET 5, and the SocketExceptions were very rare, if not non-existent.

However, even in the .NET 5 solution, Kestrel was throwing the System.InvalidOperationException: Handle is already used by another Socket. warnings when handling basic health check calls.

Known Workarounds

We have implemented heavy retry policies using Polly (roughly as sketched below), which allow the application to move past the errors, but it quite often takes a near-complete restart of the background processing logic to clear up the SocketExceptions and allow the application to restart a job and create new connections to the AWS resources being used.
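
The wiring is roughly like the following (retry counts and delays are illustrative, not our production values; the "aws" name matches the named client in the factory sketch above):

```csharp
using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;

public static class AwsHttpClientSetup
{
    public static IServiceCollection AddAwsHttpClient(this IServiceCollection services)
    {
        // HandleTransientHttpError() covers HttpRequestException (which wraps
        // the SocketException) as well as 5xx and 408 responses.
        IAsyncPolicy<HttpResponseMessage> retry = HttpPolicyExtensions
            .HandleTransientHttpError()
            .WaitAndRetryAsync(
                retryCount: 5,
                sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

        services.AddHttpClient("aws").AddPolicyHandler(retry);
        return services;
    }
}
```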

Configuration

  • .NET 6
  • Ubuntu 20.04 via the base Docker images provided directly by Microsoft
  • x64, Kubernetes in AWS EKS

The errors appear only when running in containers on Linux.

Other information

N/A

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 72 (40 by maintainers)

Most upvoted comments

I feel like the docs are just using the SafeFileHandle as an example of how to dispose “a handle”. You should never actually use new SafeFileHandle(IntPtr.Zero, true);. It makes no sense - it’s a no-op on Windows and triggers armageddon on Linux.

Yes, the docs should be improved here. The intent, I'm sure, was to highlight how an IDisposable that owns a resource (a safe handle) can dispose of it. That should be made clearer, and an easy fix to the code (beyond comments making clear that you shouldn't actually copy that SafeFileHandle into your own code) is to change the 0 to a -1.
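
To make the failure mode concrete (my reading of the discussion above, not authoritative):

```csharp
using System;
using Microsoft.Win32.SafeHandles;

// The docs sample's pitfall: on Linux, fd 0 is a perfectly valid descriptor,
// so disposing this closes fd 0 out from under whatever owns it (possibly a
// Socket that was later assigned fd 0). On Windows, handle 0 is invalid and
// disposal is a no-op, which is why the bug hides there.
var bad = new SafeFileHandle(IntPtr.Zero, ownsHandle: true);

// The suggested fix: -1 is invalid on both Windows and Linux, so the
// SafeHandle machinery never attempts to release it.
var harmless = new SafeFileHandle(new IntPtr(-1), ownsHandle: true);
```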

The docs are open source. Contributions to improve them are welcome.

Holy crap, my teammate solved it.

This was the entire fix: [image; per the surrounding discussion, a one-line change to a new SafeFileHandle(IntPtr.Zero, true) call in a docs-derived Dispose pattern]

I’m still in disbelief about it.

How? Why? How did this code function perfectly for 100s of requests and then suddenly stop, ruining all traffic to and from the service? Why would it sometimes recover? Why was it crazily intermittent? What even IS a SafeHandle? I’ve never used one, personally.

Why did this function on Linux at all if it literally says Win32 in the file name?! Why did it sometimes stop functioning?! Why was it even more intermittent back in .NET Core 2.2?!

But beyond the questions, what I have most right now is joy. I’m so happy to be done with this. Thank God.

That doesn’t look like anything to do with the file system unless that’s under the hood.

Yes, this stacktrace doesn’t tell us much. We’re interested to find out who registered the first Socket for this fd.

Shouldn’t we persist the native error code instead of mapping it there and back, inevitably losing information? That seems like a nontrivial refactor, though, and requires a new SocketException constructor, or changing the current constructor's semantics (a hypothetical shape is sketched after the link):

https://github.com/dotnet/runtime/blob/73ddf6e50e20a81492209d14588a05ee9a2b68d4/src/libraries/System.Net.Primitives/src/System/Net/SocketException.cs#L19-L31
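
Something like this hypothetical shape (not a real .NET API, just to illustrate the suggestion):

```csharp
using System.Net.Sockets;

// Hypothetical illustration only: carry the raw OS error alongside the
// mapped SocketError instead of deriving one from the other.
public class SocketExceptionWithNativeCode : SocketException
{
    public SocketExceptionWithNativeCode(SocketError socketError, int rawNativeError)
        : base((int)socketError)
    {
        RawNativeError = rawNativeError;
    }

    // The unmapped errno/Win32 code exactly as the OS reported it.
    public int RawNativeError { get; }
}
```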

Actually, this fragment from TryCompleteConnect would set the Unknown error regardless of the OS failure:

https://github.com/dotnet/runtime/blob/1df36c702a46363336b6ea5d0d9558d513e7be8e/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs#L709-L714

It has existed for a very long time, but the path above with a failed Interop.Sys.Poll perhaps has not. I wish we had an error trace there, but we don't. If there are experiments with an instrumented build, it would be nice to know whether we are hitting any of the error paths here, @antonfirsov.

  • About Handle is already used by another Socket:

I would go with the idea from #64305 (comment), and share a private build of System.Net.Sockets.dll which contains some extra logging.

I’ve implemented the stacktrace tracking in https://github.com/tmds/runtime/tree/socketcontext_stacktrace. Below is the System.Net.Sockets.dll which you can place in a 6.0.101 sdk at shared/Microsoft.NETCore.App/6.0.1/System.Net.Sockets.dll. The exception message will include the stacktrace of the first registration.

System.Net.Sockets.dll.tar.gz

  • About Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074:

So I commented out every call to the NewRelic Agent, every using, every NuGet reference. The system has been under load for around 3 hours now and the errors have disappeared.

Perhaps the agent makes the request that causes the socket error, so removing it gets rid of the socket error.

It doesn’t mean the agent does something wrong. It can still be an issue with .NET.

The weird NativeErrorCode suggests .NET has an issue understanding/dealing with an error that happens during connect.

There is no clear relation (for me) between newrelic/newrelic-dotnet-agent#803 and the unknown socket error.

@antonfirsov it doesn’t happen often, but when it does, it is a pain to debug. And when I see this, I wonder: can there still be a bug in SocketAsyncEngine?

Thinking out loud.

We could add some envvar which causes the SocketAsyncEngine to track, for each SocketAsyncContext, the Environment.StackTrace of the code that registered it. Then, when we throw this exception, we could include that stack trace.

Or we extend the event source logging so it contains the StackTrace, and the user can try to match the handle values. A rough sketch of the tracking idea follows.
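
In rough, hypothetical terms (SocketAsyncEngine and SocketAsyncContext are runtime internals, so this only sketches the mechanism, not the actual implementation):

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch: remember the stack trace of the first registration
// per handle, and surface it if the same handle is registered again.
internal static class RegistrationTracker
{
    private static readonly ConcurrentDictionary<IntPtr, string> s_firstRegistration = new();

    public static void OnRegister(IntPtr handle)
    {
        if (!s_firstRegistration.TryAdd(handle, Environment.StackTrace))
        {
            throw new InvalidOperationException(
                "Handle is already used by another Socket. First registered at:\n" +
                s_firstRegistration[handle]);
        }
    }

    public static void OnUnregister(IntPtr handle) =>
        s_firstRegistration.TryRemove(handle, out _);
}
```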

About Handle is already used by another Socket:

There is a related issue (https://github.com/dotnet/runtime/issues/56750, which can probably be closed). A blog post is mentioned there; in that case, the issue disappeared after removing a 3rd-party library: https://zblesk.net/blog/aspnetcore-identity-litedb-breaks-on-ubuntu/.

cc @antonfirsov @karelz

About Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074:

Maybe you can make a small reproducer by making HttpClient perform calls against sqs.us-west-2.amazonaws.com:443, along these lines:
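
A minimal sketch (loop count and logging are arbitrary):

```csharp
using System;
using System.Net.Http;
using System.Net.Sockets;

// Hammer the same endpoint with a plain HttpClient and log any
// HttpRequestException that wraps a SocketException.
using var client = new HttpClient();
for (int i = 0; i < 10_000; i++)
{
    try
    {
        using var response = await client.GetAsync("https://sqs.us-west-2.amazonaws.com/");
        Console.WriteLine($"{i}: {(int)response.StatusCode}");
    }
    catch (HttpRequestException ex) when (ex.InnerException is SocketException se)
    {
        Console.WriteLine($"{i}: SocketError={se.SocketErrorCode}, Native={se.NativeErrorCode}");
    }
}
```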

Any chance you can run it under strace to see what is happening at the OS level? The error will likely come from the kernel, so the base OS is probably more important than the container.
cc: @tmds