runtime: Unknown socket error
Description
We have an application based on the dotnet new webapi project template that just adds an IHostedService to do background processing. Kestrel is used in order to answer health checks from Kubernetes and to feed new requests into the service to be queued. Our background processing logic involves pretty heavy interaction with AWS resources (SQS, S3, STS), and we are encountering SocketExceptions that have been very difficult to pin down.
The canonical guidance to use a static HttpClient instance application-wide does not work well with the AWS clients in Amazon’s SDK packages. The various AWS client constructors, and the config objects you can hand in, do not accept an instance of HttpClient. Instead, you must either rely on their built-in caching of HttpClient instance(s) that are created internally, or provide an implementation that derives from their HttpClientFactory. This is fairly straightforward if you use IHttpClientFactory in .NET: the CreateHttpClient() override that you implement simply calls CreateClient() on an instance of IHttpClientFactory provided via DI.
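For readers unfamiliar with the pattern, here is a minimal sketch of such an adapter. It assumes the AWS SDK for .NET exposes an abstract Amazon.Runtime.HttpClientFactory with a CreateHttpClient(IClientConfig) method, and that Microsoft.Extensions.Http is referenced; the class name and the "aws" client name are hypothetical.

```csharp
using System.Net.Http;
using Amazon.Runtime;

// Hypothetical adapter: lets AWS service clients obtain their HttpClient
// from the IHttpClientFactory registered in DI.
public sealed class DiBackedAwsHttpClientFactory : Amazon.Runtime.HttpClientFactory
{
    private readonly IHttpClientFactory _inner;
    private readonly string _clientName;

    public DiBackedAwsHttpClientFactory(IHttpClientFactory inner, string clientName = "aws")
    {
        _inner = inner;
        _clientName = clientName;
    }

    // Invoked by the SDK when it needs an HttpClient; delegates to the DI factory,
    // which pools and recycles the underlying message handlers.
    public override HttpClient CreateHttpClient(IClientConfig clientConfig)
        => _inner.CreateClient(_clientName);
}
```

An instance of this class would then be assigned to the HttpClientFactory property of each service config object (for example an AmazonSQSConfig) before the client is constructed.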
However, even when providing this HttpClientFactory to all instances of any AWS client that is instantiated anywhere and everywhere in this service, it does not solve the SocketException problem. I will provide the stack trace below, but the properties of the actual SocketException object are extremely unhelpful:
Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074
The twist here is that these errors have only been encountered on Linux/in containers. Extended test runs on Windows under load have not reproduced the problem.
The other twist is that our Kestrel calls are producing System.InvalidOperationException: Handle is already used by another Socket. errors (they are logged as Warning level) in Kestrel’s handling pipeline. It would appear that Kestrel and the HttpWebRequest pool used underneath IHttpClientFactory are stomping on each other. If I am mischaracterizing the issue and there is a different cause, please let me know.
The fact that the runtime is throwing a SocketException with an error code that isn’t even in the spec for TCP errors (nice article here) is the reason I am bringing it to this group. The runtime is burping up what would appear to be the equivalent of the default case in a switch statement.
Any help or guidance here is appreciated.
Reproduction Steps
The basic structure is a Web API project created with dotnet new webapi, with an IHostedService doing background processing that interacts heavily with AWS resources. I can provide application code privately, as needed, so as to not have to share proprietary/private logic and access keys publicly.
Expected behavior
Interaction with AWS resources from a web application should not cause SocketExceptions.
Actual behavior
We are encountering SocketExceptions that have error codes that aren’t even part of the TCP specs:
Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074
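As a quick sanity check on the numbers, the snippet below shows that -131074 and the 0xFFFDFFFE printed in the stack trace are the same 32-bit value, and that it does not match any named SocketError member. It is only an illustration of the arithmetic, not a diagnosis of where the value comes from.

```csharp
using System;
using System.Net.Sockets;

int nativeErrorCode = -131074; // value reported in the exception properties

// Reinterpret the signed value as an unsigned 32-bit integer to get the
// hexadecimal form that the exception message shows in parentheses.
string hex = unchecked((uint)nativeErrorCode).ToString("X8");
Console.WriteLine($"0x{hex}"); // 0xFFFDFFFE

// The value is not one of the named members of the SocketError enum.
bool isNamed = Enum.IsDefined(typeof(SocketError), nativeErrorCode);
Console.WriteLine(isNamed); // False
```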
Full stack trace:
System.Net.Http.HttpRequestException: Unknown socket error (sqs.us-west-2.amazonaws.com:443)
---> System.Net.Sockets.SocketException (0xFFFDFFFE): Unknown socket error
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|277_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request)
at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at Microsoft.Extensions.Http.Logging.LoggingHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at Microsoft.Extensions.Http.PolicyHttpMessageHandler.SendCoreAsync(HttpRequestMessage request, Context context, CancellationToken cancellationToken)
at Polly.Retry.AsyncRetryEngine.ImplementationAsync[TResult](Func`3 action, Context context, CancellationToken cancellationToken, ExceptionPredicates shouldRetryExceptionPredicates, ResultPredicates`1 shouldRetryResultPredicates, Func`5 onRetryAsync, Int32 permittedRetryCount, IEnumerable`1 sleepDurationsEnumerable, Func`4 sleepDurationProvider, Boolean continueOnCapturedContext)
at Polly.AsyncPolicy`1.ExecuteAsync(Func`3 action, Context context, CancellationToken cancellationToken, Boolean continueOnCapturedContext)
at Microsoft.Extensions.Http.PolicyHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at Microsoft.Extensions.Http.Logging.LoggingScopeHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
at Amazon.Runtime.HttpWebRequestMessage.GetResponseAsync(CancellationToken cancellationToken)
at Amazon.Runtime.Internal.HttpHandler`1.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.Unmarshaller.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.SQS.Internal.ValidationResponseHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CredentialsRetriever.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.ErrorCallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.MetricsHandler.InvokeAsync[T](IExecutionContext executionContext)
at <our code starts here>
Regression?
The application in its current form is a .NET 6 solution. In a prior version, it was .NET 5 and the SocketExceptions were very rare, if not non-existent.
However, even in the .NET 5 solution, Kestrel was logging the System.InvalidOperationException: Handle is already used by another Socket. warnings when handling basic health check calls.
Known Workarounds
We have implemented heavy retry policies using Polly, which allows the application to move past the errors, but it quite often takes a near complete restart of the background processing logic to clear up the SocketExceptions and allow the application to restart a job and create new connections to the AWS resources being used.
Configuration
- .NET 6.
- Ubuntu 20.04 via the base docker images provided directly by Microsoft
- x64, Kubernetes in AWS EKS
The errors only appear running in containers on Linux.
Other information
N/A
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 72 (40 by maintainers)
Yes, the docs should be improved here. The intent I’m sure was to highlight how an IDisposable that owns a resource (a safe handle) can dispose of it. That should be made more clear, and an easy fix to the code (beyond comments that make it clear you shouldn’t actually copy that SafeFileHandle into your own code) is to change the 0 to a -1.
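To make the suggested change concrete, here is a rough sketch of a dispose-pattern class that holds a placeholder SafeFileHandle. The class and member names are hypothetical and this is not the text of the docs sample; it only illustrates the "0 to -1" change, on the assumption that a handle value of 0 can refer to a real file descriptor on Unix and so may be closed when the placeholder is disposed.

```csharp
using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Hypothetical class modeled loosely on the common dispose-pattern sample.
public sealed class ResourceHolder : IDisposable
{
    // -1 is never a valid handle, so disposing this placeholder releases nothing.
    // A placeholder built from IntPtr.Zero is riskier: descriptor 0 can be a live
    // descriptor owned by other code in the same process.
    private readonly SafeHandle _handle =
        new SafeFileHandle(new IntPtr(-1), ownsHandle: true);
    private bool _disposed;

    // Exposed only so the placeholder's state can be inspected.
    public bool HandleIsInvalid => _handle.IsInvalid;

    public void Dispose()
    {
        if (_disposed) return;
        _handle.Dispose();
        _disposed = true;
    }
}
```

In real code the better option is usually to hold a SafeHandle only when the class actually owns a native resource, rather than keeping a placeholder at all.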
The docs are open source. Contributions to improve them are welcome.
Holy crap, my teammate solved it.
This was the entire fix:
I’m still in disbelief about it.
How? Why? How did this code function perfectly for 100s of requests and then suddenly stop, ruining all traffic to and from the service? Why would it sometimes recover? Why was it crazily intermittent? What even IS a SafeHandle? I’ve never used one, personally.
Why did this function on Linux at all if it literally says Win32 in the file name?! Why did it sometimes stop functioning?! Why was it even more intermittent back in .NET Core 2.2?!
But beyond the questions, what I have most right now is joy. I’m so happy to be done with this. Thank God.
Yes, this stacktrace doesn’t tell us much. We’re interested to find out who registered the first Socket for this fd.
Shouldn’t we persist the native error code instead of mapping it there and back, inevitably losing information? It seems like a nontrivial refactor, though, and requires a new SocketException constructor, or a change to the current ctor semantics: https://github.com/dotnet/runtime/blob/73ddf6e50e20a81492209d14588a05ee9a2b68d4/src/libraries/System.Net.Primitives/src/System/Net/SocketException.cs#L19-L31
Actually, this fragment from TryCompleteConnect would set the Unknown error regardless of the OS failure: https://github.com/dotnet/runtime/blob/1df36c702a46363336b6ea5d0d9558d513e7be8e/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs#L709-L714
It has existed for a very long time, but the path above with a failed Interop.Sys.Poll perhaps has not. I wish we had an error trace there, but we don’t. If there are experiments with an instrumented build, it would be nice to know whether we are hitting any of the error paths here, @antonfirsov.
Handle is already used by another Socket: I’ve implemented the stacktrace tracking in https://github.com/tmds/runtime/tree/socketcontext_stacktrace. Below is the System.Net.Sockets.dll, which you can place in a 6.0.101 SDK at shared/Microsoft.NETCore.App/6.0.1/System.Net.Sockets.dll. The exception message will include the stacktrace of the first registration.
System.Net.Sockets.dll.tar.gz
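The idea of reporting the first registration can be sketched in isolation as follows. This is a simplified, hypothetical stand-in (the HandleRegistry type and its members are invented for illustration) and not the code in the linked branch or in the runtime.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical registry: remembers the stack trace of whoever registered
// a handle first and reports it when a second registration is attempted.
public sealed class HandleRegistry
{
    private readonly ConcurrentDictionary<IntPtr, string> _firstRegistration = new();

    public void Register(IntPtr handle)
    {
        // TryAdd succeeds only for the first registration of a given handle.
        if (!_firstRegistration.TryAdd(handle, Environment.StackTrace))
        {
            _firstRegistration.TryGetValue(handle, out var first);
            throw new InvalidOperationException(
                "Handle is already used by another Socket. First registered at:" +
                Environment.NewLine + (first ?? "<unavailable>"));
        }
    }

    // Called when the owner closes the handle, so the value can be reused later.
    public void Unregister(IntPtr handle) => _firstRegistration.TryRemove(handle, out _);
}
```

Capturing Environment.StackTrace on every registration is expensive, which is one reason to gate this kind of tracking behind a diagnostic switch.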
About Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074.: Perhaps the agent makes the request that causes the socket error, so removing it gets rid of the socket error.
It doesn’t mean the agent does something wrong. It can still be an issue with .NET.
The weird NativeErrorCode suggests .NET has an issue understanding/dealing with an error that happens during connect.
There is no clear relation (for me) between newrelic/newrelic-dotnet-agent#803 and the unknown socket error.
@antonfirsov it doesn’t happen often, but when it does, it is a pain to debug. And when I see this, I wonder: can there still be a bug in SocketAsyncEngine…
Thinking out loud.
We could add some envvar which causes the SocketAsyncEngine to track, for each SocketAsyncContext, the Environment.StackTrace that does the registration. Then, when we throw this exception, we could include that stacktrace.
Or we extend the event source logging so it contains the StackTrace and the user can try to match the handle values.
About Handle is already used by another Socket: There is a related issue (https://github.com/dotnet/runtime/issues/56750, which may probably be closed). A blog post gets mentioned. In that case, the issue disappeared when removing a 3rd party library: https://zblesk.net/blog/aspnetcore-identity-litedb-breaks-on-ubuntu/.
cc @antonfirsov @karelz
About Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074.
Maybe you can make a small reproducer by making HttpClient perform a call against sqs.us-west-2.amazonaws.com:443.
Any chance you can run it under strace to see what is happening at OS level? The error will likely come from the kernel, so the base OS is probably more important than the container.
cc: @tmds
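A reproducer along the lines suggested above might look like the following sketch. The endpoint comes from the stack trace; the attempt count and the choice to force a fresh connection per request are assumptions, and there is no guarantee that this alone triggers the error.

```csharp
using System;
using System.Net.Http;
using System.Net.Sockets;

const string Url = "https://sqs.us-west-2.amazonaws.com/";
const int Attempts = 100; // arbitrary; raise it for a longer run

// Formats the same three properties quoted in the issue.
static string Describe(SocketException se) =>
    $"ErrorCode={se.ErrorCode}; SocketErrorCode={se.SocketErrorCode}; NativeErrorCode={se.NativeErrorCode}";

using var client = new HttpClient();
// Ask for the connection to be closed after each response so that every
// request has to go through a new TCP connect.
client.DefaultRequestHeaders.ConnectionClose = true;

for (int i = 0; i < Attempts; i++)
{
    try
    {
        // Only the connect step matters here, so the response is discarded.
        using HttpResponseMessage response = await client.GetAsync(Url);
    }
    catch (HttpRequestException ex) when (ex.InnerException is SocketException se)
    {
        Console.WriteLine($"attempt {i}: {Describe(se)}");
    }
}
```

Running the compiled program under strace with child-thread following and network syscall filtering enabled (for example, strace -f -e trace=network) should show which connect or poll call fails and with which errno.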