runtime: HttpRequestException when using SocketsHttpHandler
Hi all,
we had to drop the SocketsHttpHandler
(https://github.com/ravendb/ravendb/commit/9ad7c41233fca1e3f16617ad0eadf191251c0e88)
because for quite some time we have been receiving random test failures on our CI (Jenkins) agents (Hyper-V machines, 6 vCPUs, 6 GB RAM, on WS2012R2 and WS2016 - both fully updated). We tried to reproduce this on our local dev machines, but without any luck. After turning off SocketsHttpHandler, the random failures stopped showing up.
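One way to opt out of SocketsHttpHandler on .NET Core 2.1 is the AppContext switch below (a sketch for context; the linked commit may disable the handler through a different mechanism):

```csharp
using System;
using System.Net.Http;

// Must run before the first HttpClient/HttpClientHandler is created.
// On .NET Core 2.1 this makes HttpClient fall back to the platform
// handler (WinHttpHandler on Windows, CurlHandler elsewhere).
AppContext.SetSwitch("System.Net.Http.UseSocketsHttpHandler", false);

var client = new HttpClient(); // backed by the legacy handler on 2.1

Console.WriteLine(
    AppContext.TryGetSwitch("System.Net.Http.UseSocketsHttpHandler", out var on) && !on); // True
```

The same opt-out is, to my knowledge, also exposed as the environment variable DOTNET_SYSTEM_NET_HTTP_USESOCKETSHTTPHANDLER=0, which avoids a code change.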
The exception we are getting is:
System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
--- End of inner exception stack trace ---
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in c:\\Jenkins\\workspace\\PR_Tests_v4.1\\s\\src\\Raven.Client\\Http\\RequestExecutor.cs:line 738
We are using the latest .NET Core, 2.1.1 (2.1.0 was also affected).
Do you know if there are some problems in SocketsHttpHandler that could cause this? What can we do to help you solve this issue? Any pointers?
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 14
- Comments: 114 (82 by maintainers)
Commits related to this issue
- Skip falky tests caused by https://github.com/dotnet/corefx/issues/30691 — committed to aspnet/KestrelHttpServer by deleted user 6 years ago
- RavenDB-12279 need to skip tests because they are causing random socket failures in subsequent tests around 30 seconds after they are executed (https://github.com/dotnet/corefx/issues/30691) — committed to iftahbe/ravendb by ppekrol 6 years ago
- RavenDB-12279 need to skip tests because they are causing random socket failures in subsequent tests around 30 seconds after they are executed (https://github.com/dotnet/corefx/issues/30691) — committed to ravendb/ravendb by ppekrol 6 years ago
@stephentoub Is there anything here that we can look at? To reproduce this we need to schedule more than 5 runs of 4,000+ tests, which takes over an hour. In such a scenario we typically see 1-3 such failures using SocketsHttpHandler.
Note that we never use SocketsHttpHandler directly, so I don't think that we are using it improperly. We go through HttpClient, a shared instance that is reused among threads, as recommended.
The code is actually doing something like:
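The snippet referenced here did not survive the scrape; the sketch below is a hypothetical reconstruction (RequestSender and StubHandler are invented names), with the handler made injectable so the wrapper can be exercised without a network:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

var sender = new RequestSender(new StubHandler());
var response = sender.SendAsync(new HttpRequestMessage(HttpMethod.Get, "http://localhost/"))
                     .GetAwaiter().GetResult();
Console.WriteLine(response.StatusCode); // OK

// A single HttpClient shared across callers, with the send wrapped so the
// HttpRequestException can be caught for state checks / logging.
public sealed class RequestSender
{
    private readonly HttpClient _client;

    // The handler is injectable so the wrapper runs without a network.
    public RequestSender(HttpMessageHandler handler) => _client = new HttpClient(handler);

    public async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request)
    {
        try
        {
            return await _client.SendAsync(request);
        }
        catch (HttpRequestException ex)
        {
            // Check state / log diagnostics here before rethrowing.
            Console.WriteLine($"Send failed: {ex.InnerException?.Message}");
            throw;
        }
    }
}

// Stand-in for the network so the sketch runs anywhere.
public sealed class StubHandler : HttpMessageHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
        => Task.FromResult(new HttpResponseMessage(HttpStatusCode.OK));
}
```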
So we can catch the exception and then check state / logging / something.
Anything you can recommend that we’ll look at to get to the root cause?
The bar for servicing fixes is very high. You’re suggesting adding unexposed, unvalidated, functionality-accessible-only-via-reflection-and-that-we-encourage-developers-to-use-via-reflection as a servicing fix. Hence “unlikely”. If you need it immediately, you can of course make the change in your own local fork.
Let’s get to the bottom of the problem and then figure out the right fix.
cc: @danmosemsft
I was finally able to collect some useful System.Net traces using perfview. So far I’ve only been able to confirm the observations made earlier in this thread: It’s a socket abort, but the socket is not aborted in the test code, and the socket has not been open long enough for it to be a timeout.
I’m going to dig into the logs in more depth to see if I can gather any new info, but may end up having to enable some additional logging in order to learn anything new.
This issue drove me crazy for a really long time. In case it helps anyone, I added some retries with Polly:
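The Polly snippet itself was lost in the scrape. As a stand-in, here is a dependency-free helper with the same shape as a Policy.Handle&lt;HttpRequestException&gt;().WaitAndRetryAsync(...) policy (Retry.WithRetriesAsync is an invented name, and the original retry counts and delays are unknown):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

var attempts = 0;
var value = await Retry.WithRetriesAsync(
    () => ++attempts < 2
        ? throw new HttpRequestException("transient socket abort")
        : Task.FromResult("ok"),
    baseDelayMs: 1);
Console.WriteLine($"{value} after {attempts} attempts"); // ok after 2 attempts

public static class Retry
{
    // Retries when an HttpRequestException escapes, waiting a little
    // longer before each attempt, then rethrows once retries run out.
    public static async Task<T> WithRetriesAsync<T>(
        Func<Task<T>> operation, int maxRetries = 3, int baseDelayMs = 200)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (HttpRequestException) when (attempt < maxRetries)
            {
                await Task.Delay(baseDelayMs * (attempt + 1));
            }
        }
    }
}
```

As the maintainers note further down, a retry like this most likely masks the symptom rather than fixing the underlying socket abort.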
This seems to have fixed it for me - I haven’t had any random exceptions during local development or emails from the production server for several weeks now 👍
@stephentoub I see you moved it to Future, while we had it tracking for servicing 2.1/2.2 (and 3.0). Was there new info about the scenario impact? AFAIK we had 2-3 independent customers affected.
Hi @stephentoub
Is this issue solved? We are facing the same error. This happens with multiple threads. Works fine with single thread. Any help will be highly appreciated. Thanks
I haven’t but it’s on my list for some point…
If that actually “fixes” it, I have trouble believing it’s anything other than masking a symptom, and we should actually find the root cause and fix that.
Unless we believe the root cause really is that the socket handle is getting inherited and somehow that child process then closing is causing problems for the socket? In which case we should be able to repro that pretty easily, and if that’s actually the problem, then I agree it would be a candidate for back porting.
@rmkerr
Interesting development. We encountered this issue after adding to our test suite a test that spawns a process running a PowerShell script (a very simple one - writing one line to a file). ~30 seconds after this test, 1-3 random subsequent tests fail.
Because of that we thought this might be related to processes inheriting handles from the parent one, and it appears that we are right.
We have made a PR that unskips those tests and switches to a Process implementation that sets 'inherit handles' to 'false'. After doing so everything seems to be fine; it looks like the issue is with handle inheritance and the behavior of those handles after the child process ends.
The PR in question is here: https://github.com/ravendb/ravendb/pull/8028
Our idea is that, at least as a workaround, it would be good to be able to set https://github.com/dotnet/corefx/blob/f798c1a103c9b7c899ea43e0d978f7e1f7458473/src/System.Diagnostics.Process/src/System/Diagnostics/Process.Windows.cs#L575 to 'false', maybe even by setting some private field inside the Process class via reflection.
You can use Kestrel’s master branch which targets 3.0. You’ll probably want to use a commit prior to https://github.com/aspnet/KestrelHttpServer/commit/2a610ee1b8d00eed8afe77252c966ecc1e695272 where I moved the h2spec tests to a new test group/project. This seems to prevent repro or at least reduce the likelihood of it occurring.
We are seeing the exact same thing when running our applications on Linux. A temporary "workaround" for us has been to set the timeout to 100 seconds (it's not an acceptable workaround).
When we have the timeout set to something more reasonable, like 10 seconds, we get the error roughly every 500 requests.
The HTTP endpoint we are calling is under our control, and when the error happens, we can't even see that the request has reached the service. So it seems like the request doesn't leave the box. We are using .NET Core 2.1.
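For context, the timeout being discussed is HttpClient.Timeout, whose default is already 100 seconds, so the 100-second "workaround" amounts to not lowering it:

```csharp
using System;
using System.Net.Http;

// A tighter per-request budget like the 10-second one mentioned above:
var strict = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };

// The built-in default is 100 seconds:
Console.WriteLine(new HttpClient().Timeout); // 00:01:40
```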
I've started a stopwatch for each request, and I'm catching TaskCanceledException and logging the following (this is when we have the timeout set to 15 seconds):
Task was cancelled after 15000ms System.Threading.Tasks.TaskCanceledException: The operation was canceled. ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled. ---> System.Net.Sockets.SocketException: Operation canceled
--- End of inner exception stack trace ---
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
at System.Net.Security.SslStreamInternal.<FillBufferAsync>g__InternalFillBufferAsync|38_0[TReadAdapter](TReadAdapter adap, ValueTask`1 task, Int32 min, Int32 initial)
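A minimal sketch of that measurement (TimedRequest.RunAsync is an invented wrapper name; the real code times an HttpClient call rather than an arbitrary delegate):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

var logLine = await TimedRequest.RunAsync(
    () => Task.FromException(new TaskCanceledException("The operation was canceled.")));
Console.WriteLine(logLine);

public static class TimedRequest
{
    // Times the request delegate and, when the client cancels it,
    // reports how long it actually ran before the cancellation.
    public static async Task<string> RunAsync(Func<Task> send)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            await send();
            return $"Completed after {sw.ElapsedMilliseconds}ms";
        }
        catch (TaskCanceledException ex)
        {
            // This produces the log line quoted above.
            return $"Task was cancelled after {sw.ElapsedMilliseconds}ms {ex}";
        }
    }
}
```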
@JunTaoLuo no update yet. Our work was interrupted by other high-priority items, but this is now near the top of @rmkerr's list.
It’s definitely possible that the issues are related, but looks like aspnet/KestrelHttpServer#2789 manifests very differently than the issue tracked here. If I’m not able to get any useful data from the current repros it may be worth revisiting though.
I’ve been trying to figure out whether the Kestrel test failure repros on its own while the CPU is under load, or if it also requires some interaction with one of the other test classes that run in parallel. I don’t have an answer yet, but I have time to focus on this issue today and will report back when I know more.
I don’t see how https://github.com/aspnet/KestrelHttpServer/issues/2789 is related. Maybe I am missing something.
I’m finally able to repro this consistently. Thanks for the help 😃
I’m going to start with collecting System.Net logs, and see what I can work out from there.
It started happening all of a sudden. First we thought that it might be related to the 2.0 -> 2.1 migration, but in the end we think it is related to SocketsHttpHandler and RavenDB_7787.cs only.
Check my explanation here: https://github.com/dotnet/corefx/issues/30691#issuecomment-401775334
@karelz
I'm not sure what is needed to run from 'master', but if you want to update <RuntimeFrameworkVersion> in the csprojs (all of them), then the best option would be to run:
Also, our CI executes the tests in 'Release'.
We have some more information (still no direct repro).
We have managed to find out exactly which test is causing those errors. It is RavenDB_7787 - a test that spawns a separate process with our Server and performs some basic CRUD operations on it.
The Server is started with a random port (we are passing http://127.82.86.78:0) and from our CI logs we can see that the address is generated correctly (Embedded URL: http://127.82.86.78:60101). What is worth noting is that we kill this Server and the whole external process in a bit of a rude way - using process.Kill.
After this test completes, 20-30 seconds later, a few tests fail with the exception that I wrote about in the issue. Those tests connect to a global server that is available to the whole test suite (and this server runs on a different URL and port). After a few (2-3) of those failures, the tests start working normally again.
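As an aside, the "random port" mechanism is standard OS behavior: binding to port 0 makes the OS pick a free ephemeral port, as in this sketch (loopback is used here instead of the test's specific 127.x address):

```csharp
using System;
using System.Net;
using System.Net.Sockets;

// Binding to port 0 asks the OS for a free ephemeral port - this is how
// http://127.82.86.78:0 becomes e.g. http://127.82.86.78:60101.
var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
listener.Listen(1);
Console.WriteLine(((IPEndPoint)listener.LocalEndPoint!).Port > 0); // True
listener.Dispose();
```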
It looks like something is going on 20-30 seconds after the process was killed, and for some time it affects other socket connections.
This is our CI log (the 7787 test turns on a switch that writes URLs and HTTP methods to the console as a sanity check - to know whether we are hitting the appropriate server). The code is available here: https://github.com/ppekrol/ravendb/commits/v4.1-temp4
FYI, skipping this test resolves the issue.