runtime: HttpStress: connection failures with HTTP 1.1 caused by Windows Firewall
Currently, the vast majority of our HttpStress failures (#42211) are HTTP 1.1 socket connection failures (socket error 10060 when running inside Docker, 10061 when running outside of Docker): https://dev.azure.com/dnceng/public/_build/results?buildId=1045441&view=logs&j=2d2b3007-3c5c-5840-9bb0-2b1ea49925f3&t=ac0b0b0f-051f-52f7-8fb3-a7e384b0dde9&l=1244 https://gist.github.com/antonfirsov/dc7af2d9cb4213c835a41f59f09a0775#file-type1-txt
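For context, a minimal sketch (not the stress client itself; the URL is a placeholder for the stress server endpoint) of how these Winsock error codes typically surface in .NET: `SocketsHttpHandler` wraps the failing connect in an `HttpRequestException` whose inner `SocketException` carries `SocketError.TimedOut` (10060) or `SocketError.ConnectionRefused` (10061).

```csharp
using System;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading.Tasks;

class ErrorCodeProbe
{
    static async Task Main()
    {
        using var client = new HttpClient(new SocketsHttpHandler());
        try
        {
            // "https://localhost:5001" is just a placeholder for the stress server URL.
            using var response = await client.GetAsync("https://localhost:5001/");
            Console.WriteLine($"Status: {response.StatusCode}");
        }
        catch (HttpRequestException ex) when (ex.InnerException is SocketException sockEx)
        {
            // 10060 == SocketError.TimedOut, 10061 == SocketError.ConnectionRefused
            Console.WriteLine($"Connect failed: {sockEx.SocketErrorCode} ({(int)sockEx.SocketErrorCode})");
        }
    }
}
```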
Update 1: I no longer think this is an issue with `TIME_WAIT`, see https://github.com/dotnet/runtime/issues/50854#issuecomment-825613725.
Update 2: Disabling the Windows Firewall seems to fix the issue
While running these tests, there are thousands of Kestrel sockets lingering in `TIME_WAIT`. This may lead to port exhaustion, which I believe explains the CI failures.
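As a quick way to quantify this from managed code (a rough equivalent of eyeballing `netstat -ano`; I believe `GetActiveTcpConnections()` includes `TIME_WAIT` entries on Windows), a small sketch:

```csharp
using System;
using System.Linq;
using System.Net.NetworkInformation;

// Count how many local TCP connections are currently sitting in TIME_WAIT.
class TimeWaitCounter
{
    static void Main()
    {
        TcpConnectionInformation[] connections =
            IPGlobalProperties.GetIPGlobalProperties().GetActiveTcpConnections();

        int timeWait = connections.Count(c => c.State == TcpState.TimeWait);
        Console.WriteLine($"Sockets in TIME_WAIT: {timeWait}");
    }
}
```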
It doesn’t look like the stress client’s `ClientOperations` are doing anything unconventional – the `HttpRequestMessage` instances seem to be properly disposed.
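For reference, this is the conventional request/response disposal shape being referred to (a sketch only, not the actual `ClientOperations` code):

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class ConventionalOperation
{
    // Send a single GET, disposing both the request and the response.
    public static async Task SendOnceAsync(HttpClient client, Uri uri, CancellationToken token)
    {
        using var request = new HttpRequestMessage(HttpMethod.Get, uri);
        using HttpResponseMessage response = await client.SendAsync(request, token);
        string body = await response.Content.ReadAsStringAsync(token);
        Console.WriteLine($"{(int)response.StatusCode}: {body.Length} bytes");
    }
}
```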
Things I tried (both knobs are sketched in the snippet after this list):
- Reducing `-cancelRate` to `0`, but it did not change the amount of lingering sockets
- (Unless I made a mistake in my experiment) After limiting `MaxConnectionsPerServer` to `100`, there were still thousands of lingering sockets
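A minimal sketch (not the actual stress harness) of where the two knobs above live: `SocketsHttpHandler.MaxConnectionsPerServer`, and a probabilistic cancellation analogous to `-cancelRate`. The URL and the timeout value are placeholders; the `100` and `0.0` are the values from the experiments.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ExperimentKnobs
{
    static readonly Random s_random = new Random();

    static async Task Main()
    {
        var handler = new SocketsHttpHandler
        {
            MaxConnectionsPerServer = 100 // value used in the second experiment
        };
        using var client = new HttpClient(handler);

        double cancelRate = 0.0; // first experiment: no cancellations at all

        using var cts = new CancellationTokenSource();
        if (s_random.NextDouble() < cancelRate)
        {
            cts.CancelAfter(TimeSpan.FromMilliseconds(10)); // crude stand-in for -cancelRate
        }

        try
        {
            // Placeholder URL standing in for the stress server endpoint.
            using var response = await client.GetAsync("https://localhost:5001/", cts.Token);
        }
        catch (OperationCanceledException)
        {
            // Cancelled requests typically force the connection to be torn down.
        }
    }
}
```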
I’m not familiar enough with the SocketsHttpHandler codebase and Kestrel (or web servers in general), so I can’t assess whether what I’m seeing is extreme for a stress scenario or not.
@wfurt @geoffkizer @scalablecory @halter73 @stephentoub thoughts?
@GrabYourPitchforks wrote that code a long time ago 😄
I think I have proof now that this is very unlikely to be an issue with `TIME_WAIT`:

- By reducing `cancelRate` and the amount of `GET Aborted` operations by an order of magnitude, only a few thousand sockets are left lingering, but I’m still seeing `WSAETIMEDOUT` (Docker) and `WSAECONNREFUSED` (outside of Docker). From what I see, the probability of those errors scales with the number of connection re-creations in a non-linear manner. They disappear when `cancelRate=0` and `P(GET Aborted)=0`, but even with a very small rate of connection re-creations there is a significant chance for them to happen.
- By increasing `cancelRate` to a very high value, while also setting `TcpTimedWaitDelay = 5 minutes` (sketched below), actual port exhaustion starts happening after a while, with the expected `WSAEADDRINUSE` error as a symptom.

The next step would be to collect ETW traces and packet captures from the test lab VM; however, this is going to be difficult chore work, since disk space there is limited, and the data we need to analyze is on the scale of several tens of GB.
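For reference, a sketch (not part of the stress suite, and shown only as an illustration of the experiment above) of applying the `TcpTimedWaitDelay = 5 minutes` setting: the value lives under the documented Tcpip parameters registry key, is expressed in seconds, needs administrator rights to write, and typically requires a reboot to take effect.

```csharp
using System;
using Microsoft.Win32; // Windows-only; Microsoft.Win32.Registry package on .NET (Core)

class TcpTimedWaitDelaySetter
{
    static void Main()
    {
        using RegistryKey? key = Registry.LocalMachine.OpenSubKey(
            @"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters", writable: true);

        if (key is null)
        {
            Console.WriteLine("Tcpip parameters key not found.");
            return;
        }

        // 300 seconds == 5 minutes, the value used in the experiment above.
        key.SetValue("TcpTimedWaitDelay", 300, RegistryValueKind.DWord);
        Console.WriteLine("TcpTimedWaitDelay set to 300 seconds (5 minutes).");
    }
}
```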
> It’s the operating system that keeps them open as part of the normal operation of TCP
Yes, but I wouldn’t pollute the code with hacks specific to `GET Aborted`; instead, I’d introduce general logic that allows assigning weights to operations, as described in point 2 of https://github.com/dotnet/runtime/issues/50854#issuecomment-817803800.

+1 to what @alnikola said.
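A sketch of the kind of general weighting logic being referred to: each client operation gets a weight and is picked proportionally, so `GET Aborted` can simply be given a smaller weight instead of being special-cased. The names here are illustrative, not the actual HttpStress types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class WeightedOperationPicker
{
    private readonly (string Name, double Weight)[] _operations;
    private readonly double _totalWeight;
    private readonly Random _random = new Random();

    public WeightedOperationPicker(IEnumerable<(string Name, double Weight)> operations)
    {
        _operations = operations.ToArray();
        _totalWeight = _operations.Sum(op => op.Weight);
    }

    // Pick an operation with probability proportional to its weight.
    public string PickNext()
    {
        double roll = _random.NextDouble() * _totalWeight;
        foreach ((string name, double weight) in _operations)
        {
            if (roll < weight) return name;
            roll -= weight;
        }
        return _operations[^1].Name; // guard against floating-point rounding
    }
}

// Usage: give "GET Aborted" a much smaller weight than the other operations.
// var picker = new WeightedOperationPicker(new[]
// {
//     ("GET", 1.0), ("POST", 1.0), ("GET Aborted", 0.1)
// });
// string op = picker.PickNext();
```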
Also: re 1, I think it’s fine to reduce the cancel rate even further, like just make it 5% or whatever. We should still get plenty of coverage of cancellation, and having a bit of extra room here is probably a good thing.
As I understand it, the side that closes the socket moves to `TIME_WAIT` and the other side moves to `CLOSE_WAIT`; so if the server has lots of `TIME_WAIT`, it would suggest either lots of bad requests that the server is terminating, or a non-clean shutdown on the client (e.g. just exiting the process without closing the socket normally). Hence why I was asking if it was being disposed, though it looks like it is (not sure what the underlying handler does when the client is disposed, though).