runtime: HttpStress: connection failures with HTTP 1.1 caused by Windows Firewall
Currently, the vast majority of our HttpStress failures (#42211) are HTTP 1.1 socket connection failures (socket error 10060 when running inside Docker, 10061 when running outside of Docker): https://dev.azure.com/dnceng/public/_build/results?buildId=1045441&view=logs&j=2d2b3007-3c5c-5840-9bb0-2b1ea49925f3&t=ac0b0b0f-051f-52f7-8fb3-a7e384b0dde9&l=1244 https://gist.github.com/antonfirsov/dc7af2d9cb4213c835a41f59f09a0775#file-type1-txt
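For context, a minimal sketch (not the stress client itself; the URL is a placeholder for the stress server endpoint) of how these Winsock error codes typically surface in .NET: `SocketsHttpHandler` wraps the failing connect in an `HttpRequestException` whose inner `SocketException` carries `SocketError.TimedOut` (10060) or `SocketError.ConnectionRefused` (10061).

```csharp
using System;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading.Tasks;

class ErrorCodeProbe
{
    static async Task Main()
    {
        using var client = new HttpClient(new SocketsHttpHandler());
        try
        {
            // "https://localhost:5001" is just a placeholder for the stress server URL.
            using var response = await client.GetAsync("https://localhost:5001/");
            Console.WriteLine($"Status: {response.StatusCode}");
        }
        catch (HttpRequestException ex) when (ex.InnerException is SocketException sockEx)
        {
            // 10060 == SocketError.TimedOut, 10061 == SocketError.ConnectionRefused
            Console.WriteLine($"Connect failed: {sockEx.SocketErrorCode} ({(int)sockEx.SocketErrorCode})");
        }
    }
}
```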
Update 1: I no longer think this is an issue with `TIME_WAIT`, see https://github.com/dotnet/runtime/issues/50854#issuecomment-825613725.
Update 2: Disabling the Windows Firewall seems to fix the issue
While running these tests, there are thousands of Kestrel sockets lingering in `TIME_WAIT`. This may lead to port exhaustion, which I believe explains the CI failures.
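As a quick way to quantify this from managed code (a rough equivalent of eyeballing `netstat -ano`; I believe `GetActiveTcpConnections()` includes `TIME_WAIT` entries on Windows), a small sketch:

```csharp
using System;
using System.Linq;
using System.Net.NetworkInformation;

// Count how many local TCP connections are currently sitting in TIME_WAIT.
class TimeWaitCounter
{
    static void Main()
    {
        TcpConnectionInformation[] connections =
            IPGlobalProperties.GetIPGlobalProperties().GetActiveTcpConnections();

        int timeWait = connections.Count(c => c.State == TcpState.TimeWait);
        Console.WriteLine($"Sockets in TIME_WAIT: {timeWait}");
    }
}
```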
It doesn’t look like the stress client’s `ClientOperations` are doing anything unconventional – the `HttpRequestMessage` instances seem to be properly disposed.
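For reference, this is the conventional request/response disposal shape being referred to (a sketch only, not the actual `ClientOperations` code):

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class ConventionalOperation
{
    // Send a single GET, disposing both the request and the response.
    public static async Task SendOnceAsync(HttpClient client, Uri uri, CancellationToken token)
    {
        using var request = new HttpRequestMessage(HttpMethod.Get, uri);
        using HttpResponseMessage response = await client.SendAsync(request, token);
        string body = await response.Content.ReadAsStringAsync(token);
        Console.WriteLine($"{(int)response.StatusCode}: {body.Length} bytes");
    }
}
```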
Things I tried (both knobs are sketched in the snippet after this list):
- Reducing `-cancelRate` to `0`, but it did not change the amount of lingering sockets
- (Unless I made a mistake in my experiment) After limiting `MaxConnectionsPerServer` to `100`, there were still thousands of lingering sockets
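A minimal sketch (not the actual stress harness) of where the two knobs above live: `SocketsHttpHandler.MaxConnectionsPerServer`, and a probabilistic cancellation analogous to `-cancelRate`. The URL and the timeout value are placeholders; the `100` and `0.0` are the values from the experiments.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ExperimentKnobs
{
    static readonly Random s_random = new Random();

    static async Task Main()
    {
        var handler = new SocketsHttpHandler
        {
            MaxConnectionsPerServer = 100 // value used in the second experiment
        };
        using var client = new HttpClient(handler);

        double cancelRate = 0.0; // first experiment: no cancellations at all

        using var cts = new CancellationTokenSource();
        if (s_random.NextDouble() < cancelRate)
        {
            cts.CancelAfter(TimeSpan.FromMilliseconds(10)); // crude stand-in for -cancelRate
        }

        try
        {
            // Placeholder URL standing in for the stress server endpoint.
            using var response = await client.GetAsync("https://localhost:5001/", cts.Token);
        }
        catch (OperationCanceledException)
        {
            // Cancelled requests typically force the connection to be torn down.
        }
    }
}
```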
I’m not familiar enough with the SocketsHttpHandler codebase and Kestrel (or web servers in general), so I can’t assess whether what I’m seeing is extreme for a stress scenario or not.
@wfurt @geoffkizer @scalablecory @halter73 @stephentoub thoughts?
@GrabYourPitchforks wrote that code a long time ago 😄
I think I have proof now that this is very unlikely to be an issue with `TIME_WAIT`:

- By reducing `cancelRate` and the amount of `GET Aborted` operations by an order of magnitude, only a few thousand sockets are left lingering, but I’m still seeing `WSAETIMEDOUT` (Docker) and `WSAECONNREFUSED` (outside of Docker). From what I see, the probability of those errors scales with the number of connection re-creations in a non-linear manner. They disappear when `cancelRate=0` and `P(GET Aborted)=0`, but even with a very small rate of connection re-creations there is a significant chance for them to happen.
- By increasing `cancelRate` to a very high value, while also setting `TcpTimedWaitDelay = 5 minutes` (sketched below), actual port exhaustion starts happening after a while, with the expected `WSAEADDRINUSE` error as a symptom.

The next step would be to collect ETW traces and packet captures from the test lab VM; however, this is going to be difficult chore work, since disk space there is limited, and the data we need to analyze is on the scale of several tens of GB.
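For reference, a sketch (not part of the stress suite, and shown only as an illustration of the experiment above) of applying the `TcpTimedWaitDelay = 5 minutes` setting: the value lives under the documented Tcpip parameters registry key, is expressed in seconds, needs administrator rights to write, and typically requires a reboot to take effect.

```csharp
using System;
using Microsoft.Win32; // Windows-only; Microsoft.Win32.Registry package on .NET (Core)

class TcpTimedWaitDelaySetter
{
    static void Main()
    {
        using RegistryKey? key = Registry.LocalMachine.OpenSubKey(
            @"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters", writable: true);

        if (key is null)
        {
            Console.WriteLine("Tcpip parameters key not found.");
            return;
        }

        // 300 seconds == 5 minutes, the value used in the experiment above.
        key.SetValue("TcpTimedWaitDelay", 300, RegistryValueKind.DWord);
        Console.WriteLine("TcpTimedWaitDelay set to 300 seconds (5 minutes).");
    }
}
```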
> It’s the operating system that keeps them open as part of the normal operation of TCP
Yes, but I wouldn’t pollute the code with hacks specific to `GET Aborted`; instead, I’d introduce general logic that allows assigning weights to operations, as described in point 2 of https://github.com/dotnet/runtime/issues/50854#issuecomment-817803800.

+1 to what @alnikola said.
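A sketch of the kind of general weighting logic being referred to: each client operation gets a weight and is picked proportionally, so `GET Aborted` can simply be given a smaller weight instead of being special-cased. The names here are illustrative, not the actual HttpStress types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class WeightedOperationPicker
{
    private readonly (string Name, double Weight)[] _operations;
    private readonly double _totalWeight;
    private readonly Random _random = new Random();

    public WeightedOperationPicker(IEnumerable<(string Name, double Weight)> operations)
    {
        _operations = operations.ToArray();
        _totalWeight = _operations.Sum(op => op.Weight);
    }

    // Pick an operation with probability proportional to its weight.
    public string PickNext()
    {
        double roll = _random.NextDouble() * _totalWeight;
        foreach ((string name, double weight) in _operations)
        {
            if (roll < weight) return name;
            roll -= weight;
        }
        return _operations[^1].Name; // guard against floating-point rounding
    }
}

// Usage: give "GET Aborted" a much smaller weight than the other operations.
// var picker = new WeightedOperationPicker(new[]
// {
//     ("GET", 1.0), ("POST", 1.0), ("GET Aborted", 0.1)
// });
// string op = picker.PickNext();
```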
Also: re 1, I think it’s fine to reduce the cancel rate even further, like just make it 5% or whatever. We should still get plenty of coverage of cancellation, and having a bit of extra room here is probably a good thing.
As I understand it, the side that closes the socket moves to `TIME_WAIT` and the other side moves to `CLOSE_WAIT`; so if the server has lots of `TIME_WAIT`, it would suggest either lots of bad requests that the server is terminating, or a non-clean shutdown on the client (e.g. just exiting the process without closing the socket normally). Hence why I was asking if it was being disposed, though it looks like it is (not sure what the underlying handler does when the client is disposed, though).