runtime: Change SocketsHttpHandler's default ConnectTimeout to a reasonable finite value

Windows will time out a connection attempt after ~21s. Linux apparently waits much longer – around 3 minutes or more.

The connection pooling changes in 6.0 made it so that a pending connection is not associated with a specific request. As such, request cancellation no longer causes pending connections to be canceled. We do not impose any connect timeout by default (DefaultConnectTimeout = Timeout.InfiniteTimeSpan), so we are now exposed to Linux's long connection timeout. With small request timeouts, this can cause a large number of requests to time out as they wait for, and try to reuse, the same "pending" connection.
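A minimal repro sketch of that failure mode (assumptions: 10.255.255.1 stands in for an address that blackholes SYN packets, and the 2-second per-request timeout and request count are arbitrary):

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

var client = new HttpClient(new SocketsHttpHandler()); // ConnectTimeout is infinite by default
var sw = Stopwatch.StartNew();

// Several concurrent requests with short per-request timeouts. Because the
// pool sees a pending connection and queues later requests behind it, one
// stalled connect can time out all of them, not just the one that spawned it.
var tasks = Enumerable.Range(0, 5).Select(async i =>
{
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
    try
    {
        using var response = await client.GetAsync("http://10.255.255.1/", cts.Token);
    }
    catch (OperationCanceledException)
    {
        Console.WriteLine($"request {i} timed out after {sw.Elapsed.TotalSeconds:F1}s");
    }
});
await Task.WhenAll(tasks);
```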

We should set a default ConnectTimeout to something reasonably small. We should also consider backporting that to 6.0.

I propose to unify Windows and Linux timeouts, i.e. set DefaultConnectTimeout to 21s.
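Independent of the default change, callers can already opt in today; a minimal sketch (21s here simply mirrors the Windows value discussed above):

```csharp
using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // Replace the infinite default with a finite connect timeout.
    ConnectTimeout = TimeSpan.FromSeconds(21)
};
using var client = new HttpClient(handler);
```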

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 24 (24 by maintainers)

Most upvoted comments

As I mentioned offline, I continue to think that this is working around a symptom rather than addressing the root of the problem.

The core problem stems from a change made in .NET 6. When a request arrives, there's no connection immediately available in the pool, and no MaxConnectionsPerServer limit has been reached, a new connection is created.

In .NET 5 and earlier, that connection was dedicated to this specific request and vice versa: this request would be made on this connection and only this connection once it was ready. The downside to this is that if it took a while to establish that connection and another connection became available in the meantime, the request would continue waiting for the connection it spawned. The upside, however, is that if the connection took a long time to be established (or took so long that it timed out), the only request it would harm (again assuming no MaxConnectionsPerServer limit reached) is the one request with which it was associated.

In .NET 6, this behavior was changed, improving the downside but harming the upside. When a request arrives and there's no connection available (including one that's pending), a new one is still immediately spawned, but it's no longer associated with only that request, and that request is no longer associated only with it. If another connection becomes available in the meantime, that newly-available connection can service the request. That improves the aforementioned downside. But now we have this pending connection that's been spawned, and when the next request arrives, it'll see there's already a connection pending and will try to use that one rather than spawning another. If this connection never becomes usable for whatever reason, any requests that associate with it will effectively stall as well, such that this one request can now take out multiple requests rather than just one.

The discussed ConnectTimeout change is an attempt to minimize the impact of this by lessening the default time that connection might be pending, and thus lessening the number of requests that associate with it, but there are other ways to address this. Note as well that a relatively recent .NET 7 change made it so that when the connection eventually fails, it only fails the request that originally spawned it; if that request has already moved on, completed, or been canceled, the connection's failure ends up being silent (other than logging).

For example, we could change the pooling to avoid considering a pending connection as one that can be used for an incoming request; if a new request arrives and there’s no immediately usable connection to service the request, create a new one. If we’re worried about a backlog of pending connections to a bad server building up, we can have some internal limit on how many such pending connections are allowed. (If there is a MaxConnectionsPerServer and it’s been reached, then we wait just as we always did). Instead or in addition, we could also lower the connection timeout once the connection is no longer associated with the original request, since it’s not going to fail anyone else anyway.
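To make the proposal concrete, here is a heavily simplified sketch of that policy (assumptions: SketchPool, Connection, and the cap are all hypothetical and bear no relation to HttpConnectionPool's real internals):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Conceptual sketch only: a request never joins someone else's pending
// connect, and pending connects to a single server are capped.
public sealed class SketchPool
{
    public sealed record Connection(string Id);

    private readonly ConcurrentQueue<Connection> _idle = new();
    private readonly SemaphoreSlim _pendingCap; // hypothetical internal limit
    private readonly Func<CancellationToken, Task<Connection>> _connectAsync;

    public SketchPool(int maxPendingConnects, Func<CancellationToken, Task<Connection>> connectAsync)
    {
        _pendingCap = new SemaphoreSlim(maxPendingConnects);
        _connectAsync = connectAsync;
    }

    public async Task<Connection> RentAsync(CancellationToken cancellationToken)
    {
        // Hand out only connections that are already usable, so a stalled
        // connect can hurt at most the one request that spawned it.
        if (_idle.TryDequeue(out Connection? connection))
            return connection;

        // Otherwise open a connection dedicated to this request, with the
        // cap bounding how many attempts a bad server can accumulate.
        await _pendingCap.WaitAsync(cancellationToken);
        try
        {
            return await _connectAsync(cancellationToken);
        }
        finally
        {
            _pendingCap.Release();
        }
    }

    public void Return(Connection connection) => _idle.Enqueue(connection);
}
```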

Triage: we should address the issue with long OS timeouts in 7.0 in one way or another (retaining the old exception type, tuning OS timeouts, reworking pooling logic in some way).

SocketException is a subtype of Win32Exception, a type that indicates the occurrence of a low-level error reported by the OS. Throwing such an exception when there is no low-level error directly linked to the request is not what I would expect from a framework. This would look even weirder when ConnectCallback is overridden in a way that no sockets are involved at all.

Which exception to throw to be consistent with our stack – let's look at recent changes we made in this space and align with them. We may decide to follow that pattern (TaskCanceledException with inner TimeoutException), or a fake SocketException with inner TimeoutException, or start throwing TimeoutException directly.

We already follow the pattern "TaskCanceledException with inner TimeoutException" in both places: for the full-request timeout https://github.com/dotnet/runtime/blob/215b39abf947da7a40b0cb137eab4bceb24ad3e3/src/libraries/System.Net.Http/src/System/Net/Http/HttpClient.cs#L607-L612 and for the connect timeout https://github.com/dotnet/runtime/blob/975aa549b689f10aa76addc259a0dba36419c14d/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs#L435-L443

There was a huge discussion in https://github.com/dotnet/runtime/issues/21965 on what is better to do on timeouts, which ended up introducing this specific pattern. In general, we didn't want to change the exception type, as historically TCE was thrown; we just added a way to programmatically check whether it was a timeout (by putting a TimeoutException inside).
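For reference, this is what the existing pattern looks like from the caller's side (the URL and timeout value are placeholders):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

using var client = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };
try
{
    using var response = await client.GetAsync("https://example.com/");
}
catch (TaskCanceledException tce) when (tce.InnerException is TimeoutException)
{
    // Timeout: the exception type stays TCE for compatibility, and the inner
    // TimeoutException is the programmatic "this was a timeout" signal.
    Console.WriteLine("Request timed out.");
}
catch (TaskCanceledException)
{
    Console.WriteLine("Request was canceled by the caller.");
}
```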

Re mimicking SocketException, it is a bit funny given that we make sure we throw OCE and not SocketException on cancellation in Sockets 😄 https://github.com/dotnet/runtime/blob/975aa549b689f10aa76addc259a0dba36419c14d/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.Tasks.cs#L120-L124 There's no way at the Socket level to say that the token signifies a timeout, though. I'm also not sure it would be OK to break the existing behavior of ConnectTimeout; e.g., someone might depend on TCE being thrown for it.

Would changing the ConnectTimeout to something lower than the OS's timeout change the behavior so that we may no longer end up trying different IPs?

Only in the case of Windows. On Linux we're not losing anything, since the OS timeout ends up being ~2 minutes. The user can always change it back to infinite if this proves problematic. They can also implement their own ConnectCallback.
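A sketch of such a callback (assumptions: the 5-second per-attempt budget is arbitrary, and sequential fallback across resolved IPs is a simplification of what the built-in connect does):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading;

var handler = new SocketsHttpHandler
{
    ConnectCallback = async (context, cancellationToken) =>
    {
        // Resolve all addresses up front so we control the fallback across
        // IPs instead of relying on a single OS-level connect attempt.
        IPAddress[] addresses = await Dns.GetHostAddressesAsync(
            context.DnsEndPoint.Host, cancellationToken);

        TimeSpan perAttemptTimeout = TimeSpan.FromSeconds(5); // arbitrary budget

        Exception? lastError = null;
        foreach (IPAddress address in addresses)
        {
            cancellationToken.ThrowIfCancellationRequested();

            var socket = new Socket(address.AddressFamily, SocketType.Stream, ProtocolType.Tcp)
            {
                NoDelay = true
            };
            using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            timeoutCts.CancelAfter(perAttemptTimeout);
            try
            {
                await socket.ConnectAsync(
                    new IPEndPoint(address, context.DnsEndPoint.Port), timeoutCts.Token);
                return new NetworkStream(socket, ownsSocket: true);
            }
            catch (Exception ex)
            {
                // This attempt failed or timed out; dispose and try the next IP.
                socket.Dispose();
                lastError = ex;
            }
        }

        throw lastError ?? new SocketException((int)SocketError.HostNotFound);
    }
};
using var client = new HttpClient(handler);
```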

Eventually we could do something like happy eyeballs and try to connect to multiple IPs in parallel (related https://github.com/dotnet/runtime/issues/26177).