go: x/build,cmd/go: frequent connection timeouts to github.com since 2022-03-04
#!watchflakes
post <- pkg ~ `cmd/go` && `github.com.*(Connection timed out|TLS connection)` && date < "2023-07-01"
# cd $WORK/tmp/d2/src/github.com/myitcv; git clone -- https://github.com/myitcv/vgo_example_compat $WORK/tmp/d2/src/github.com/myitcv/vgo_example_compat
Cloning into '$WORK/tmp/d2/src/github.com/myitcv/vgo_example_compat'...
fatal: unable to access 'https://github.com/myitcv/vgo_example_compat/': Failed to connect to github.com port 443: Connection timed out
greplogs --dashboard -md -l -e 'github.com.* Connection timed out' --since=2021-01-01
2022-04-22T19:02:29-808d40d/linux-386-longtest 2022-04-22T15:44:57-1899472/linux-386-longtest 2022-04-22T04:42:23-1e59876/linux-386-longtest 2022-04-22T00:25:08-c9031a4/linux-amd64-longtest 2022-04-22T00:05:54-c510cd9/linux-386-longtest 2022-04-22T00:05:54-c510cd9/linux-amd64-longtest 2022-04-19T15:07:49-caa4631/linux-amd64-longtest 2022-04-15T19:02:54-df08c9a/linux-386-longtest 2022-04-14T18:00:13-dd97871/linux-386-longtest 2022-04-14T18:00:13-dd97871/linux-amd64-longtest 2022-04-13T17:48:12-517781b/linux-386-longtest 2022-04-13T01:15:22-b55a2fb/linux-386-longtest 2022-04-12T22:35:20-fc1d4c1/linux-amd64-longtest 2022-04-12T22:32:01-6183920/linux-386-longtest 2022-04-12T05:46:57-2b31abc/linux-amd64-longtest 2022-04-11T16:31:45-e4e033a/linux-386-longtest 2022-04-11T16:31:43-036b615/linux-amd64-longtest 2022-04-11T16:31:33-494b79f/linux-amd64-longtest 2022-04-08T18:30:53-3a19102/linux-386-longtest 2022-03-15T17:04:57-9b112ce/linux-amd64-longtest 2022-03-07T13:47:51-0e2f1ab/linux-386-longtest 2022-03-04T20:02:41-2b8aa2b/linux-386-longtest [Note the 5-month gap here! š¤] 2021-10-08T14:08:12-5b9206f/linux-386-longtest 2021-10-08T14:08:12-5b9206f/linux-amd64-longtest 2021-09-07T21:39:06-dcf3545/linux-386-longtest 2021-09-07T20:37:05-8078355/linux-386-longtest 2021-09-07T20:27:30-23f4f0d/linux-386-longtest 2021-09-07T19:39:04-d92101f/linux-386-longtest 2021-06-22T02:44:43-197a5ee/linux-amd64-longtest 2021-04-02T05:24:14-aebc0b4/linux-amd64-longtest 2021-03-18T14:43:33-e726e2a/linux-amd64-longtest 2021-03-17T17:50:50-8628bf9/linux-amd64-longtest 2021-03-17T17:13:50-0bd308f/linux-amd64-longtest 2021-03-17T16:53:00-70d54df/linux-386-longtest 2021-03-17T16:53:00-70d54df/linux-amd64-longtest 2021-03-17T16:19:21-2f3db22/linux-386-longtest 2021-03-05T18:46:36-51d8d35/linux-amd64-longtest
@golang/release: were there changes to the linux-.*-longtest
image or network configuration around 2022-03-04 that might explain the increase in timeouts at that point?
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 37 (27 by maintainers)
Commits related to this issue
- cmd/go: add timestamps to script test output Go tests don't include timestamps by default, but we would like to have them in order to correlate builder failures with server and network logs. Since m... — committed to golang/go by bcmills 2 years ago
- cmd/go: set GIT_TRACE_CURL for tests on builders We have noticed a pattern of connection timeouts connecting to github.com on the builders. Adding tracing may shed some light on the underlying cause.... — committed to golang/go by bcmills 2 years ago
- cmd/go: declare net hosts in script tests Although we aren't precise about enforcing the hosts just yet, we can eventually use the declared hostnames to selectively skip tests (for example, if an ext... — committed to golang/go by bcmills a year ago
- cmd/go: skip the 'git' part of get_insecure_redirect in short mode Invoking 'git' adds about 200ms to this test on a fast machine, probably more on a slow one. (As a general habit we skip the 'git' t... — committed to golang/go by bcmills a year ago
- cmd/go: add a GODEBUG to limit the number of concurrent network connections I implemented this in order to debug connection failures on a new-to-me VM development environment that uses Cloud NAT. It ... — committed to golang/go by bcmills a year ago
Iāve been able to reproduce this locally on a VM behind a Google Cloud NAT with a limit of 128 ports.
The root cause is a combination of having a low limit on the number of (statically-allocated or semi-static) NAT ports available to the host, and the tests making a lot of
git
requests (which always establish a new connection) repeatedly to the samegithub.com
host.Empirically, the socket for the
github.com
server does not end up in theTIME-WAIT
state on the local machine, so it must be in theTIME-WAIT
state on the server side. RFC-1122 says that a host MAY accept aSYN
for a socket in theTIME-WAIT
state, but only if the initial sequence number is larger than the largest sequence number used for the previous connection.Perhaps something about the interaction between the VM and the NAT service causes that not to be the case here, or perhaps the Cloud NAT service refuses to reuse ports just in case the other host doesnāt support this optional behavior.
Either way, the NAT service ends up with all of the available ports in the hostās static range in the
TIME-WAIT
state on the server side, and something (either the NAT service or the GitHub server itself) drops the connections until theTIME_WAIT
interval expires.The general solution is to increase the number of ports available to the host, either by expanding the per-host static port range (as Heschi did) or by using dynamic NAT port assignments instead.
I suspect it could also be remedied by setting up a connection-pooling HTTPS proxy on the VM.
And, of course, we could reduce the number of times we connect to the same external hosts in tests. š
In the course of chasing something else it occurred to me that this might be NAT port exhaustion. GCE monitoring does show low levels of packet drops due to
OUT_OF_RESOURCES
. Iāve doubled the per-VM port allocation from 256 to 512, letās see if that helps.Hmm, but as a diagnostic I wonder if
GIT_TRACE_CURL_NO_DATA
and/orGIT_CURL_VERBOSE
in the script environment would help? Iāll give that a try, and we can see if it yields anything useful for debugging.Not that Iām aware of.