go: x/build,cmd/go: frequent connection timeouts to github.com since 2022-03-04

#!watchflakes
post <- pkg ~ `cmd/go` && `github.com.*(Connection timed out|TLS connection)` && date < "2023-07-01"
# cd $WORK/tmp/d2/src/github.com/myitcv; git clone -- https://github.com/myitcv/vgo_example_compat $WORK/tmp/d2/src/github.com/myitcv/vgo_example_compat
            Cloning into '$WORK/tmp/d2/src/github.com/myitcv/vgo_example_compat'...
            fatal: unable to access 'https://github.com/myitcv/vgo_example_compat/': Failed to connect to github.com port 443: Connection timed out

greplogs --dashboard -md -l -e 'github.com.* Connection timed out' --since=2021-01-01

2022-04-22T19:02:29-808d40d/linux-386-longtest
2022-04-22T15:44:57-1899472/linux-386-longtest
2022-04-22T04:42:23-1e59876/linux-386-longtest
2022-04-22T00:25:08-c9031a4/linux-amd64-longtest
2022-04-22T00:05:54-c510cd9/linux-386-longtest
2022-04-22T00:05:54-c510cd9/linux-amd64-longtest
2022-04-19T15:07:49-caa4631/linux-amd64-longtest
2022-04-15T19:02:54-df08c9a/linux-386-longtest
2022-04-14T18:00:13-dd97871/linux-386-longtest
2022-04-14T18:00:13-dd97871/linux-amd64-longtest
2022-04-13T17:48:12-517781b/linux-386-longtest
2022-04-13T01:15:22-b55a2fb/linux-386-longtest
2022-04-12T22:35:20-fc1d4c1/linux-amd64-longtest
2022-04-12T22:32:01-6183920/linux-386-longtest
2022-04-12T05:46:57-2b31abc/linux-amd64-longtest
2022-04-11T16:31:45-e4e033a/linux-386-longtest
2022-04-11T16:31:43-036b615/linux-amd64-longtest
2022-04-11T16:31:33-494b79f/linux-amd64-longtest
2022-04-08T18:30:53-3a19102/linux-386-longtest
2022-03-15T17:04:57-9b112ce/linux-amd64-longtest
2022-03-07T13:47:51-0e2f1ab/linux-386-longtest
2022-03-04T20:02:41-2b8aa2b/linux-386-longtest
[Note the 5-month gap here! šŸ¤”]
2021-10-08T14:08:12-5b9206f/linux-386-longtest
2021-10-08T14:08:12-5b9206f/linux-amd64-longtest
2021-09-07T21:39:06-dcf3545/linux-386-longtest
2021-09-07T20:37:05-8078355/linux-386-longtest
2021-09-07T20:27:30-23f4f0d/linux-386-longtest
2021-09-07T19:39:04-d92101f/linux-386-longtest
2021-06-22T02:44:43-197a5ee/linux-amd64-longtest
2021-04-02T05:24:14-aebc0b4/linux-amd64-longtest
2021-03-18T14:43:33-e726e2a/linux-amd64-longtest
2021-03-17T17:50:50-8628bf9/linux-amd64-longtest
2021-03-17T17:13:50-0bd308f/linux-amd64-longtest
2021-03-17T16:53:00-70d54df/linux-386-longtest
2021-03-17T16:53:00-70d54df/linux-amd64-longtest
2021-03-17T16:19:21-2f3db22/linux-386-longtest
2021-03-05T18:46:36-51d8d35/linux-amd64-longtest

@golang/release: were there changes to the linux-.*-longtest image or network configuration around 2022-03-04 that might explain the increase in timeouts at that point?

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 37 (27 by maintainers)

Most upvoted comments

I’ve been able to reproduce this locally on a VM behind a Google Cloud NAT with a limit of 128 ports.

The root cause is a combination of a low limit on the number of (statically or semi-statically allocated) NAT ports available to the host and the tests repeatedly making git requests, each of which establishes a new connection, to the same github.com host.
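
As a rough illustration of that pattern, here is a minimal Go sketch (assuming github.com:443 as the destination) of how each fresh connection claims its own local source port, and therefore, behind Cloud NAT, its own NAT port:

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        // Each iteration dials a fresh TCP connection to the same host, much as
        // each separate git invocation does. Every connection gets its own local
        // source port, and behind Cloud NAT its own NAT port, which cannot be
        // reused while the far side still holds the old connection in TIME_WAIT.
        for i := 0; i < 5; i++ {
            conn, err := net.DialTimeout("tcp", "github.com:443", 10*time.Second)
            if err != nil {
                // With the NAT port range exhausted, this is where a
                // "Connection timed out" error would surface.
                fmt.Println("dial:", err)
                continue
            }
            fmt.Println("connection", i, "uses local address", conn.LocalAddr())
            conn.Close() // closing locally does not immediately free the NAT-side port
        }
    }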

Empirically, the socket for the github.com server does not end up in the TIME_WAIT state on the local machine, so it must be in the TIME_WAIT state on the server side. RFC 1122 says that a host MAY accept a SYN for a socket in the TIME_WAIT state, but only if the initial sequence number is larger than the largest sequence number used for the previous connection.

Perhaps something about the interaction between the VM and the NAT service causes that not to be the case here, or perhaps the Cloud NAT service refuses to reuse ports just in case the other host doesn’t support this optional behavior.

Either way, the NAT service ends up with all of the available ports in the host’s static range tied up by connections in the TIME_WAIT state on the server side, and something (either the NAT service or the GitHub server itself) drops new connection attempts until the TIME_WAIT interval expires.

The general solution is to increase the number of ports available to the host, either by expanding the per-host static port range (as Heschi did) or by using dynamic NAT port assignments instead.

I suspect it could also be remedied by setting up a connection-pooling HTTPS proxy on the VM.

And, of course, we could reduce the number of times we connect to the same external hosts in tests. šŸ˜…
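
For contrast, a minimal Go sketch of the kind of connection reuse that a pooling client (or a pooling proxy in front of git) provides: a single shared net/http client keeps its TCP/TLS connection alive across requests to the same host, so repeated requests do not each claim a new NAT port. The https://github.com/ URL here is only a stand-in:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // One shared client: its Transport keeps the TCP/TLS connection alive
        // between requests (HTTP keep-alive), so several requests to the same
        // host reuse a single source (and NAT) port instead of allocating a new
        // one each time.
        client := &http.Client{}
        for i := 0; i < 3; i++ {
            resp, err := client.Get("https://github.com/")
            if err != nil {
                fmt.Println("request:", err)
                return
            }
            io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
            resp.Body.Close()
            fmt.Println("request", i, "status:", resp.Status)
        }
    }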

In the course of chasing something else, it occurred to me that this might be NAT port exhaustion. GCE monitoring does show low levels of packet drops due to OUT_OF_RESOURCES. I’ve doubled the per-VM port allocation from 256 to 512; let’s see if that helps.

Hmm, but as a diagnostic, I wonder whether setting GIT_TRACE_CURL_NO_DATA and/or GIT_CURL_VERBOSE in the script environment would help. I’ll give that a try, and we can see whether it yields anything useful for debugging.
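
A minimal sketch of what that could look like for a single git subprocess, using the repository from the log above purely as a stand-in; GIT_TRACE_CURL is added as well, since GIT_TRACE_CURL_NO_DATA only trims the payload data out of an already-enabled curl trace:

    package main

    import (
        "os"
        "os/exec"
    )

    func main() {
        // Run one git request with curl tracing enabled so the connection setup
        // (DNS lookup, TCP connect, TLS handshake) shows up in the output.
        cmd := exec.Command("git", "ls-remote", "https://github.com/myitcv/vgo_example_compat")
        cmd.Env = append(os.Environ(),
            "GIT_TRACE_CURL=1",         // dump curl's trace for the HTTPS transport
            "GIT_TRACE_CURL_NO_DATA=1", // ...but leave the payload data out of that trace
            "GIT_CURL_VERBOSE=1",       // older knob; prints curl's verbose connect/TLS output
        )
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        _ = cmd.Run()
    }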

Not that I’m aware of.