kubernetes: Low nf_conntrack_tcp_timeout_close_wait default causes intermittent ECONNREFUSED on GKE
Kubernetes version (use `kubectl version`):
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.5", GitCommit:"b0deb2eb8f4037421077f77cb163dbb4c0a2a9f5", GitTreeState:"clean", BuildDate:"2016-08-11T20:29:08Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.5", GitCommit:"b0deb2eb8f4037421077f77cb163dbb4c0a2a9f5", GitTreeState:"clean", BuildDate:"2016-08-11T20:21:58Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- Cloud provider or hardware configuration: GKE
- OS (e.g. from /etc/os-release): PRETTY_NAME="Debian GNU/Linux 7 (wheezy)"
- Kernel (e.g. `uname -a`): 3.16.0-4-amd64
What happened:
We’re seeing plenty of “connection refused” errors when connecting to the GCE metadata server from within pods in our GKE cluster. These connections mostly come from OAuth2 client libraries (in Go and Python) fetching an access token before talking to any GCP service. However, as explained below, this can happen to connections to other services as well.
Example (this one from Cloud SQL proxy):
2016/09/08 06:37:10 couldn't connect to "[project]:us-central1:[db-name]": Post https://www.googleapis.com/sql/v1beta4/projects/[project]/instances/[db-name]/createEphemeral?alt=json: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: getsockopt: connection refused
What you expected to happen:
No intermittent failures.
How to reproduce it (as minimally and precisely as possible):
- Create plenty of TCP connections from pods to a single host:port. Let the remote end half-close each connection, and do not close it from the pod side (see the sketch after this list).
- Let these CLOSE_WAIT connections expire from the node's conntrack table.
- Create more such connections; eventually the node's netfilter will reuse a NAT source port that, as far as the remote end is concerned, is still in use. Such connections are reset by the remote end, resulting in “connection refused” errors in userspace.
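A minimal sketch of these steps (an assumed reproducer, not the exact script used; it targets the metadata server, but any keep-alive server that idles out connections would do):

```python
# Open many keep-alive connections to a single host:port and never close them
# from our side; the remote end eventually half-closes them, leaving them in
# CLOSE_WAIT here.
import socket
import time

TARGET = ("169.254.169.254", 80)
REQUEST = (b"GET /computeMetadata/v1/instance/id HTTP/1.1\r\n"
           b"Host: metadata.google.internal\r\n"
           b"Metadata-Flavor: Google\r\n\r\n")

socks = []
for _ in range(200):                  # "plenty of TCP connections"
    s = socket.create_connection(TARGET)
    s.sendall(REQUEST)
    s.recv(4096)                      # read the response, but never close()
    socks.append(s)

# After ~30s of idleness the server closes its side; 60s later (the default
# nf_conntrack_tcp_timeout_close_wait) the node forgets the NAT mappings, while
# the server still remembers the connections and their source ports.
time.sleep(3600)
```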
Anything else we need to know:
TL;DR: the kernel's default nf_conntrack_tcp_timeout_close_wait of 60s is too low: http://marc.info/?l=netfilter-devel&m=117568928824030&w=2
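For reference, the knob can be inspected and raised on the node with sysctl; the 3600-second value below is only an illustration, not a value taken from this issue:

```
# check the current value (the kernel default is 60 seconds)
sysctl net.netfilter.nf_conntrack_tcp_timeout_close_wait

# raise it as a node-level workaround (3600 is an arbitrary example)
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_close_wait=3600
```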
I caught this with tcpdump on our GKE cluster. Here it is, with additional lines (the ones with 5-second resolution) from conntrack(1) outputs for the interesting 5-tuple. Actors:
- 10.192.2.35 is a pod IP
- 10.241.0.49 is the node IP
- 169.254.169.254 is the GCE metadata server
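The interleaved conntrack lines below can be reproduced with a loop roughly like this on the node (an assumed reconstruction; the exact command used is not shown):

```
# poll the conntrack entry for the interesting 5-tuple every 5 seconds
while true; do
  date +%T
  conntrack -L -p tcp --orig-src 10.192.2.35 --orig-dst 169.254.169.254 --orig-port-dst 80 2>/dev/null
  sleep 5
done
```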
pod 10.192.2.35 starts a connection; it is NATed to node source port 35715:
09:13:55.069526 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [S], seq 1235205249, win 29200, options [mss 1460,sackOK,TS val 148646959 ecr 0,nop,wscale 7], length 0
09:13:55.069696 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [S.], seq 3311972373, ack 1235205250, win 65535, options [mss 1420,eol], length 0
09:13:55.069742 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 1, win 29200, length 0
connection is established. HTTP request:
09:13:55.069982 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [P.], seq 1:216, ack 1, win 29200, length 215
09:13:55.070112 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 216, win 65320, length 0
09:13:55.070127 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 216, win 65535, length 0
HTTP response:
09:13:55.070411 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [P.], seq 1:510, ack 216, win 65535, length 509
09:13:55.070441 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 510, win 30016, length 0
another HTTP request:
09:13:55.071634 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [P.], seq 216:464, ack 510, win 30016, length 248
09:13:55.071731 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 464, win 65287, length 0
09:13:55.071763 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 464, win 65535, length 0
and another HTTP response:
09:13:55.071936 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [P.], seq 510:955, ack 464, win 65535, length 445
09:13:55.108816 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 955, win 31088, length 0
time passes, and conntrack sees the connection as established:
09:13:59 tcp 6 86395 ESTABLISHED src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:04 tcp 6 86390 ESTABLISHED src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:09 tcp 6 86385 ESTABLISHED src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:14 tcp 6 86380 ESTABLISHED src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:19 tcp 6 86375 ESTABLISHED src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:24 tcp 6 86370 ESTABLISHED src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
after 30s of inactivity the metadata server closes its side of the connection:
09:14:25.073869 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [F.], seq 955, ack 464, win 65535, length 0
and the client acks that:
09:14:25.112831 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 956, win 31088, length 0
but does not close its own side of the connection yet.
time passes, and conntrack now sees the connection as half-closed (a.k.a. CLOSE_WAIT):
09:14:29 tcp 6 55 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:34 tcp 6 50 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:40 tcp 6 44 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:45 tcp 6 39 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:50 tcp 6 34 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:55 tcp 6 29 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:15:00 tcp 6 24 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:15:05 tcp 6 19 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:15:10 tcp 6 14 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:15:15 tcp 6 9 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:15:21 tcp 6 4 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
after a minute, though (the default of the nf_conntrack_tcp_timeout_close_wait sysctl knob), conntrack loses patience and drops the entry from its table. As a result, the connection is no longer NATed, and when the client wakes up 2.5 minutes later to close the connection, its packet shows up with the non-NATed source IP:
09:17:00.019964 IP 10.192.2.35.35715 > 169.254.169.254.80: Flags [F.], seq 1235205713, ack 3311973329, win 31088, length 0
obviously the metadata server's TCP stack, having no connection for such a 5-tuple, replies with an RST:
09:17:00.020053 IP 169.254.169.254.80 > 10.192.2.35.35715: Flags [R], seq 3311973329, win 65535, length 0
So at this point the “TCP 10.241.0.49.35715 > 169.254.169.254.80” connection is still open in the metadata server's mind, since the FIN above did not change anything.
Therefore, 59 minutes later, the node happily maps a completely different connection (made from a different pod) to the same source port:
10:14:14.639439 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [S], seq 301038175, win 29200, options [mss 1460,sackOK,TS val 149551851 ecr 0,nop,wscale 7], length 0
However, since the metadata server still remembers the old connection, and the sequence numbers and the state of the connection do not match, it rejects the connection attempt with an RST:
10:14:14.639520 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [R.], seq 982994923, ack 3360800223, win 65535, length 0
This is seen in userspace as “connection refused”.
Interestingly, just ten minutes later there is yet another connection attempt, NATed to the same source port, and at this point apparently the metadata server no longer remembers the old connection, as it accepts the new one.
This suggests the lifetime of such half-closed connections on the metadata server is between 59 and 69 minutes, probably 60.
10:24:18.059755 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [S], seq 3149308939, win 29200, options [mss 1460,sackOK,TS val 149702706 ecr 0,nop,wscale 7], length 0
10:24:18.060016 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [S.], seq 146596554, ack 3149308940, win 65535, options [mss 1420,eol], length 0
10:24:18.060075 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 1, win 29200, length 0
10:24:18.060180 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [P.], seq 1:193, ack 1, win 29200, length 192
10:24:18.060375 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 193, win 65343, length 0
10:24:18.060392 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 193, win 65535, length 0
10:24:18.060661 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [P.], seq 1:271, ack 193, win 65535, length 270
10:24:18.060724 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 271, win 30016, length 0
10:24:18.061528 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [F.], seq 193, ack 271, win 30016, length 0
10:24:18.061647 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 194, win 65535, length 0
10:24:18.061674 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [F.], seq 271, ack 194, win 65535, length 0
10:24:18.061695 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 272, win 30016, length 0
Commits related to this issue
- Merge pull request #35454 from bowei/image-for-issue-32551: Automatic merge from submit-queue. Add test image for networking related tests. This test image is to support the e2e test for #32551. — committed to jessfraz/kubernetes by deleted user 8 years ago
- Merge pull request #35919 from bowei/issue-32551: Automatic merge from submit-queue. Adds TCPCloseWaitTimeout option to kube-proxy for sysctl nf_conntrack… — committed to kubernetes/kubernetes by deleted user 8 years ago
Some notes on reproducing the behavior, as the Linux NAT logic seems to be conditional on the flags present in the packet itself:
On the node (10.240.0.4), set the CLOSE_WAIT timeout to something small:
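The original snippet is not preserved in this copy of the issue; something along these lines, with 10 seconds as an arbitrary small value:

```
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_close_wait=10
```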
Connect to the metadata server from within the pod (10.180.3.24), and hold on to the socket:
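A hypothetical stand-in for the test.py mentioned below (the original script is not preserved here):

```python
# Open one TCP connection to the metadata server and hold on to the socket.
import socket
import time

s = socket.create_connection(("169.254.169.254", 80))
s.sendall(b"GET /computeMetadata/v1/instance/id HTTP/1.1\r\n"
          b"Host: metadata.google.internal\r\n"
          b"Metadata-Flavor: Google\r\n\r\n")
print(s.recv(4096))

# The server closes its side after ~30s of idleness; our socket enters
# CLOSE_WAIT. Sleep past the (shortened) conntrack CLOSE_WAIT timeout so the
# NAT entry expires before we act on the socket.
time.sleep(120)

s.close()          # the final packet leaves the node without being NAT'd
# s.sendall(b"x")  # alternative: sending data re-creates the conntrack entry,
#                  # so the packet does get NAT'd
```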
Wait for the conntrack entry to expire (it will be in CLOSE_WAIT).
Now if we close the socket, we will see that the last packet (it happens to be RST in this case) will NOT be NAT’d:
Tcpdump output for the interaction:
A subsequent run of test.py will get a RST.
However, if we send data instead of close(), the entry is recreated and the packet does get NAT'd. Tcpdump output:
(Not) reproducing in our cluster
When trying to recreate the situation inside our cluster, I observed that when a SYN is sent for a stale 5-tuple between our GCI machines (i.e. an initial SYN from machine A to B, while B is still holding a connection from A that no longer exists), the response is not a RST but an ACK from B->A, which results in a RST A->B. The subsequent SYN retry from A->B succeeds. In this case, the connect() from A does not return an error.