moby: Idle connections over an overlay network end up in a broken state after 15 minutes

Description

In a swarm setup using overlay networks, idle connections between 2 services will end up in a broken state after 15 minutes.

The issue is related to the way the Docker overlay network routes packets: iptables first marks them, then IPVS forwards them to the right host. However, the default expiration for an established connection in IPVS is 900 seconds (see ipvsadm -l --timeout), after which IPVS stops forwarding packets even though the TCP connection still exists. If this happens, any new packet on this connection is sent to the virtual IP for that service, which has no valid resolution, resulting in a broken state where the connection is stuck in limbo while the kernel forever tries to resolve that virtual IP.
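
For reference, the timeout in question can be inspected from inside the service's sandbox namespace on a node; a minimal sketch, assuming the namespace files live under /var/run/docker/netns/ (the file name itself is host-specific):

# List the sandbox network namespaces Docker created on this node
ls /var/run/docker/netns/
# Show the IPVS connection timeouts (tcp tcpfin udp) inside one of them
nsenter --net=/var/run/docker/netns/<sandbox-id> ipvsadm -l --timeout
# With the defaults this should report something like: Timeout (tcp tcpfin udp): 900 120 300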

Steps to reproduce the issue:

  1. Start 2 services on the same network (on different hosts, though it should be reproducible even on a single host?)
  2. docker exec into both of them; in one, start nc in listen mode, and in the other connect to that nc server using the service-name DNS entry (a sketch of these commands follows the list)
  3. Send a packet from the client to the server; everything is fine
  4. Find your netns and find your connection by running nsenter --net=2cc18e502f81 ipvsadm -lnc
  5. Wait for the connection to expire and be removed from the list
  6. Send another packet: nothing ever gets there, the connection doesn’t time out, and tcpdump shows lots of ARP packets going out
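
A minimal sketch of steps 2 and 3 (the service name "server", the port, and the container names are placeholders, and the listen flags vary between nc variants):

# Inside the server service's container: listen on a port
docker exec -it <server-container> nc -l -p 9000
# Inside the client service's container: connect using the service-name DNS entry
docker exec -it <client-container> nc server 9000
# Type a line on the client side; it should show up on the server side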

Describe the results you received:

The packet never reaches the target; the kernel is stuck doing ARP requests over and over.

Describe the results you expected:

Either have the connection properly time out, or find a way to restore the routing in IPVS.

Additional information you deem important (e.g. issue happens only occasionally):

This can currently be worked around by setting net.ipv4.tcp_keepalive_time to less than 900 seconds, so that keepalive probes keep the connection from expiring in IPVS, but I’m not sure it’s a valid way to deal with this; at the very least, this behavior should be documented.
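
For example, a quick way to try that on a node, assuming a value of 600 seconds (any value comfortably below 900 should do; depending on the kernel, the setting may also need to be applied inside the containers):

# Start sending keepalive probes after 10 minutes of idle time instead of 2 hours
sysctl -w net.ipv4.tcp_keepalive_time=600
# Verify the value currently in effect
sysctl net.ipv4.tcp_keepalive_time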

Output of docker version:

Client:
 Version:      1.13.1
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:38:28 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.1
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:38:28 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 2
 Running: 2
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 1.13.1
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: true
Logging Driver: fluentd
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: l3e2evjei4cvcdgjqavtrztgo
 Is Manager: false
 Node Address: 172.24.0.100
 Manager Addresses:
  172.24.0.200:2377
  172.24.0.50:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.2.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.796 GiB
Name: worker-1
ID: DR4G:LZEQ:YSQ7:CYTR:FAXW:ZNVJ:E4AZ:BX5L:QYYG:ZDY5:SO7U:TFZW
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Labels:
 dawn.node.type=worker
 dawn.node.subtype=app
Experimental: false
Insecure Registries:
 172.24.0.50:5000
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

My current test setup is 5 vagrant boxes (2 managers + 3 workers), but it should happen in any environment.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 26
  • Comments: 18 (4 by maintainers)

Most upvoted comments

@GabKlein My current setup uses the following:

net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 10

I took the values from https://access.redhat.com/solutions/23874 and tweaked them slightly for our setup. I haven’t run into the issue since then.

To check if it’s working, you can use nsenter and ipvsadm to take a look at your connections and check whether they are being kept alive properly (see this article for details on how to do that).
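
A minimal sketch of that check (again, the namespace file name under /var/run/docker/netns/ is host-specific):

# Inspect the IPVS connection table inside the sandbox namespace (re-run it periodically)
nsenter --net=/var/run/docker/netns/<sandbox-id> ipvsadm -lnc
# If keepalives are working, the expire column of the ESTABLISHED entry keeps
# being reset instead of counting down to zero until the entry disappears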

This problem is due to the IPVS kernel module. Look at this line: https://github.com/torvalds/linux/blob/master/net/netfilter/ipvs/ip_vs_proto_tcp.c#L366

I changed the IP_VS_TCP_S_ESTABLISHED timeout from 900 to a larger value, recompiled the module, and reloaded the ip_vs and ip_vs_rr kernel modules, and the problem is gone. (Maybe reloading just ip_vs is also fine; not tested.)
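
As an untested aside (not something tried in this thread): ipvsadm can also change these timeouts at runtime with --set, which might avoid recompiling the module if it is run inside the service's sandbox namespace; whether the change survives service updates is unclear, since Docker manages that namespace. A sketch:

# Raise the established-connection timeout to 6 hours (values are tcp tcpfin udp, in seconds)
nsenter --net=/var/run/docker/netns/<sandbox-id> ipvsadm --set 21600 120 300
# Confirm the new timeouts
nsenter --net=/var/run/docker/netns/<sandbox-id> ipvsadm -l --timeout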

Compared with the following default kernel parameters, the IP_VS_TCP_S_ESTABLISHED value of IPVS is obviously too small!

net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300

On the other hand, tuning kernel parameters like net.ipv4.tcp_keepalive_time does not work for me. Even with the default values, I cannot capture TCP keepalive packets when I should, and thus the connection is always dropped/reset by IPVS eventually. I think this is due to my application: even though the kernel supports TCP keepalive, the application has to enable it properly. See http://www.tldp.org/HOWTO/TCP-Keepalive-HOWTO/programming.html
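
One way to check whether an application actually arms the keepalive timer on its sockets is ss with timer output; a sketch (port 5432 is just an example):

# Established TCP sockets with their timers; a keepalive-enabled connection shows
# something like timer:(keepalive,9min57sec,0) in the last column
ss -tno state established '( sport = :5432 or dport = :5432 )'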

I’m using Ansible to provision my servers, and it stores the variables in a file in /etc/sysctl.d.

If you are not rebooting, you can create the files and run sysctl --system to reload all configuration files; it will also tell you what was loaded in which order, so you can see if anything else might be overriding your config.
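
For example, a minimal sketch of that approach (the file name is arbitrary; the values are the ones quoted above):

cat > /etc/sysctl.d/90-tcp-keepalive.conf <<'EOF'
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 10
EOF
# Reload every sysctl configuration file and print what was applied, in order
sysctl --system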

@christopherobin How did you manage to work around this issue? I’m having the problem between my app, which creates a connection pool, and my db. After 15 minutes of being idle, the app is not able to reconnect. I tried adding a sysctl file (echo "net.ipv4.tcp_keepalive_time = 60" > /etc/sysctl.d/60-keepalive.conf) without success. My app is still hanging after 15 minutes of being idle 😕

@christopherchines With IPVS or any other man-in-the-middle NAT/firewall, the TCP keep-alive timer has to be tuned when you have “silent” long-lived sessions. I will add a note about this in the documentation.

The connection would have been terminated if the TCP packet had been delivered to a different backend and resulted in an RST from that backend. But I guess what’s happening here is that after the initial session expires, when IPVS gets a TCP packet that is not a SYN, it drops it instead of sending it to the backend. This makes sense because it is a new TCP session for IPVS and doesn’t have the SYN bit set.