moby: Idle connections over overlay network end up in a broken state after 15 minutes
Description
In a swarm setup using overlay networks, idle connections between two services end up in a broken state after 15 minutes.
The issue is related to the way the docker overlay network routes packets: iptables first marks them and then ipvs forwards them to the right host. However, the default expiration for established connections in ipvs is 900 seconds (see `ipvsadm -l --timeout`), after which ipvs stops forwarding packets even though the TCP connection still exists. Once this happens, any new packet on the connection is sent to the virtual IP for that service, which no longer has a valid resolution, resulting in a broken state where the connection is stuck in limbo while the kernel forever tries to resolve that virtual IP.
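The expiry in question can be inspected directly with ipvsadm (this must be run as root on a host with the `ip_vs` module loaded; run it inside the overlay network namespace to see docker's own IPVS instance):

```shell
# List the IPVS connection timeouts (tcp / tcpfin / udp).
# On an unmodified kernel this typically prints:
#   Timeout (tcp tcpfin udp): 900 120 300
ipvsadm -l --timeout
```

The first value, 900 seconds, is the established-TCP expiry described above.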
Steps to reproduce the issue:
- Start 2 services on the same network (on different hosts, though it should be reproducible even on a single host?)
- `docker exec` into both of them; in one, start a `nc` command in listen mode, and in the other, connect to that `nc` server by using the service name DNS
- Send a packet from the client to the server, everything is fine
- Find your netns and find your connection by doing `nsenter --net=2cc18e502f81 ipvsadm -lnc`
- Wait for the connection to expire and be removed from the list
- Send another packet: nothing ever gets there and the connection doesn't time out; `tcpdump` shows lots of ARP packets going out
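Put together, a minimal reproduction might look like the following sketch. The service and network names are illustrative, the `<task-id>` and `<netns-id>` placeholders differ per host, and each `docker exec` has to run on whichever node hosts that task:

```shell
# On a manager: create an overlay network and two services attached to it.
docker network create --driver overlay testnet
docker service create --name server --network testnet alpine sleep 1d
docker service create --name client --network testnet alpine sleep 1d

# In the server task: listen with nc.
docker exec -it server.1.<task-id> nc -l -p 9000

# In the client task: connect via the service-name DNS and send a line.
docker exec -it client.1.<task-id> nc server 9000

# On the host, inspect the IPVS connection table inside the overlay netns.
# Then wait >900 s and send another line: it never arrives, and tcpdump
# shows the kernel endlessly ARPing for the service's virtual IP.
nsenter --net=/var/run/docker/netns/<netns-id> ipvsadm -lnc
```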
Describe the results you received:
The packet never reaches the target; the kernel is stuck doing ARP requests over and over.
Describe the results you expected:
Either have the connection properly timeout, or find a way to restore the routing in ipvs.
Additional information you deem important (e.g. issue happens only occasionally):
Currently this can be worked around by setting `net.ipv4.tcp_keepalive_time` to less than 900 seconds, to make sure the TCP connection doesn't expire in ipvs, but I'm not sure if it's a valid way to deal with this; at the very least this behavior should be documented.
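As a sketch of that workaround (the exact values are illustrative), a sysctl drop-in keeps the keepalive idle time under the 900-second IPVS expiry; on a real host the file would go in `/etc/sysctl.d/` and be applied with `sysctl --system`:

```shell
# Illustrative keepalive settings: probe after 600 s idle, well under the
# 900 s IPVS expiry, so IPVS never expires the idle connection.
cat > 60-keepalive.conf <<'EOF'
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3
EOF
cat 60-keepalive.conf

# For comparison, the kernel default idle time (usually 7200 s, i.e. far
# past the IPVS window):
cat /proc/sys/net/ipv4/tcp_keepalive_time
```

Note that this changes behavior for every TCP connection on the host, not just traffic over the overlay network.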
Output of `docker version`:
Client:
Version: 1.13.1
API version: 1.26
Go version: go1.7.5
Git commit: 092cba3
Built: Wed Feb 8 06:38:28 2017
OS/Arch: linux/amd64
Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Go version: go1.7.5
Git commit: 092cba3
Built: Wed Feb 8 06:38:28 2017
OS/Arch: linux/amd64
Experimental: false
Output of `docker info`:
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 2
Server Version: 1.13.1
Storage Driver: overlay
Backing Filesystem: xfs
Supports d_type: true
Logging Driver: fluentd
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: active
NodeID: l3e2evjei4cvcdgjqavtrztgo
Is Manager: false
Node Address: 172.24.0.100
Manager Addresses:
172.24.0.200:2377
172.24.0.50:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-514.2.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.796 GiB
Name: worker-1
ID: DR4G:LZEQ:YSQ7:CYTR:FAXW:ZNVJ:E4AZ:BX5L:QYYG:ZDY5:SO7U:TFZW
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Labels:
dawn.node.type=worker
dawn.node.subtype=app
Experimental: false
Insecure Registries:
172.24.0.50:5000
127.0.0.0/8
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.):
My current test setup is 5 vagrant boxes (2 managers + 3 workers), but it should happen in any environment.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 26
- Comments: 18 (4 by maintainers)
Commits related to this issue
- Idle connections over overlay network ends up in a broken state after 15 minutes xref https://github.com/moby/moby/issues/31208 — committed to dyrnq/kubeadm-vagrant by dyrnq 3 years ago
@GabKlein My current setup uses the following:

I took the values from https://access.redhat.com/solutions/23874 and tweaked them slightly for our setup. Didn't run into the issue since then.

To check if it's working you can use `nsenter` and `ipvsadm` to take a look at your connections and check if they are being pinged properly (see this article for details on how to do that).

Please refer to: https://github.com/moby/moby/issues/37466#issuecomment-405307656 and https://success.docker.com/article/ipvs-connection-timeout-issue

This problem is due to the kernel module IPVS. Look at this line: https://github.com/torvalds/linux/blob/master/net/netfilter/ipvs/ip_vs_proto_tcp.c#L366

I changed the `IP_VS_TCP_S_ESTABLISHED` timeout from 900 to a larger value, recompiled the module, and reloaded the `ip_vs` and `ip_vs_rr` kernel modules, and this problem is gone. (Maybe reloading just `ip_vs` is also fine, not tested.) Compared with the other default kernel timeouts, the `IP_VS_TCP_S_ESTABLISHED` value of IPVS is obviously too small!

On the other side, tuning kernel parameters like `net.ipv4.tcp_keepalive_time` does not work for me. Even using the default values, I cannot capture TCP keepalive packets when I should, and thus the connection will always be dropped/reset by IPVS eventually. I think it is due to my application: even though the kernel supports TCP keepalive, the application has to be set up properly. See http://www.tldp.org/HOWTO/TCP-Keepalive-HOWTO/programming.html

I'm using ansible to provision my servers and it stores the variables in a file in `/etc/sysctl.d`. If you are not rebooting, you can create the files and run `sysctl --system` to reload all configuration files; it will also tell you what was loaded in which order, so you can see if anything else might be overriding your config.

@christopherobin How did you manage to work around this issue? I'm having the problem between my app that creates a pool and my db. After 15 minutes being idle the app is not able to reconnect. I tried adding a sysctl file (`echo "net.ipv4.tcp_keepalive_time = 60" > /etc/sysctl.d/60-keepalive.conf`) without success. My app is still hanging after 15 minutes being idle 😕

@christopherchines With IPVS or any other man-in-the-middle NAT/firewall, the TCP keep-alive timer has to be tuned when you have "silent" long-lived sessions. I will add a note about this in the documentation.
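When the application itself cannot be changed to enable keepalives (as in the comment above about capturing no keepalive packets), one possible stopgap, not from this thread and assuming socat's option names per its manual, is to relay the connection through socat, which can enable SO_KEEPALIVE and the per-socket TCP keepalive knobs:

```shell
# Illustrative: local clients connect to 127.0.0.1:5432; socat forwards to
# the real service ("db" here) and sends keepalive probes after 600 s of
# idle, every 30 s, keeping the IPVS entry alive under its 900 s expiry.
socat TCP-LISTEN:5432,fork,reuseaddr \
      TCP:db:5432,keepalive,keepidle=600,keepintvl=30,keepcnt=3
```

Unlike the host-wide sysctl workaround, this only affects the relayed connections.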
The connection would have gotten terminated if the TCP packet was delivered to a different backend and resulted in a RST from that backend. But I guess what's happening here is that after the initial session expires, when IPVS gets a TCP packet that is not a SYN, it drops it and does not send it to the backend. This makes sense because it's a new TCP session for IPVS and it doesn't have the SYN bit set.