moby: Upgrade to 20.10 breaks swarm network
Description
Steps to reproduce the issue:
- Install Docker 19.03 on Ubuntu 20.04 or CentOS 8
- Init Swarm
- Start some services with docker stack deploy
- Upgrade Docker from 19.03 to 20.10 (see the command sketch below)
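For reference, a minimal command sketch of these steps (assuming a single-node swarm, a stack file named stack.yml, and the Docker CE packages on Ubuntu; all names are illustrative):

docker swarm init
docker stack deploy -c stack.yml mystack
# then upgrade the engine packages, e.g. on Ubuntu:
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io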
Describe the results you received: Containers of the services can't start; there is an error in the logs:
Dec 10 06:21:03 dockerd[3160859]: time="2020-12-10T06:21:03.150920367Z" level=error msg="fatal task error" error="starting container failed: container 9f93a21ac2e3be11a65c91f3cfde555a415eea47c636bef432d5d2e4b08afff4: endpoint create on GW Network failed: failed to create endpoint gateway_f8cabe848464 on network docker_gwbridge: network 28d599d44202f2acdc85e42437332ddb41a81bd7f0622bc0724761ec9b49082a does not exist" module=node/agent/taskmanager node.id=u7qdqny1doho69k3nariuo1ru service.id=vhtg6aoyt360k7mluiwmshqf0 task.id=aqgsfcw88m5kujnrly74o4wh4
Describe the results you expected: containers are running
Additional information you deem important (e.g. issue happens only occasionally): we have two installations with this issue, which appeared after the upgrade to 20.10; recreating the services didn't help, and re-initing the swarm didn't help.
# docker network list
NETWORK ID     NAME              DRIVER    SCOPE
e45b9b63c4ae   bridge            bridge    local
28d599d44202   docker_gwbridge   bridge    local
2aa80dc0cc04   host              host      local
w9rpuika2x0d   ingress           overlay   swarm
62f0fb2fdf28   none              null      local
Output of docker version:
Client: Docker Engine - Community
Version: 20.10.0
API version: 1.41
Go version: go1.13.15
Git commit: 7287ab3
Built: Tue Dec 8 18:59:40 2020
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.0
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: eeddea2
Built: Tue Dec 8 18:57:45 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Output of docker info:
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Build with BuildKit (Docker Inc., v0.4.2-docker)
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 12
Server Version: 20.10.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: u7qdqny1doho69k3nariuo1ru
Is Manager: true
ClusterID: rdq2vi44m2lkz34tdow1dvip4
Managers: 1
Nodes: 1
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Data Path Port: 4789
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 127.0.0.1
Manager Addresses:
127.0.0.1:2377
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.4.0-56-generic
Operating System: Ubuntu 20.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.749GiB
Name: cloud.filesanctuary.net
ID: JBWW:XVUE:3XW4:OQYT:HJHK:OSRV:PFHK:PFZP:S3DV:HPZ7:NYWD:OWQO
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
WARNING: No blkio weight support
WARNING: No blkio weight_device support
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 11
- Comments: 65 (4 by maintainers)
We also have connectivity problems in our Docker swarm (3 RedHat 8.3 VM nodes). The services running in the containers are not accessible through the swarm-mode routing mesh, only via the explicit host IP.
After some investigation, we found that the problem is related to the UDP port 4789 packets that Docker uses to carry swarm traffic: these packets are dropped by the source node and never reach the destination node.
To resolve this issue we had to disable the following offload feature:
ethtool -K [network] tx-checksum-ip-generic off
Update: a similar problem is described in https://github.com/flannel-io/flannel/issues/1279
I can confirm this is exactly the solution for at least my case (CentOS 8.3 Stream, Docker 20.10.5) on VMware ESXi 6.7.
After executing it on all swarm machines, the routing mesh now works! This seems to be reboot-safe.
Cheers @txtdevelop!
Edit/PS: it may be reboot-safe, but after a recent dnf update the setting was lost again. For anyone needing it: the setting has to be made persistent some other way, since ETHTOOL_OPTS= seems not to be recognized on CentOS 8 Stream when using NetworkManager.
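A minimal sketch of one way to make it persistent under NetworkManager, via a dispatcher script (the interface name ens192 and the file name are assumptions, not from the original comment):

#!/bin/bash
# Sketch for /etc/NetworkManager/dispatcher.d/99-disable-tx-checksum (make it executable).
# Re-applies the offload setting whenever the interface comes up.
IFACE="$1"
ACTION="$2"
if [ "$IFACE" = "ens192" ] && [ "$ACTION" = "up" ]; then
    /usr/sbin/ethtool -K "$IFACE" tx-checksum-ip-generic off
fi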
Hello everyone,
I was having the exact same issue for a swarm cluster, built of Ubuntu 20.04.4 LTS VMs, running on ESXi 6.7. I spent countless hours troubleshooting it. My main focus was iptables, since that made the most sense to me.
However, in my case, running the command below on all cluster nodes immediately fixed my problem. Now, ingress publishing works like a charm!
sudo ethtool -K <interface> tx-checksum-ip-generic off
It’s worth trying!
Best regards, Ivan Spasov
I was trying to set up a swarm over a Hetzner private network (using a vSwitch).
Mesh routing was not working; I could only make it work in global / host mode. I tried everything with the firewall and the ethtool workarounds listed above, and tried changing Linux distro (AlmaLinux 8 and Debian 11), with zero luck. Then I found this comment on Reddit which quite saved my life.
So, if you can't get Swarm working over a Hetzner network and you have already tried everything, check your MTUs: you need to adjust the Docker networks' MTU so that it is lower than or equal to 1450, which is the Hetzner VLAN MTU.
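A minimal sketch of lowering the MTU on an overlay network (the network name is illustrative; com.docker.network.driver.mtu is the relevant option):

docker network create \
  --driver overlay \
  --opt com.docker.network.driver.mtu=1450 \
  my_overlay_net

In a stack file, the same option can be set under the network's driver_opts.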
We’ve run into this issue as well. The strange thing is we had two essentially identical environments, one has the issue, the other works fine.
These are the package versions we're using, but it's probably not the versions, since the other environment has the same ones and works fine.
The symptom is that the overlay network doesn’t work. The way to test this is with tcpdump:
When it’s broken you only see packets going out, but no packets coming in. You need to have some containers running to trigger overlay traffic.
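For example, a rough sketch of watching the VXLAN data-path traffic (port 4789, matching the Data Path Port in the docker info output above; the interface name is an assumption):

# run on both nodes while containers on the overlay network exchange traffic
sudo tcpdump -ni ens192 udp port 4789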
In our case we needed to apply the ethtool tx-checksum-ip-generic off workaround
on all swarm hosts. We added it as a pre-up command in /etc/network/interfaces to fix this; a sketch follows.
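A sketch of what that stanza could look like (the interface name and addressing method are assumptions):

auto ens192
iface ens192 inet dhcp
    pre-up /sbin/ethtool -K ens192 tx-checksum-ip-generic off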
The only difference we’ve found is that one environment is VM version 13 (which works) and the other is VM version 14 (which doesn’t). We’ve found reference to a VMWare PR 2766401 which refers to a bug causing the vmxnet driver to drop packets. This is apparently fixed from VM version 15.
So our hypothesis is that if you're running VM version 14 with the Debian 5.10.92-2 kernel it breaks, but running an older kernel version (in our case 4.19.98-1+deb10u1) or an older VM version works fine.
For reference, these versions worked everywhere for us (prior to upgrading)
For reference, we were running VMware ESXi, 7.0.3, 19193900 on both environments.
Encountering this same issue, with the caveat that the tx-checksum-ip-generic off fix doesn't seem to work for me.
Works for us on the Docker Swarm worker node with CentOS 8.3 and Docker 20.10.5 Thank you @sgohl
I use swarm and I had network connectivity issues right after migrating to docker 20.10.x. After struggling a bit, I was able to find out the problem and to fix it.
I use overlay networks for my swarm services and it's very common that my services are defined in several networks. So it basically means that my services have several IPs (one for each of their networks).
In the example below, my nginx has one hostname (server1) but at least 2 IPs (ip1 in network net1 and ip2 in network net2). Now, here comes the interesting part: Docker 19.03.x and Docker 20.10.x behave differently when it comes to resolving the IP of the host server1:
Docker 19.03.x ALWAYS returns the same IP (which can be either ip1 or ip2 in my example above),
whereas Docker 20.10.x returns ip1 and ip2 ALTERNATELY (round-robin).
Now my problem was that I was using hostnames in my services, then using Go primitives such as ResolveTCPAddr to get the IP and connect to other services, and I used a lib such as pollon (https://github.com/sorintlab/pollon/blob/248c68238c160c056cd51d0e782276cef5c64ce4/pollon.go#L130) to track IP changes and reinit connections each time an IP change was detected…
So, since Docker 20.10 now returns a different IP after each DNS request when services have several IPs, I was endlessly losing connections…
After realizing this, I had to modify the code of my services to take this new behavior into account.
I don’t know if what I’m describing here could be related to this issue. I’m just giving my feedback of the issues I had during my docker migration in case it may help someone…
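As a rough illustration of the resolution behavior described above (a sketch, assuming a service named server1 attached to two overlay networks and an image that ships nslookup):

# run inside a task attached to the same networks; on 20.10 the returned
# address may alternate between the two networks across lookups
for i in 1 2 3; do nslookup server1; done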
I’ve failed to reproduce using unencrypted networks (and successfully reproduced using encrypted networks). I think it is safe to conclude that there is a correlation between encrypted networks and the Kernel update. I’ve attached a simplified stack.yaml file for reference.
I’ve created a new issue (https://github.com/moby/moby/issues/43443) to prevent confounding the two potentially different issues here. If it is later determined to be the same issue, we can rejoin them then.
We recently encountered a similar issue on Azure, i.e. containers could ping containers on other nodes, but other traffic would hang indefinitely (e.g. curl, mysql, etc.). Our issue was resolved by a different solution than those presented in this thread, so I’m posting it here for completeness/awareness.
We traced the root cause of our issue to an Ubuntu kernel update, specifically the 5.4.0-1073 kernel.
Downgrading the kernel to 5.4.0-1072 (by removing the 1073 version) restored cross-node container connectivity.
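A sketch of checking the booted kernel and removing the newer package on Ubuntu (the package name below assumes the Azure kernel flavour and should be verified with dpkg first):

uname -r                                  # confirm which kernel is booted
dpkg -l 'linux-image-*' | grep '^ii'      # list installed kernel packages
sudo apt-get remove linux-image-5.4.0-1073-azure   # hypothetical package name
sudo reboot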
We initially thought the connectivity loss was related to a Docker Swarm upgrade (specifically to 20.10). However, we later determined that it wasn’t the Docker upgrade at all – it was the reboot that we performed while doing it (which loaded the new Kernel on our test environments).
Edit – the new Kernel (1073) went live earlier this week (2022-03-22*)
Seems to be a problem only with the VMware virtual NIC when used with the VMXNET3 driver.
See : https://mails.dpdk.org/archives/dev/2018-September/111646.html
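To check whether this offload is currently enabled on a NIC before applying the workaround (the interface name is an assumption):

ethtool -k ens192 | grep tx-checksum-ip-generic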
Hi everyone, same problem here.
I'm trying to reach a service on port 2002 exposed through an overlay network on my swarm cluster, but it's impossible when going through localhost; on the contrary, it works if targeting a remote node.
The tx-checksum-ip-generic off trick DOES work, but I do not want to use it, as it's not normal to have to use it.
IMPORTANT: linux-image-4.19.0-17-amd64 4.19.194-3
Thank you for your work !!!
@RicardoViteriR Thank you, this is the only solution that helped me!