moby: Docker swarm load balancing not working over private network
Description
The problem is probably similar to #25325: Docker can't reach containers on hostB when I query hostA's public address.
I'm using Docker swarm with 2 hosts. They are connected via a wireguard tunnel and can reach each other; I'm able to ping each host from the other using the internal addresses.
Then I initialize swarm mode using the --advertise-addr, --data-path-addr and --listen-addr options, all pointed at the internal addresses. Both hosts are visible via docker node ls and show as active. There are no errors in syslog.
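Roughly, the init/join commands look like this (10.0.5.1/10.0.5.2 are the wireguard internal addresses from the setup below; the worker token placeholder is whatever the init prints):
# on host1 (10.0.5.1)
docker swarm init --advertise-addr 10.0.5.1 --data-path-addr 10.0.5.1 --listen-addr 10.0.5.1:2377
# on host2 (10.0.5.2), with the worker token printed by the init
docker swarm join --advertise-addr 10.0.5.2 --data-path-addr 10.0.5.2 --listen-addr 10.0.5.2:2377 \
  --token <worker-token> 10.0.5.1:2377
# back on host1: both nodes should be listed as Ready / Active
docker node ls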
But when I create a service with 2 replicas, I see strange behavior: accessing the service via one of the public IPs, I'm able to reach only the containers which are running on that particular node. Other requests fail with a timeout.
Steps to reproduce the issue:
- Set up a wireguard tunnel and check that it works fine.
- Set up Docker in swarm mode.
- Run a service. I'm using agrrh/dummy-service-py: it runs an HTTP service on port 80 and answers with the container's hostname plus a random UUID.
- Scale the service to at least 2 replicas:
  docker service create --name dummy --replicas 2 --publish 8080:80 agrrh/dummy-service-py
- Try to cycle through the replicas by querying hostA's address.
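For example, hammering the published port on one of the nodes (host1 stands for that node's public address):
# responses should alternate between both replicas; in my case every other request times out instead
for i in $(seq 1 10); do curl -s -m 5 http://host1:8080/; echo; done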
Describe the results you received:
As I said, requests to containers on other nodes fail:
$ http host1:port
{ "hostname": "containerA" } # this container running at host1
$ http host1:port
http: error: Request timed out (30.0s).
$ http host2:port
http: error: Request timed out (30.0s).
$ http host2:port
{ "hostname": "containerB" } # this container running at host2
Describe the results you expected:
I expect to be able to reach all of the running containers by querying the public address of any single node.
Additional information you deem important (e.g. issue happens only occasionally):
It seems to me that the wireguard tunnel itself is not the cause, as I'm still able to send pings between containers. For example, containerB can reach these containerA addresses:
- 10.255.0.4 @lo: ~0.050 ms (looks like this doesn't actually leave host2)
- 10.255.0.5 @eth0: ~0.700 ms (I can see it with tcpdump on the other end, so it's reachable!)
- 172.18.0.3 @eth1: ~0.050 ms (this probably doesn't leave host2 either)
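A minimal way to repeat these checks from host2 (the container name is a placeholder for whatever docker ps shows, and the image has to ship ping):
# exec into the replica running on host2 and ping containerA's overlay address
docker exec -it <containerB> ping -c 3 10.255.0.5
# repeat with 10.255.0.4 and 172.18.0.3 for the other interfaces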
Since I'm using --advertise-addr, I can see packets flowing between the hosts via the private interface.
I tried installing ntp and syncing the clocks, but this did not help.
I also attempted various fixes (e.g. turning off masquerading, re-creating the default bridge with a lower MTU, setting the default bind IP, etc.), but had no luck.
I have already reproduced the issue 3 times with a clean setup, and I'm ready to provide collaborators access to my test hosts if you would like to investigate on-site.
Output of docker version (same on both hosts):
Client:
Version: 18.03.0-ce
API version: 1.37
Go version: go1.9.4
Git commit: 0520e24
Built: Wed Mar 21 23:10:01 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm
Server:
Engine:
Version: 18.03.0-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.4
Git commit: 0520e24
Built: Wed Mar 21 23:08:31 2018
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 1
Server Version: 18.03.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: rdwi6u922eb93s3z3cq1vuih1
Is Manager: true
ClusterID: g8urrtm78sc68oro86k3wvjzf
Managers: 1
Nodes: 2
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 10.0.5.1
Manager Addresses:
10.0.5.1:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.13.0-37-generic
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 481.8MiB
Name: test1
ID: IS5W:2U5W:XDAE:UXIF:KXRR:FQSU:PI7K:UXEQ:OOHK:HC4O:TLZR:P4UU
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Additional environment details (AWS, VirtualBox, physical, etc.):
Wireguard setup guide (assuming you have wireguard installed):
### Server
cd /etc/wireguard
umask 077
wg genkey | tee server_private_key | wg pubkey > server_public_key
# /etc/wireguard/wg0.conf
[Interface]
Address = 10.0.5.1/32
SaveConfig = true
PrivateKey = <paste server private key here>
ListenPort = 51820
[Peer]
PublicKey = <paste client public key here>
AllowedIPs = 10.0.5.2/32
wg-quick up wg0
### Client
cd /etc/wireguard
umask 077
wg genkey | tee client_private_key | wg pubkey > client_public_key
# /etc/wireguard/wg0.conf
[Interface]
Address = 10.0.5.2/32
PrivateKey = <paste client private key here>
[Peer]
PublicKey = <paste server public key here>
Endpoint = <paste server IP here>:51820
AllowedIPs = 10.0.5.0/24
wg-quick up wg0
The hosts should be reachable via their internal addresses shortly after these steps.
About this issue
- State: open
- Created 6 years ago
- Reactions: 5
- Comments: 26 (2 by maintainers)
Managed to fix it for me.
Since requests to the webserver in my swarm are handed over by the ingress LB but the responses are timing out, I started testing whether it has something to do with the size of the packets being transferred. While playing with ping -s $packetsize, I realized that packets bigger than (if I remember correctly) 1420 bytes are dropped.
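For example, probing the path over the tunnel with the don't-fragment flag (10.0.5.2 stands for the peer's wireguard address from the setup above; -s is the ICMP payload size, so add 28 bytes of IP/ICMP headers for the full packet size):
ping -c 3 -M do -s 1392 10.0.5.2   # 1392 + 28 = 1420 bytes: should fit a 1420-MTU tunnel
ping -c 3 -M do -s 1400 10.0.5.2   # 1428 bytes: bigger than the tunnel MTU, should be dropped/rejected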
Wireguard uses an MTU of 1420 and Docker uses an MTU of 1500 by default.
After lowering the MTU of the ingress overlay network, everything is working as expected for me.
All steps are done on a manager node:
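Roughly (I used an MTU of 1280 here; any stacks or services that use the ingress network have to be removed first):
docker network rm ingress
systemctl restart docker
docker network create --driver overlay --ingress \
  --opt com.docker.network.driver.mtu=1280 ingress
systemctl restart docker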
At this point everything was working for me. I'm pretty sure an MTU of 1280 is a bit too low in this case, but I'm still testing.
Hope I could help some other people facing the same problems.
Greetings
Another important thing that I’ve noticed!
While recreating the ingress with the correct MTU fixes Outside World -> Swarm Cluster communication, it does not fix communication within your cluster!
I have an app (http-coordinator) that routes traffic between four of my containers (container1, container2, container3, container4), and I have two Swarm nodes. I was having an issue where, sometimes, requests to container3 and container4 were timing out, while requests to container1 and container2 weren't. And the issue is… drum roll
To fix this, you need to set the MTU in your docker-compose.yml files too! Replace 1450 with your network interface's MTU! :3
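A sketch of what that looks like in a stack/compose file (the service and network names are placeholders; the driver_opts line is the part that matters):
# docker-compose.yml
version: "3.7"
services:
  web:
    image: agrrh/dummy-service-py
    networks:
      - appnet
networks:
  appnet:
    driver: overlay
    driver_opts:
      com.docker.network.driver.mtu: "1450"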
OK, I have changed my wireguard MTU to 1500, which is the MTU of Docker and of my OpenStack interface, and it's working too. So I hope it won't split the packets now that I have the same MTU everywhere 😃
Yes, try Wireguard, it's really simple, just a little time-consuming to administer. As a DevOps person, I like it when it's automated like swarm 😄. Maybe if you have issues with wireguard, I could help you like you did for me 👍
Haha, right when you were typing your comment, I was updating my previous comment with newer things that I found while playing around with the MTU size. :3
The issue is that Docker uses MTU 1500 by default and it doesn’t inherit the MTU set by your network interface.
The MTU that you set on the ingress network should be the same as the MTU set on your network interface!
To get the MTU of your network, use ip a and find the interface whose address you passed as the advertise-addr parameter (if you haven't set it, it is the default network interface). In this case, we can see that the MTU of my network is 1450, so we are going to set our ingress network to 1450 as well!
So, check what MTU Wireguard is using by running ip a and looking at the mtu xxxx field, then use that value when creating your ingress network. It should, hopefully, work fine without any issues. 😃 (At least on my side it worked fine after I bumped from 1400 to 1450. My network interface has its MTU set to 1450 because it is a VXLAN network.)
OK, with your setup it's working!! 🥇 Thanks for everything. My multi-cloud swarm over wireguard is working well now 😃
@bdoublet91 I’m back from my vacation, so I decided to play around a bit more:
I created a Swarm with:
sudo docker swarm init --default-addr-pool 192.168.128.0/18 --advertise-addr 172.29.10.1
After creation, I inspected the overlay network to check what it looks like before we start to mess around with it. I added a worker node to my swarm and deployed my personal website with two replicas, so both VMs (manager and worker) had one container running my website.
Then I tried curl'ing my website from outside of my swarm VMs… and as we can see, it isn't working correctly: some requests work fine, some requests hang indefinitely…
Outdated incorrect explanation about the issue:
The Swarm VMs are communicating over a VXLAN connection, and then ON TOP of that Docker is communicating via VXLAN too, so this could be the reason why it isn't working! VXLAN requires a 50-byte header for traffic routing, which is why the MTU is set to 1450 on VXLAN connections instead of the default 1500 (1500 - 50 = 1450). Which is OK, but what if we are using a VXLAN connection for node communication too? (Example: your VMs are connected via VXLAN.) Then only 1450 - 50 = 1400 bytes are left for the inner packet. But… our Docker ingress connection expects 1450 bytes! So maybe this is what is causing the issue! Let's change the ingress network from the default MTU (I'm not sure, but I guess it is 1450 bytes) to 1400 bytes!
Docker, by default, uses an MTU of 1500 for ingress communication; it doesn't inherit the MTU configuration of the network interface. So what we need to do is change the MTU of the ingress to match the network interface that we are using for intra-node communication!
To get the MTU of your network, use ip a and find the interface whose address you passed as the advertise-addr parameter (if you haven't set it, it is the default network interface). In this case, we can see that the MTU of my network is 1450, so we are going to set our ingress network to 1450 as well!
First, let's inspect the ingress network again. There are new things here, but we can safely ignore them.
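For reference, a quick way to pull out just the bits we need (the plain docker network inspect ingress output shows the same):
# note the current subnet/gateway (IPAM) and driver options before removing anything
docker network inspect ingress --format '{{json .IPAM.Config}} {{json .Options}}'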
First, we need to remove all running stacks, then:
- docker network rm ingress (and systemctl restart docker)
- docker network create --driver overlay --ingress --opt com.docker.network.driver.mtu=1450 --subnet 192.168.128.0/24 --gateway 192.168.128.1 ingress (the subnet and gateway are in the IPAM section of the inspect output!)
- systemctl restart docker
Now let's inspect the network again. The configuration should be similar to your previous configuration. Don't worry if com.docker.network.driver.overlay.vxlanid_list changed, that's used for VXLAN tagging. (And you can set it to 4096 by using --opt com.docker.network.driver.overlay.vxlanid_list=4096 when creating the network.)
And then everything should work fine! I tried flooding curl 10.29.10.1:40000/br/ and none of the requests failed.
and none of the requests failed.@bdoublet91 let’s suppose you have a Linux network interface that has the MTU set to 1450 (it is a VXLAN connection). The VXLAN IP range is
172.16.0.0/12
, and you want your containers in Docker Swarm to use the192.168.128.0/18
range:docker swarm init --default-addr-pool 192.168.128.0/18
Then
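Roughly the same ingress recreation as above, with the MTU matching the interface and a subnet taken from the new pool:
docker network rm ingress
systemctl restart docker
docker network create --driver overlay --ingress \
  --opt com.docker.network.driver.mtu=1450 \
  --subnet 192.168.128.0/24 --gateway 192.168.128.1 ingress
systemctl restart docker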
Sadly I didn't have too much time to play with it to check if it really fixed the issue or not, because one day after I started experimenting with Swarm (after getting too burnt out with Kubernetes) I went on vacation. But in my limited testing (using curl to spam my personal website, which has two replicas hosted on two Docker Swarm VMs), some connections were randomly hanging before the change when querying the website, and after the fix the issue went away, so YMMV!
Technically the MTU should also work fine if it were set to 1400 instead of 1280; this is something that I still need to play around with after I'm back from vacation.
Yes, the issue still persists for current wireguard (0.0.20180910-wg1) and docker-ce (18.06.1-ce).
I have 2 nodes; both are active and reachable over internal addresses, but every 2nd request to the docker service fails.
Sadly, I'm stuck at the same point. I could not figure out what blocks requests between the docker nodes.
Not really, I’ve found a job. 😅
Gonna try to reproduce the issue today and report back.