moby: Docker swarm load balancing not working over private network

Description

The problem is probably similar to #25325. Docker can’t reach containers running on hostB when I query hostA’s public address.

I’m using Docker swarm with 2 hosts. They are connected via a wireguard tunnel and can reach each other: I’m able to ping each host from the other using the internal addresses.

Then I initialize swarm mode using the --advertise-addr, --data-path-addr and --listen-addr options, pointing them at the internal addresses as well. Both hosts are visible via docker node ls and active. No errors in syslog.
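Roughly, the commands look like this (a sketch using the tunnel addresses from the wireguard configs below, 10.0.5.1 for the manager and 10.0.5.2 for the worker; the join token is whatever docker swarm init prints):

# on hostA (manager), bind everything to the tunnel address
docker swarm init \
  --advertise-addr 10.0.5.1 \
  --data-path-addr 10.0.5.1 \
  --listen-addr 10.0.5.1:2377

# on hostB (worker), join over the tunnel
docker swarm join --token <worker token> \
  --advertise-addr 10.0.5.2 \
  --data-path-addr 10.0.5.2 \
  --listen-addr 10.0.5.2:2377 \
  10.0.5.1:2377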

But when I create a service with 2 replicas, I see strange behavior: when accessing the service via one of the public IPs, I can only reach the containers running on that particular node. Other requests fail with a timeout.

Steps to reproduce the issue:

  1. Setup wireguard tunnel, check that it works fine.
  2. Setup docker in swarm mode.
  3. Run a service. I’m using this one: agrrh/dummy-service-py. It runs an HTTP service on port 80 and answers with the container’s hostname + a random uuid.
  4. Scale service at least with 2 replicas. (docker service create --name dummy --replicas 2 --publish 8080:80 agrrh/dummy-service-py)
  5. Try to cycle through the replicas by querying the HostA address (see the loop sketch below).
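For step 5, a loop like this is enough (plain curl with a short timeout so the failures show up quickly; host1 and port 8080 come from the service created above, http in the output below is httpie):

for i in 1 2 3 4; do curl -m 5 http://host1:8080/; echo; done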

Describe the results you received:

As I said, requests to containers on other nodes fail:

$ http host1:port
{ "hostname": "containerA" } # this container running at host1
$ http host1:port
http: error: Request timed out (30.0s).

$ http host2:port
http: error: Request timed out (30.0s).
$ http host2:port
{ "hostname": "containerB" } # this container running at host2

Describe the results you expected:

I expect to be able to reach all of running containers by querying public address of any single node.

Additional information you deem important (e.g. issue happens only occasionally):

It seems to me that the wireguard tunnel itself is not the cause, as I am still able to send pings between containers. For example, containerB can reach these containerA addresses:

  • 10.255.0.4 @lo ~0.050 ms (it looks like this traffic doesn’t actually leave host2)
  • 10.255.0.5 @eth0 ~0.700 ms (I can see this with tcpdump on the other end, it’s reachable!)
  • 172.18.0.3 @eth1 ~0.050 ms (this probably doesn’t leave host2 either)

Thanks to --advertise-addr I can see packets flowing between the hosts over the private interface.
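For completeness, this is roughly how the overlay traffic can be watched on the tunnel interface (Docker’s overlay data plane uses VXLAN over UDP port 4789 by default; wg0 is the wireguard interface from the setup guide below):

tcpdump -ni wg0 'udp port 4789'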

I tried to install ntp and sync the clocks, but this did not help.

I also attempted various fixes (e.g. turning off masquerading, re-creating the default bridge with a lower MTU, setting the default bind IP, etc.), but had no luck.
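For reference, lowering the default bridge MTU was along these lines (a sketch, not necessarily the exact steps I used; 1420 here just matches the wireguard MTU):

# /etc/docker/daemon.json
{
  "mtu": 1420
}

# then restart the daemon to re-create the default bridge with the new MTU
systemctl restart docker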

I have already reproduced the issue 3 times with a clean setup and am ready to provide collaborators with access to my test hosts if you would like to investigate on-site.

Output of docker version:

Same on both hosts:

Client:
 Version:	18.03.0-ce
 API version:	1.37
 Go version:	go1.9.4
 Git commit:	0520e24
 Built:	Wed Mar 21 23:10:01 2018
 OS/Arch:	linux/amd64
 Experimental:	false
 Orchestrator:	swarm

Server:
 Engine:
  Version:	18.03.0-ce
  API version:	1.37 (minimum version 1.12)
  Go version:	go1.9.4
  Git commit:	0520e24
  Built:	Wed Mar 21 23:08:31 2018
  OS/Arch:	linux/amd64
  Experimental:	false

Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 18.03.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: rdwi6u922eb93s3z3cq1vuih1
 Is Manager: true
 ClusterID: g8urrtm78sc68oro86k3wvjzf
 Managers: 1
 Nodes: 2
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.0.5.1
 Manager Addresses:
  10.0.5.1:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.13.0-37-generic
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 481.8MiB
Name: test1
ID: IS5W:2U5W:XDAE:UXIF:KXRR:FQSU:PI7K:UXEQ:OOHK:HC4O:TLZR:P4UU
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

Wireguard setup guide (assuming you installed it):

### Server

cd /etc/wireguard
umask 077
wg genkey | tee server_private_key | wg pubkey > server_public_key

# /etc/wireguard/wg0.conf 
[Interface]
Address = 10.0.5.1/32
SaveConfig = true
PrivateKey = <paste server private key here>
ListenPort = 51820

[Peer]
PublicKey = <paste client public key here>
AllowedIPs = 10.0.5.2/32

wg-quick up wg0

### Client

cd /etc/wireguard
umask 077
wg genkey | tee client_private_key | wg pubkey > client_public_key

# /etc/wireguard/wg0.conf 
[Interface]
Address = 10.0.5.2/32
PrivateKey = <paste client private key here>

[Peer]
PublicKey = <paste server public key here>
Endpoint = <paste server IP here>:51820
AllowedIPs = 10.0.5.0/24

wg-quick up wg0

The servers should be reachable via their internal addresses moments after these steps.
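A quick sanity check after bringing the tunnel up (10.0.5.2 is the client address from the config above; run the ping from the server, or ping 10.0.5.1 from the client):

wg show wg0
ping -c 3 10.0.5.2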

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Reactions: 5
  • Comments: 26 (2 by maintainers)

Most upvoted comments

Managed to fix it for me.

It's always MTU

Since requests to the webserver in my swarm are handed over by the ingress LB but the responses were timing out, I started testing whether it had something to do with the size of the packets being transferred. While playing with ping -s $packetsize I realized that packets bigger than (if I remember correctly) 1420 bytes were dropped.

Wireguard uses an MTU of 1420 and Docker uses an MTU of 1500 by default.

After lowering the MTU of the Ingress Overlay Network everything is working as expected for me.

All steps are done on a manager node:

  1. Stop all running services in your swarm
  2. docker network rm ingress (remove the ingress overlay network, since we are creating a new one with a lower MTU)
  3. Create a new ingress overlay network with your preferred subnet & gateway:
docker network create --driver overlay --ingress --opt com.docker.network.driver.mtu=1280 --subnet 10.11.0.0/24 --gateway 10.11.0.1 ingress
  4. systemctl restart docker (I don’t know if I was just not patient enough, but that propagated the newly created network to all other nodes)
  5. Bring back your services
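To confirm the option actually made it onto the new ingress network, the MTU can be read back with a format expression like this (a sketch):

docker network inspect ingress --format '{{ index .Options "com.docker.network.driver.mtu" }}'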

At this point everything was working for me. I’m pretty sure an MTU of 1280 is a bit too low in this case, but I’m still testing.

Hope I could help some other people facing the same problem.

Greetings

Another important thing that I’ve noticed!

While recreating the ingress with the correct MTU fixes Outside World -> Swarm Cluster communication, it does not fix communication within your cluster!

I have an app that routes traffic between four of my containers, and I have two Swarm nodes:

  • Node 1: http-coordinator, container1, container2
  • Node 2: container3, container4

I was having an issue where, sometimes, requests to container3 and container4 were timing out, while requests to container1 and container2 weren’t.

19:12:36.710 [eventLoopGroupProxy-4-1] WARN  n.p.l.c.m.i.InteractionsHttpCoordinator - Something went wrong while trying to forward the request!
io.ktor.client.plugins.HttpRequestTimeoutException: Request timeout has expired [url=http://cinnamon-production-2:12212, request_timeout=unknown ms]
        at io.ktor.client.engine.cio.EndpointKt$setupTimeout$timeoutJob$1.invokeSuspend(Endpoint.kt:247)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
        at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:570)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:749)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:677)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:664)

And the issue is… drum roll https://memegenerator.net/img/instances/37438597.jpg

To fix this, you need to set the MTU in your docker-compose.yml files too!

networks:
  default:
    driver: overlay
    driver_opts:
      com.docker.network.driver.mtu: 1450

Replace 1450 with your network interface’s MTU! :3

Source
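For context, a minimal stack file sketch with that network block wired up; the service name web and the nginx image are placeholders, not from the comment above:

version: "3.8"

services:
  web:
    image: nginx
    deploy:
      replicas: 2
    ports:
      - "8080:80"

networks:
  default:
    driver: overlay
    driver_opts:
      com.docker.network.driver.mtu: 1450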

OK, I have changed my wireguard MTU to 1500, which is the MTU of Docker and of my openstack interface, and it’s working too. So I hope it won’t split the packets now that I have the same MTU everywhere 😃
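For reference, a sketch of how the WireGuard MTU can be pinned with wg-quick (based on the wg0.conf from the reproduction steps above; MTU is a standard [Interface] option, and 1500 is the value this comment refers to):

# /etc/wireguard/wg0.conf
[Interface]
Address = 10.0.5.1/32
PrivateKey = <paste server private key here>
ListenPort = 51820
MTU = 1500

# re-apply the config
wg-quick down wg0 && wg-quick up wg0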

Yes, try Wireguard, it’s really simple, just a little time-consuming to administer. As a DevOps person, I like it when it’s automated like swarm 😄. Maybe if you have issues with wireguard, I could help you like you did for me 👍

Just a question about MTU: could I set the value to something else, like 1300 or 1370? I always see values like 1280 or 1400 on forums; is there a specific rule to calculate the MTU? Also, could we change the wireguard MTU from 1420 to 1500? I already tested that, but it doesn’t work in my case…

Haha, right when you were typing your comment, I was updating my previous comment with newer things that I found while playing around with the MTU size. :3

The issue is that Docker uses MTU 1500 by default and it doesn’t inherit the MTU set by your network interface.

The MTU that you are setting in the ingress network should be the same MTU set in your network interface!

To get the MTU of your network, use ip a and find the interface that you have set as the advertise-addr parameter (if you haven’t set it, it is the default network interface)

3: ens19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether ...

In this case, we can see that the MTU of my network is 1450, so we are going to set our ingress network to also be 1450!

So, check what MTU Wireguard is using with ip a (the mtu xxxx field), then use that value when creating your ingress network. It should, hopefully, work fine without any issues. 😃

(At least on my side it worked fine after I bumped from 1400 to 1450. My network interface has its MTU set to 1450 because it is a VXLAN network)
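A couple of quick ways to read that value non-interactively (replace ens19 with whatever your advertise-addr interface is, e.g. wg0 for a wireguard tunnel):

cat /sys/class/net/ens19/mtu
ip -o link show dev ens19 | grep -o 'mtu [0-9]*'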

OK, with your setup it’s working!! 🥇 Thanks for everything. My multi-cloud swarm over wireguard is working well now 😃

@bdoublet91 I’m back from my vacation, so I decided to play around a bit more:

I created a Swarm with sudo docker swarm init --default-addr-pool 192.168.128.0/18 --advertise-addr 172.29.10.1

After creation, I inspected the overlay network to check what it looks like before we start to mess around with it.

swarm@docker-swarm-manager-1:~$ sudo docker network ls
NETWORK ID     NAME              DRIVER    SCOPE
a54488e50cb3   bridge            bridge    local
5c9af67afb9a   docker_gwbridge   bridge    local
1cdb51e53a49   host              host      local
ll9lfov7eu7i   ingress           overlay   swarm
90ec800e99d5   none              null      local
6ec10d2e1af0   swarm_default     bridge    local
swarm@docker-swarm-manager-1:~$ sudo docker inspect ingress
[
    {
        "Name": "ingress",
        "Id": "ll9lfov7eu7i6wodq5xsrc0yn",
        "Created": "2022-08-07T17:04:21.685399396Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "192.168.128.0/24",
                    "Gateway": "192.168.128.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": true,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "ingress-sbox": {
                "Name": "ingress-endpoint",
                "EndpointID": "034a4390f781d05b88b8b6b9c52d35a9b407a0b90b4c58cb9c171cdb637ff0c2",
                "MacAddress": "02:42:c0:a8:80:02",
                "IPv4Address": "192.168.128.2/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4096"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "108faab843d1",
                "IP": "172.29.10.1"
            },
            {
                "Name": "078912a29389",
                "IP": "172.29.11.1"
            }
        ]
    }
]

I added a worker node to my swarm and deployed my personal website with two replicas, so both VMs (manager and worker) had one container running my website.

Tried curl’ing my website from outside of my swarm VM…

root@doge-reborn:~# curl 10.29.10.1:40000/br/
^C <--- it hung, so I exited curl
root@doge-reborn:~# curl 10.29.10.1:40000/br/
<!DOCTYPE html>
<html lang="pt">
  <head>
    <meta charset="utf-8">
root@doge-reborn:~# curl 10.29.10.1:40000/br/
^C <--- once again, rip
root@doge-reborn:~# curl 10.29.10.1:40000/br/
^C <--- *picture of a sad cat here*
root@doge-reborn:~# curl 10.29.10.1:40000/br/
<!DOCTYPE html>
<html lang="pt">
  <head>
...

So as we can see, it isn’t working correctly: Some requests work fine, some requests hang indefinitely…

Outdated incorrect explanation about the issue:

The Swarm VMs are communicating on a VXLAN connection and then ON TOP of that Docker is communicating via VXLAN too, so this could be the reason why it isn't working! VXLAN requires a 50 byte header for traffic routing, that's why the MTU is set to 1450 on VXLAN connections instead of the default 1500, so let's do some calculations here:
  • Default 1500 MTU
  • -50 bytes (Docker VXLAN)
  • = 1450 bytes

Which is OK, but what if we are using a VXLAN connection for node communication too? (Example: Your VMs are connected via VXLAN)

  • Default 1500 MTU
  • -50 bytes (Your VXLAN Connection)
  • -50 bytes (Docker VXLAN)
  • = 1400 bytes

But… our Docker ingress connection expects 1450 bytes! So maybe this is what is causing the issue! Let’s change the ingress network from the default MTU (I’m not sure, but I guess it is 1450 bytes) to 1400 bytes!

Docker, by default, uses 1500 MTU for ingress communication, it doesn’t inherit the MTU configuration of the network interface. So, what we need to do is change the MTU of the ingress to match the network interface that we are using for intra node communication!

To get the MTU of your network, use ip a and find the interface that you have set as the advertise-addr parameter (if you haven’t set it, it is the default network interface)

3: ens19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether ...

In this case, we can see that the MTU of my network is 1450, so we are going to set our ingress network to also be 1450!

First, let’s inspect the ingress network again

swarm@docker-swarm-manager-1:~$ sudo docker inspect ingress
[
    {
        "Name": "ingress",
        "Id": "ll9lfov7eu7i6wodq5xsrc0yn",
        "Created": "2022-08-07T17:04:21.685399396Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "192.168.128.0/24",
                    "Gateway": "192.168.128.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": true,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "421fbe812ee7194cee388c5f2869024b2b9a9fa8ba7ee9c2f84d6d13eb85c9be": {
                "Name": "powercms_powercms.2.lrn7pwe6wyw3rmzdrw6mxmf3l",
                "EndpointID": "247336a13f52f6d0fc26996e7c9dbb6f875a3df0c444a4c70fa80c2d6754e510",
                "MacAddress": "02:42:c0:a8:80:06",
                "IPv4Address": "192.168.128.6/24",
                "IPv6Address": ""
            },
            "ingress-sbox": {
                "Name": "ingress-endpoint",
                "EndpointID": "034a4390f781d05b88b8b6b9c52d35a9b407a0b90b4c58cb9c171cdb637ff0c2",
                "MacAddress": "02:42:c0:a8:80:02",
                "IPv4Address": "192.168.128.2/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4096"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "108faab843d1",
                "IP": "172.29.10.1"
            },
            {
                "Name": "078912a29389",
                "IP": "172.29.11.1"
            }
        ]
    }
]

There are new things here, but we can safely ignore them.

First, we need to remove all running stacks, then…

  • docker network rm ingress
  • (You may need to restart Docker with systemctl restart docker)
  • docker network create --driver overlay --ingress --opt com.docker.network.driver.mtu=1450 --subnet 192.168.128.0/24 --gateway 192.168.128.1 ingress
    • The subnet and gateway are in the IPAM section of the inspect output above!
  • Restart Docker with systemctl restart docker
  • Restart Docker on every node

Now let’s inspect the network again

swarm@docker-swarm-manager-1:~$ sudo docker inspect ingress
[
    {
        "Name": "ingress",
        "Id": "n4kc7x5bsilk13w7r8zlqcxmk",
        "Created": "2022-08-07T17:16:43.164266372Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "192.168.128.0/24",
                    "Gateway": "192.168.128.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": true,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "ingress-sbox": {
                "Name": "ingress-endpoint",
                "EndpointID": "905eb24af09900820bb24b3ca831b16e60a2c154dea519a01e11785247229e06",
                "MacAddress": "02:42:c0:a8:80:02",
                "IPv4Address": "192.168.128.2/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.mtu": "1400",
            "com.docker.network.driver.overlay.vxlanid_list": "4097"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "37c1d2b7ef95",
                "IP": "172.29.10.1"
            },
            {
                "Name": "93cf234fada7",
                "IP": "172.29.11.1"
            }
        ]
    }
]

The configuration should be similar to your previous configuration. Don’t worry if the com.docker.network.driver.overlay.vxlanid_list changed, that’s used for VXLAN tagging. (And you can set it to 4096 by using --opt com.docker.network.driver.overlay.vxlanid_list=4096 when creating the network)

And then everything should work fine! I tried flooding curl 10.29.10.1:40000/br/ and none of the requests failed.
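A rough sketch of that flood test (same address and path as above; -m caps each request at 5 seconds so a hang shows up as an error instead of blocking forever):

for i in $(seq 1 50); do
  curl -s -o /dev/null -m 5 -w '%{http_code}\n' 10.29.10.1:40000/br/
done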

@bdoublet91 let’s suppose you have a Linux network interface that has the MTU set to 1450 (it is a VXLAN connection). The VXLAN IP range is 172.16.0.0/12, and you want your containers in Docker Swarm to use the 192.168.128.0/18 range:

docker swarm init --default-addr-pool 192.168.128.0/18

Then

docker network rm ingress
docker network create --driver overlay --ingress --opt com.docker.network.driver.mtu=1280 --subnet 192.168.128.0/18 --gateway 192.168.128.1 ingress
systemctl restart docker

Sadly I didn’t have much time to play with it and check whether it really fixed the issue, because one day after experimenting with Swarm (after getting too burnt out with Kubernetes) I went on vacation. My only testing so far was using curl to spam my personal website, which has two replicas hosted on two Docker Swarm VMs.

Before the change some connections were randomly hanging up when trying to query the website, and after the fix the issue went away, so ymmv!

Technically the MTU should also work fine if it were set to 1400 instead of 1280; this is something that I still need to play around with after I’m back from vacation.

Yes, the issue still persists for current wireguard (0.0.20180910-wg1) and docker-ce (18.06.1-ce).

I have 2 nodes, both are active and reachable over the internal addresses, but every 2nd request to the docker service fails.

Sadly, I’m stuck at the same point. I could not figure out what blocks requests between docker nodes.

Not really, I’ve found a job. 😅

Gonna try to reproduce the issue today and report back.