moby: Routing mesh does not work in swarm mode if a node in the swarm cluster is in a different subnet

Description

Steps to reproduce the issue:

  1. Create a three-node swarm cluster with three VMs (VM1, VM2, and VM3). VM1 and VM2 are in the same subnet, but VM3 is in a different subnet.
     VM1 IP: 10.161.42.200
     VM2 IP: 10.161.60.160
     VM3 IP: 10.192.176.97
  2. Create an nginx service on the swarm manager (VM1):
docker service create --name nginx_service --publish 8080:80 nginx

docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE               PORTS
ixjuubqpovud        nginx_service       replicated          1/1                 nginx:latest        *:8080->80/tcp

  3. Run "curl 127.0.0.1:8080" to access the service from VM1, VM2, and VM3.

  4. I have opened all the required ports on the three VMs: 2376/tcp, 2377/tcp, 7946/tcp, 7946/udp, 4789/tcp, 4789/udp.
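
One way to double-check the TCP side of the port list above from another node is a small reachability probe. This is only a sketch: the helper name check_tcp_ports is made up for illustration, it assumes an nc that supports -z and -w (as the OpenBSD variant does), and it says nothing about the UDP ports (7946/udp, 4789/udp), which nc -z cannot probe reliably.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report): probe TCP ports on a peer.
# Assumes an nc that supports -z (connect only, send nothing) and -w (timeout in seconds).
check_tcp_ports() {
    host="$1"
    shift
    for port in "$@"; do
        if nc -z -w 2 "$host" "$port" >/dev/null 2>&1; then
            echo "$host:$port/tcp open"
        else
            echo "$host:$port/tcp closed or filtered"
        fi
    done
}

# Example, run on VM3 against the manager (VM1).
# 2377/tcp is only served by manager nodes; 7946/tcp by every swarm node.
#   check_tcp_ports 10.161.42.200 2377 7946
```

A "closed or filtered" result for a port that is open locally on the target would point at something between the two subnets rather than at the host firewall.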

Describe the results you received: On VM1 and VM2, the nginx service can be accessed correctly.

VM1:

root@sc-rdops-vm18-dhcp-57-89:~# curl 127.0.0.1:8080
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

On VM3, the nginx service cannot be accessed; the curl command timed out.

root@sc-rdops-vm18-dhcp-57-89:~# curl 127.0.0.1:8080
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection timed out

Describe the results you expected: On VM3, the nginx service should be accessible as well.

Additional information you deem important (e.g. issue happens only occasionally): We don’t see this issue if all VMs in the swarm cluster are in the same subnet.

Output of docker version:

root@sc-rdops-vm18-dhcp-57-89:~# docker version
Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:42:18 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:40:56 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

root@sc-rdops-vm18-dhcp-57-89:~# docker info
Containers: 2
 Running: 1
 Paused: 0
 Stopped: 1
Images: 2
Server Version: 17.09.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: r3u8r9wwbog8otva9lmijgd7y
 Is Manager: true
 ClusterID: bidsno23gl3b3dbsh740822xg
 Managers: 1
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.161.42.200
 Manager Addresses:
  10.161.42.200:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-42-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 992.5MiB
Name: sc-rdops-vm18-dhcp-57-89
ID: 4AZ6:SPTC:QECB:MWW6:TAJG:TEUN:EI3O:U6PM:FQCB:2HKM:UVKB:C7LF
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.): VMs running on VMware ESX

About this issue

  • State: open
  • Created 7 years ago
  • Comments: 16 (2 by maintainers)

Most upvoted comments

I had the same issue, and it turned out that something (I’m guessing an external network component, maybe a NIC, switch, or router) was filtering out certain VXLAN traffic on UDP port 4789.

You can confirm this by running tcpdump in the background for UDP port 4789 on the host where you issue the request (e.g. curl) and never hear back from the swarm service running on the host in the other subnet (e.g. nginx). tcpdump captures some traffic when you curl the nginx service in the same subnet, but captures nothing when you curl the nginx service in the different subnet. If you are using a single nginx service whose replicas you know are distributed across hosts in the different subnets, just curl multiple times.
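
The check described above could be scripted roughly like this. It is a sketch, not a tested diagnostic: the function name vxlan_probe, the interface name eth0, the 10-second window, and the five requests are all assumptions, and it needs root plus tcpdump, curl, and the coreutils timeout command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count VXLAN packets seen on this host while curling
# the published service port. Run as root.
vxlan_probe() {
    local iface="${1:-eth0}"                  # assumed interface name
    local port="${2:-4789}"                   # default VXLAN data path port
    local url="${3:-http://127.0.0.1:8080}"   # published port from the report
    local capture pid i
    capture="$(mktemp)"

    # Capture for at most 10 seconds in the background (-n: no name lookups,
    # -l: line-buffered output so nothing is lost when tcpdump is stopped).
    timeout 10 tcpdump -n -l -i "$iface" udp port "$port" >"$capture" 2>/dev/null &
    pid=$!
    sleep 1   # give tcpdump a moment to start

    # Several requests, so the routing mesh has a chance to pick a task
    # on a node in the other subnet.
    for i in 1 2 3 4 5; do
        curl -s -o /dev/null --max-time 3 "$url" || true
    done

    wait "$pid" || true
    echo "captured $(wc -l <"$capture") packets on udp/$port"
    rm -f "$capture"
}

# Example: vxlan_probe eth0 4789 http://127.0.0.1:8080
```

Comparing the count on a node in the same subnet with the count on the node in the other subnet can help show whether the VXLAN packets are being dropped in between.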

My guess is that whatever is filtering the traffic does so if the traffic doesn’t originate from the destination subnet? Just a guess though.

To work around it, create a new swarm and specify a different port for VXLAN traffic (I picked 4790 and it finally worked): docker swarm init --data-path-port 4790. Use netcat to verify that hosts in different subnets can reach each other over UDP on the new port you pick: nc -ul 4790 on one host and nc -u <host> 4790 on the other.
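
The workaround above could be scripted roughly as follows. This is a sketch under assumptions: the function names are invented for illustration, nc is assumed to be the OpenBSD variant, and 4790 is simply the port the commenter picked. The --data-path-port flag is only available in Docker 19.03 and later, so the 17.09.0-ce hosts from the report would need an upgrade first. The swarm commands are printed rather than executed, because recreating the swarm discards the existing cluster state.

```shell
#!/bin/sh
# Hypothetical helpers (names are not from the original comment).

# Receiver side: listen for UDP datagrams on the candidate port.
udp_listen() {
    nc -u -l "${1:-4790}"
}

# Sender side: send one datagram to the listener in the other subnet.
udp_send() {
    echo "vxlan-port-test" | nc -u -w 1 "$1" "${2:-4790}"
}

# Print (do not run) the commands for recreating the swarm on another VXLAN port.
swarm_recreate_plan() {
    port="$1"
    manager_ip="$2"
    echo "# on every node:"
    echo "docker swarm leave --force"
    echo "# on the new manager:"
    echo "docker swarm init --data-path-port $port --advertise-addr $manager_ip"
    echo "# on each remaining node, using the token from 'docker swarm join-token worker':"
    echo "docker swarm join --token <token> $manager_ip:2377"
}

swarm_recreate_plan 4790 10.161.42.200
```

If the text sent with udp_send shows up in the udp_listen terminal on the host in the other subnet, the new port passes between the subnets. The new UDP port also has to be opened in each node's firewall in place of 4789/udp.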