moby: Routing mesh does not work in swarm mode if node in a swarm cluster is in different subnet
Description
Steps to reproduce the issue:
- Create a three-node swarm cluster with three VMs (VM1, VM2, and VM3). VM1 and VM2 are in the same subnet, but VM3 is in a different subnet. VM1 IP: 10.161.42.200, VM2 IP: 10.161.60.160, VM3 IP: 10.192.176.97
- Create an nginx service on the swarm manager (VM1)
docker service create --name nginx_service --publish 8080:80 nginx
docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
ixjuubqpovud nginx_service replicated 1/1 nginx:latest *:8080->80/tcp
- Run “curl 127.0.0.1:8080” to access the service from VM1, VM2, and VM3
- I have opened all the required ports on those three VMs: 2376/tcp 2377/tcp 7946/tcp 7946/udp 4789/tcp 4789/udp
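A quick way to double-check the TCP side of that port list from another node is sketched below. It is only an illustration: the function name check_tcp_ports is made up here, and it assumes an nc build that supports -z (connect without sending data) and -w (timeout in seconds). A port is also reported as unreachable when nothing is listening on it, and the UDP entries (7946/udp, 4789/udp) are not covered by this check.

```shell
# Hypothetical helper: report whether each TCP port on a host accepts connections.
# Usage: check_tcp_ports <host> <port> [<port> ...]
check_tcp_ports() {
  host="$1"
  shift
  for port in "$@"; do
    # -z: only test the connection, -w 2: give up after 2 seconds
    if nc -z -w 2 "$host" "$port" >/dev/null 2>&1; then
      echo "$host:$port reachable"
    else
      echo "$host:$port unreachable"
    fi
  done
}

# Example (from VM3, checking the manager):
#   check_tcp_ports 10.161.42.200 2377 7946
```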
Describe the results you received: On VM1 and VM2, I can access the nginx service correctly.
VM1:
root@sc-rdops-vm18-dhcp-57-89:~# curl 127.0.0.1:8080
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
body {
width: 35em;
margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif;
}
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>
<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>
<p><em>Thank you for using nginx.</em></p>
</body>
</html>
On VM3, I cannot access the nginx service; the curl command timed out:
root@sc-rdops-vm18-dhcp-57-89:~# curl 127.0.0.1:8080
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection timed out
Describe the results you expected: On VM3, I should be able to access the nginx service.
Additional information you deem important (e.g. issue happens only occasionally): We don’t see this issue if all VMs in the swarm cluster are in the same subnet.
Output of docker version:
root@sc-rdops-vm18-dhcp-57-89:~# docker version
Client:
Version: 17.09.0-ce
API version: 1.32
Go version: go1.8.3
Git commit: afdb6d4
Built: Tue Sep 26 22:42:18 2017
OS/Arch: linux/amd64
Server:
Version: 17.09.0-ce
API version: 1.32 (minimum version 1.12)
Go version: go1.8.3
Git commit: afdb6d4
Built: Tue Sep 26 22:40:56 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
root@sc-rdops-vm18-dhcp-57-89:~# docker info
Containers: 2
Running: 1
Paused: 0
Stopped: 1
Images: 2
Server Version: 17.09.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: r3u8r9wwbog8otva9lmijgd7y
Is Manager: true
ClusterID: bidsno23gl3b3dbsh740822xg
Managers: 1
Nodes: 3
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 10.161.42.200
Manager Addresses:
10.161.42.200:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-42-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 992.5MiB
Name: sc-rdops-vm18-dhcp-57-89
ID: 4AZ6:SPTC:QECB:MWW6:TAJG:TEUN:EI3O:U6PM:FQCB:2HKM:UVKB:C7LF
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Additional environment details (AWS, VirtualBox, physical, etc.): VMs running on VMware ESX
About this issue
- State: open
- Created 7 years ago
- Comments: 16 (2 by maintainers)
I had the same issue, and it turned out that something (I’m guessing an external network component, maybe a NIC, switch, or router) was filtering out certain VXLAN traffic on port 4789.
You can confirm this by running a background tcpdump for udp port 4789 on the host where you run the client (e.g. curl) that never hears back from the swarm service on the host in the different subnet (e.g. nginx). You will see that tcpdump captures some traffic when you curl the nginx service in the same subnet, but captures nothing when you curl the nginx service in the different subnet. If you are using a single nginx service with replicas that you know are distributed to hosts in the different subnets, just curl multiple times.
My guess is that whatever is filtering the traffic does so if the traffic doesn’t originate from the destined subnet. Just a guess, though.
To work around it, create a new swarm and specify a different port for VXLAN traffic (I picked 4790 and it finally worked): docker swarm init --data-path-port 4790. Use netcat to verify that hosts from different subnets can reach each other over UDP on the new port you pick: nc -ul 4790 on one host and nc -u <host> 4790 on the other.
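The tcpdump and netcat checks described above could be wrapped in a few shell helpers, roughly as follows. This is a sketch under assumptions: the helper names (run, capture_vxlan, udp_listen, udp_send), the DRY_RUN switch, and the default capture interface "any" are made up for illustration and are not part of Docker or of the steps above. Nothing executes until a function is called.

```shell
# Hypothetical wrapper: with DRY_RUN=1, print the command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Watch for VXLAN packets on the host where curl is run.
# $1: UDP port (default 4789), $2: interface (default "any", an assumption).
capture_vxlan() {
  port="${1:-4789}"
  iface="${2:-any}"
  run tcpdump -n -i "$iface" udp port "$port"
}

# Listener side of the UDP test; run on the host in the other subnet.
udp_listen() {
  run nc -ul "$1"
}

# Sender side of the UDP test; type a line and check it shows up on the listener.
# $1: remote host, $2: UDP port
udp_send() {
  run nc -u "$1" "$2"
}

# Example:
#   capture_vxlan 4789 &          # on the curl host, then curl the service
#   udp_listen 4790               # on one host
#   udp_send <host> 4790          # on the other host
```

If tcpdump shows packets leaving on the chosen port but the listener in the other subnet never receives the netcat test line, that points at filtering somewhere between the two hosts rather than at Docker itself.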