moby: "failed to handle address change" on swarm manager reboot (5 nodes, 3 managers) - never recovers, causes: "grpc: the connection is unavailable"
Description
In a 5-node swarm cluster:
- 3 managers (nodes A, B, and C)
- 2 workers (nodes D and E)
All of the nodes are on RFC 1918 address space with a 1:1 mapping to public IPs.
The swarm was initialized on node A with:
docker swarm init --advertise-addr A-PUBLIC-IP --listen-addr a-10x:2377
Managers B and C were joined with:
docker swarm join \
--token SWMTKN-manager-token \
--advertise-addr B-PUBLIC-IP --listen-addr 0.0.0.0:2377 A-PUBLIC-IP:2377
and
docker swarm join \
--token SWMTKN-manager-token \
--advertise-addr C-PUBLIC-IP --listen-addr 0.0.0.0:2377 A-PUBLIC-IP:2377
Workers D and E were joined with:
docker swarm join \
--token SWMTKN-worker-token \
--advertise-addr D-PUBLIC-IP --listen-addr 0.0.0.0:2377 A-PUBLIC-IP:2377
and
docker swarm join \
--token SWMTKN-worker-token \
--advertise-addr E-PUBLIC-IP --listen-addr 0.0.0.0:2377 A-PUBLIC-IP:2377
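The join commands above all follow one pattern: each node advertises its own public IP, listens on 0.0.0.0:2377, and joins through node A's public address. A minimal dry-run sketch of that pattern is below; the helper name `print_join_cmd` is hypothetical, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the join command used in this report.
# $1: join token, $2: public IP of the joining node, $3: public IP of node A
print_join_cmd() {
    printf 'docker swarm join --token %s --advertise-addr %s --listen-addr 0.0.0.0:2377 %s:2377\n' \
        "$1" "$2" "$3"
}

# Placeholders are the same ones used in the commands above.
print_join_cmd SWMTKN-manager-token B-PUBLIC-IP A-PUBLIC-IP
print_join_cmd SWMTKN-worker-token D-PUBLIC-IP A-PUBLIC-IP
```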
After rebooting any of the managers (including a non-leader), the following warning is logged:
level=warning msg="detected address change for 412c742ead14c77f (PUBLIC-IP-of-REBOOTED-NODE:2377 -> PRIVATE-GATEWAY-OF-NODES:2377)" raft_id=321d772168243d4b
Steps to reproduce the issue:
- Follow the steps above to create the swarm.
- Reboot any manager.
- After the reboot, the node comes back up but does not rejoin the swarm; running "docker node ls" on it fails with an error.
Describe the results you received:
level=warning msg="detected address change for 412c742ead14c77f (A-publicIP:2377 -> A-private-gateway:2377)" raft_id=321d772168243d4b
An example:
level=warning msg="detected address change for 412c742ead14c77f (104.281.24.14.23:2377 -> 10.10.10.1:2377)" raft_id=321d772168243d4b
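When scanning daemon logs for this warning, it can help to pull out just the old and new addresses. The sketch below assumes the line format is exactly the one quoted in this report; the function name `parse_addr_change` is hypothetical and not part of Docker.

```shell
#!/bin/sh
# Hypothetical helper: print "OLD NEW" for a "detected address change" warning.
# Prints nothing if the line does not match the format quoted in this report.
parse_addr_change() {
    # $1: a single log line
    printf '%s\n' "$1" |
        sed -n 's/.*detected address change for [0-9a-f]* (\([^ ]*\) -> \([^)]*\)).*/\1 \2/p'
}

line='level=warning msg="detected address change for 412c742ead14c77f (A-publicIP:2377 -> A-private-gateway:2377)" raft_id=321d772168243d4b'
parse_addr_change "$line"
# prints: A-publicIP:2377 A-private-gateway:2377
```

If the first field is the advertised public address and the second is a private one, the manager is being seen from an address other than the one it advertised, which is the situation described here.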
Describe the results you expected:
The rebooted manager rejoins the swarm and continues operating.
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
# docker version
Client:
Version: 1.13.0
API version: 1.25
Go version: go1.7.3
Git commit: 49bf474
Built: Tue Jan 17 09:58:26 2017
OS/Arch: linux/amd64
Server:
Version: 1.13.0
API version: 1.25 (minimum version 1.12)
Go version: go1.7.3
Git commit: 49bf474
Built: Tue Jan 17 09:58:26 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
# docker info
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 22
Server Version: 1.13.0
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 35
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: active
NodeID: eudw4ig52prutxf0dn0wv4209
Is Manager: true
ClusterID: y942ys43rlguyzenkv69naicc
Managers: 3
Nodes: 5
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: PUB-IP-NODE-A
Manager Addresses:
PUB-IP-NODE-A:2377
PUB-IP-NODE-B:2377
PUB-IP-NODE-C:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-59-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.859 GiB
Name: $hostname
ID: 4RZN:TIYL:SE4A:OLD6:MPOU:XGT7:7UYX:PRZ4:YXUT:BELD:Y24H:EHXE
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
nfs=yes
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.): These are VMs running on physical hardware in a datacenter. The VMs are addressed in RFC 1918 space, with a 1:1 mapping to public IPs.
Because some of the VMs are in other datacenters, the public IP has to be used for --advertise-addr.
Two of the nodes are behind the same switch, so they can reach each other directly on their 10.x addresses; this seems to be part of what triggers the bug.
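To tell whether an address that shows up in the warning is a private one (like the 10.x gateway address above) rather than the advertised public IP, a simple prefix check is enough. This is a minimal sketch: the function name `is_rfc1918` is hypothetical, it matches on prefixes only and does not validate that its argument is a well-formed IPv4 address, and 8.8.8.8 is used purely as an example of a public address.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the dotted-quad address in $1 falls in
# RFC 1918 space (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), else return 1.
is_rfc1918() {
    case "$1" in
        10.*) return 0 ;;
        192.168.*) return 0 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

for addr in 10.10.10.1 8.8.8.8; do
    if is_rfc1918 "$addr"; then
        echo "$addr private"
    else
        echo "$addr public"
    fi
done
```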
About this issue
- State: closed
- Created 7 years ago
- Comments: 21 (10 by maintainers)
After some discussion, I think the best way forward with the address change issue is to disable automatic address change detection in the next Docker 1.13 patch release, and aim to re-enable it in 1.14 with additional safeguards (for example, allowing a manager to have multiple addresses associated with it). Address change handling is not particularly useful on its own, because dynamic addresses are not supported by overlay networking, but it was intended as a best-effort way to avoid losing quorum accidentally, and as an incremental first step towards full dynamic addressing support. I’ll be opening a PR in swarmkit to disable the feature for the moment.