moby: "failed to handle address change" on swarm manager reboot (5 nodes, 3 managers) - never recovers, causes: "grpc: the connection is unavailable"

Description

In a 5-node swarm cluster:

3 managers (nodes A, B, and C)
2 workers (nodes D and E)

All of the nodes are on RFC 1918 private address space, each with a 1:1 mapping to a public IP.

The swarm is initialized on node A with:

docker swarm init --advertise-addr A-PUBLIC-IP --listen-addr a-10x:2377

Managers B and C join with:

docker swarm join \
--token SWMTKN-manager-token \
--advertise-addr B-PUBLIC-IP --listen-addr 0.0.0.0:2377 A-PUBLIC-IP:2377

and

docker swarm join \
--token SWMTKN-manager-token \
--advertise-addr C-PUBLIC-IP --listen-addr 0.0.0.0:2377 A-PUBLIC-IP:2377

Workers D and E join with:

docker swarm join \
--token SWMTKN-worker-token \
--advertise-addr D-PUBLIC-IP --listen-addr 0.0.0.0:2377 A-PUBLIC-IP:2377

and

docker swarm join \
--token SWMTKN-worker-token \
--advertise-addr E-PUBLIC-IP --listen-addr 0.0.0.0:2377 A-PUBLIC-IP:2377

When you reboot any of the managers (including a non-leader), the following warning is logged:

level=warning msg="detected address change for 412c742ead14c77f (PUBLIC-IP-of-REBOOTED-NODE:2377 -> PRIVATE-GATEWAY-OF-NODES:2377)" raft_id=321d772168243d4b

Steps to reproduce the issue:

Follow the steps above to create the swarm.
Reboot any manager.
After the reboot the node comes back up, but it does not rejoin the swarm, and "docker node ls" on it errors out.
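One way to check the last step without watching the console is to poll "docker node ls" on the rebooted manager. This is only a minimal sketch: the function name wait_for_manager and its attempt count and delay arguments are hypothetical and not part of the original report.

```shell
# Hypothetical helper: poll `docker node ls` until it succeeds.
# Usage: wait_for_manager [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 once the manager answers, 1 if it never does.
wait_for_manager() {
    attempts=${1:-10}
    delay=${2:-3}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # `docker node ls` only succeeds on a manager that can reach the raft cluster.
        if docker node ls >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, "wait_for_manager 20 5 || echo 'manager did not recover'" run on the rebooted node would report the failure described here after roughly 100 seconds.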

Describe the results you received:

level=warning msg="detected address change for 412c742ead14c77f (A-publicIP:2377 -> A-private-gateway:2377)" raft_id=321d772168243d4b

An example:

level=warning msg="detected address change for 412c742ead14c77f (104.281.24.14.23:2377 -> 10.10.10.1:2377)" raft_id=321d772168243d4b
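To pull these warnings out of a longer daemon log, the line can be parsed with sed. The sketch below assumes the message format is exactly as quoted above; the function name extract_address_changes is hypothetical.

```shell
# Hypothetical helper: read daemon log lines on stdin and print
# "<raft member id> <old address> <new address>" for every
# "detected address change" warning; all other lines are dropped.
extract_address_changes() {
    sed -n 's/.*detected address change for \([0-9a-f]*\) (\([^ ]*\) -> \([^ )]*\)).*/\1 \2 \3/p'
}
```

Assuming the daemon runs under systemd as docker.service, something like "journalctl -u docker.service | extract_address_changes" would list each change as one line.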

Describe the results you expected:

The rebooted manager rejoins the swarm and continues operating.

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

# docker version
Client:
 Version:      1.13.0
 API version:  1.25
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Tue Jan 17 09:58:26 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.0
 API version:  1.25 (minimum version 1.12)
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Tue Jan 17 09:58:26 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

# docker info
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 22
Server Version: 1.13.0
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 35
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: eudw4ig52prutxf0dn0wv4209
 Is Manager: true
 ClusterID: y942ys43rlguyzenkv69naicc
 Managers: 3
 Nodes: 5
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: PUB-IP-NODE-A
 Manager Addresses:
  PUB-IP-NODE-A:2377
  PUB-IP-NODE-B:2377
  PUB-IP-NODE-C:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-59-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.859 GiB
Name: $hostname
ID: 4RZN:TIYL:SE4A:OLD6:MPOU:XGT7:7UYX:PRZ4:YXUT:BELD:Y24H:EHXE
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 nfs=yes
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

This is running on VMs on physical hardware at a datacenter. The VMs are on RFC 1918 space, each with a 1:1 mapping to a public IP.

Because some of the VMs are at other datacenters, the public IP has to be used for --advertise-addr.

Two of the nodes are behind the same switch, so they can reach each other directly on their 10.x addresses; this seems to be part of what triggers the bug.
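When reading the logs it helps to tell quickly whether an address is in RFC 1918 space (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). The following is a rough sketch, not a validator: the function name is_rfc1918 is hypothetical, and it only matches on the leading octets of a dotted IPv4 address.

```shell
# Hypothetical helper: exit 0 if the IPv4 address in $1 (optionally followed
# by ":port") starts with an RFC 1918 private prefix, exit 1 otherwise.
# It does not check that $1 is a well-formed address.
is_rfc1918() {
    addr=${1%%:*}    # strip an optional ":port" suffix
    case "$addr" in
        10.*) return 0 ;;                                       # 10.0.0.0/8
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;      # 172.16.0.0/12
        192.168.*) return 0 ;;                                  # 192.168.0.0/16
        *) return 1 ;;
    esac
}
```

A change from an address that fails this check to one that passes it, as in the warning above, means a peer was seen on its private path rather than on its advertised public address.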

About this issue

State: closed
Created 7 years ago
Comments: 21 (10 by maintainers)

Most upvoted comments

After some discussion, I think the best way forward with the address change issue is to disable automatic address change detection in the next Docker 1.13 patch release, and aim to re-enable it in 1.14 with additional safeguards (perhaps, for example, allowing a manager to have multiple addresses associated with it). The address change handling is not particularly useful on its own, because dynamic addresses are not supported by overlay networking, but it was intended as a best-effort way to avoid losing quorum accidentally, and an incremental first step towards full dynamic addressing support. I’ll be opening a PR in swarmkit to disable the feature for the moment.

aaronlehmann on Jan 30, 2017