moby: Swarm 17.07.0 unable to join nodes

Description

Hi,

Upgrading from docker-ce 17.06.2 to docker-ce 17.07.0 broke my cluster. My manager is up and running, but worker (or manager) nodes cannot join it.

Steps to reproduce the issue:

  1. install docker-ce-17.07.0 on 2 VMs (say M and W)
  2. run docker swarm init --advertise-addr=<M_public_ip>:2377 on M (get the worker join command)
  3. run the join command on W

Describe the results you received:

Error response from daemon: rpc error: code = Unavailable desc = grpc: the connection is unavailable

Describe the results you expected:

This node joined a swarm as a worker.

Additional information you deem important (e.g. issue happens only occasionally):

I used a Terraform+Ansible setup that has successfully created clusters many times. The issue even appears on an existing cluster after upgrading to 17.07.0.

Output of docker version:

Client:
 Version:      17.07.0-ce
 API version:  1.31
 Go version:   go1.8.3
 Git commit:   8784753
 Built:        Tue Aug 29 17:42:01 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.07.0-ce
 API version:  1.31 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   8784753
 Built:        Tue Aug 29 17:43:23 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 17.07.0-ce
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: eqt38cc9zdy5i15yflndsj0ze
 Is Manager: true
 ClusterID: u7yvqf60ew3r1n867bjstz2bh
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Root Rotation In Progress: false
 Node Address: 10.90.251.200
 Manager Addresses:
  10.90.251.200:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3addd840653146c90a254301d6c3a663c7fd6429
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.21.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.797GiB
Name: swarm-mode-latest-master-0
ID: W4TM:QHM2:MWD4:52SK:XMIM:TJA5:2NCX:4WAB:77C5:TF66:WVKX:XGTD
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Http Proxy: http://some-proxy:3128
Https Proxy: http://some-proxy:3128
No Proxy: localhost,172.17.42.1,.sock
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

The swarm mode cluster runs on KVM VMs.

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 2
  • Comments: 30 (9 by maintainers)

Most upvoted comments

I fixed this on CentOS7 by adding firewalld rules on the manager nodes:

sudo firewall-cmd --add-port=2376/tcp --permanent  
sudo firewall-cmd --add-port=2377/tcp --permanent  
sudo firewall-cmd --add-port=7946/tcp --permanent  
sudo firewall-cmd --add-port=7946/udp --permanent  
sudo firewall-cmd --add-port=4789/udp --permanent
sudo systemctl restart firewalld
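
To confirm the rules took effect after the restart, `firewall-cmd` can list the opened ports (this assumes firewalld is the active firewall on the node):

```shell
# List the ports currently opened in the default zone; the swarm ports
# (2376/tcp, 2377/tcp, 7946/tcp+udp, 4789/udp) should appear after the restart.
sudo firewall-cmd --list-ports
```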

I don’t have a proxy between my manager and worker.

Manager IP: 192.168.2.103, worker IP: 192.168.2.102

I ssh into the worker from the manager node and issue the swarm join command:

docker swarm join --token SWMTKN-1-578xkqwthbmkn7fnegmtx4pn73h0wh5qgvzw03gb50xmg1t0qu-a1k7myosfu4zkruw7y1x81l1k 192.168.2.103:2377

Error response from daemon: rpc error: code = Unavailable desc = grpc: the connection is unavailable

How can I fix this? I can't understand how you solved this after following this thread.

Please help

Hey guys, I just initialized the swarm on my local Mac and hit the same issue as you.

$ docker swarm init --advertise-addr 192.168.99.100

If I first create a manager1, then ssh into it and init the swarm there, it works.

$ docker-machine create manager1
$ docker-machine ip manager1
# 192.168.99.104
$ docker-machine ssh manager1

docker@manager1:~$ docker swarm init --advertise-addr 192.168.99.104

Now I create a worker1, ssh into it, and join it to the swarm cluster; it works perfectly.

$ docker-machine create -d virtualbox worker1
$ docker-machine ssh worker1

docker@worker1:~$ docker swarm join --token SWMTKN-1-1s06e6ubid2axcj3h9bo06smlj32jhqp74pj1wkmflb3f0x5pi-cde5dm1xdzo3eum3s7lrj96aa 192.168.99.104:2377

So, the problem is that we should not init the swarm on our local macOS host, but inside a manager VM we create.

BTW, I’ve published this post on Docker forum too.

Is this issue going to be fixed? Because it means that if you want to add a new master to a swarm, you have to restart all the other master nodes to update their no_proxy lists (no_proxy cannot use wildcards), which is really very annoying.

To be concrete: an http_proxy variable was set for the Docker daemon because a proxy is needed to access the Docker registry. This unexpectedly caused all internal gRPC traffic to go via the outgoing proxy, causing much mayhem. If you know the IPs of your future nodes up front, you can mitigate it a bit, but it is very annoying.
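
The effect can be illustrated with a small shell check: proxy selection roughly asks "is this host in NO_PROXY?", and a manager IP missing from the list gets routed through the proxy. The values below are taken from the `docker info` output above but are purely illustrative (real proxy matching also honors domain suffixes):

```shell
#!/bin/sh
# Illustrative check: is the manager IP covered by the NO_PROXY list?
NO_PROXY_LIST="localhost,172.17.42.1,.sock"
MANAGER_IP="10.90.251.200"

# A comma-delimited membership test, simulating exact-host matching.
case ",$NO_PROXY_LIST," in
  *",$MANAGER_IP,"*) RESULT="direct" ;;   # bypasses the proxy
  *)                 RESULT="proxied" ;;  # gRPC traffic goes via the proxy
esac
echo "$MANAGER_IP is $RESULT"
```

With the manager IP absent from the list, the script reports the address as proxied, which is exactly the failure mode described above.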

AFAICS gRPC supports the concept of a "proxy mapper", which would allow Docker to control whether the proxy is used and thus automatically exclude other swarm nodes.

In any case, having to add all potential swarm participant nodes mutually to each other's NO_PROXY environment variable just gives me some more Ansible to write…

Proxy issue… While it wasn't needed in versions < 17.07.0, it is now required to configure the daemon proxy according to your infrastructure. Adding the manager IP to no_proxy (in the docker daemon service) fixes the issue. Thx @ktoublanc
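
On a systemd-based distribution such as the reporter's CentOS 7, the daemon proxy is typically configured with a drop-in for the docker service. A minimal sketch, assuming a proxy at some-proxy:3128 and a manager at 10.90.251.200 as in the `docker info` output above (adjust both to your infra):

```shell
# Create a systemd drop-in that sets the daemon's proxy environment,
# with the manager IP listed in NO_PROXY so swarm gRPC traffic stays direct.
# Paths follow systemd conventions; addresses are illustrative.
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf <<'EOF'
[Service]
Environment="HTTP_PROXY=http://some-proxy:3128"
Environment="HTTPS_PROXY=http://some-proxy:3128"
Environment="NO_PROXY=localhost,127.0.0.1,10.90.251.200"
EOF

# Reload unit files and restart the daemon to pick up the new environment.
sudo systemctl daemon-reload
sudo systemctl restart docker
```

Every future manager IP has to be appended to NO_PROXY on all nodes, which is the maintenance burden the comments above complain about.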