moby: docker swarm join swallows errors
When joining as a manager, if the join fails (for example, because of an incorrect --advertise-addr value), docker swarm join does not report an error. Instead, it runs for a while and then exits silently, leaving the node in a bad state.
Example:
root@ip-172-31-12-164:~# docker swarm join --token <token> --advertise-addr 1.2.3.4 <manager addr>
root@ip-172-31-12-164:~# docker node ls
Error response from daemon: rpc error: code = 2 desc = raft: no elected cluster leader
cc @tonistiigi
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 2
- Comments: 23 (23 by maintainers)
It looks like this code path is followed in Join (daemon/cluster/cluster.go):

So the channel returned by Ready() gets closed, but that probably shouldn’t be happening, because the new node doesn’t join raft successfully.

Looking at what causes this channel to be closed, it’s triggered by both the agentReady and managerReady channels being closed. In particular, managerReady is being closed prematurely. The runManager function in swarmkit’s agent/node.go passes that channel to initManagerConnection. initManagerConnection performs a health check, and closes the channel once the health check succeeds.

Next, I looked at the server-side code for the health check, in (*Manager).Run:

The ControlAPI health check is set to SERVING status before raft is started, which looks wrong to me. This seems to be the source of the problem. It lets the code in agent/node.go believe the manager has initialized successfully when it has not even joined raft yet.

If I move this localHealthServer.SetServingStatus call to just after JoinAndStart, it seems to fix the immediate problem. Instead of getting no error, I get:

So I think a PR to https://github.com/docker/swarmkit that moves the initialization of the ControlAPI health check later should fix this bug. Note that the Raft health check initialization needs to stay where it is, because that’s necessary for joining the raft cluster.

Let me know if you want to submit such a PR, or if I’ve spoiled all the fun, I can do it.
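To illustrate the ordering problem described above, here is a minimal, self-contained Go sketch. It is not swarmkit code: healthServer, joinRaft, runManager, and managerReady are hypothetical stand-ins, and the real manager uses a gRPC health server rather than a map. It only models the idea that marking ControlAPI as serving before the raft join lets a health-checking client conclude the manager is ready even when the join fails.

```go
package main

import (
	"errors"
	"fmt"
)

// healthServer is a hypothetical stand-in for the manager's health server.
// It only records which services have been marked as serving.
type healthServer struct {
	serving map[string]bool
}

func newHealthServer() *healthServer {
	return &healthServer{serving: map[string]bool{}}
}

func (h *healthServer) SetServing(service string) { h.serving[service] = true }

func (h *healthServer) Check(service string) bool { return h.serving[service] }

var errJoin = errors.New("raft join failed")

// joinRaft simulates joining the raft cluster; ok=false models a failed join
// (for example, caused by a bad advertise address).
func joinRaft(ok bool) error {
	if ok {
		return nil
	}
	return errJoin
}

// runManager simulates manager startup (hypothetical, simplified).
// With markEarly=true, ControlAPI is marked serving before the raft join,
// which is the problematic order. With markEarly=false, it is marked
// serving only after the join succeeds.
func runManager(h *healthServer, markEarly, joinOK bool) error {
	// The Raft health status stays early: it is needed for the join itself.
	h.SetServing("Raft")
	if markEarly {
		h.SetServing("ControlAPI")
	}
	if err := joinRaft(joinOK); err != nil {
		return err
	}
	if !markEarly {
		h.SetServing("ControlAPI")
	}
	return nil
}

// managerReady reports what a client that health-checks ControlAPI would
// conclude after startup has run.
func managerReady(markEarly, joinOK bool) bool {
	h := newHealthServer()
	_ = runManager(h, markEarly, joinOK)
	return h.Check("ControlAPI")
}

func main() {
	fmt.Println("early mark, failed join:", managerReady(true, false))
	fmt.Println("late mark, failed join: ", managerReady(false, false))
	fmt.Println("late mark, good join:   ", managerReady(false, true))
}
```

In this sketch only the first case reports the manager as ready after a failed join, which corresponds to the silent failure seen with docker swarm join.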
Not yet.
I was able to go ahead and create the pull request https://github.com/docker/swarmkit/pull/1555