moby: docker swarm join swallows errors
When joining as a manager and the join fails (for example, when specifying an incorrect --advertise-addr value), docker swarm join does not report an error. Instead, it runs for a while and then exits silently, leaving the node in a bad state.
Example:
root@ip-172-31-12-164:~# docker swarm join --token <token> --advertise-addr 1.2.3.4 <manager addr>
root@ip-172-31-12-164:~# docker node ls
Error response from daemon: rpc error: code = 2 desc = raft: no elected cluster leader
cc @tonistiigi
About this issue
- State: closed
- Created 8 years ago
- Reactions: 2
- Comments: 23 (23 by maintainers)
It looks like this code path is followed in Join (daemon/cluster/cluster.go):

So the channel returned by Ready() gets closed, but that probably shouldn't be happening, because the new node doesn't join raft successfully.

Looking at what causes this channel to be closed, it's triggered by both the agentReady and managerReady channels being closed. In particular, managerReady is being closed prematurely. The runManager function in swarmkit's agent/node.go passes that channel to initManagerConnection. initManagerConnection performs a health check, and closes the channel once the health check succeeds.

Next, I looked at the server-side code for the health check, in (*Manager).Run:

The ControlAPI health check is set to SERVING status before raft is started, which looks wrong to me. This seems to be the source of the problem. It lets the code in agent/node.go believe the manager has initialized successfully when it has not even joined raft yet.

If I move this localHealthServer.SetServingStatus call to just after JoinAndStart, it seems to fix the immediate problem. Instead of getting no error, I get:

So I think a PR to https://github.com/docker/swarmkit that moves the initialization of the ControlAPI health check later should fix this bug. Note that the Raft health check initialization needs to stay where it is, because that's necessary for joining the raft cluster.

Let me know if you want to submit such a PR, or if I've spoiled all the fun, I can do it.
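To make the ordering problem concrete, here is a minimal, self-contained Go sketch of the behaviour described above. It is not swarmkit code: healthServer, joinRaft, runManager, and joinLooksReady are hypothetical stand-ins, and the raft join is hard-coded to fail. Only the names SetServingStatus, ControlAPI, Raft, agentReady, and managerReady are taken from the discussion. The sketch assumes the agent side comes up fine and that managerReady is closed as soon as the ControlAPI health check passes.

```go
package main

import (
	"errors"
	"fmt"
)

// healthServer is a hypothetical stand-in for the manager's local health server.
type healthServer struct {
	serving map[string]bool
}

func (h *healthServer) SetServingStatus(service string, ok bool) { h.serving[service] = ok }
func (h *healthServer) Check(service string) bool                { return h.serving[service] }

// joinRaft simulates a raft join that always fails, for example because of
// an unreachable advertise address. (Assumption for illustration only.)
func joinRaft() error { return errors.New("simulated raft join failure") }

// runManager simulates manager startup and reports whether the ControlAPI
// health check would pass afterwards. With markEarly set, ControlAPI is
// marked as serving before the raft join, as in the buggy ordering.
func runManager(markEarly bool) (bool, error) {
	hs := &healthServer{serving: map[string]bool{}}
	// The Raft health check has to be serving before the join in both orderings.
	hs.SetServingStatus("Raft", true)
	if markEarly {
		hs.SetServingStatus("ControlAPI", true)
	}
	if err := joinRaft(); err != nil {
		return hs.Check("ControlAPI"), err
	}
	// Proposed ordering: mark ControlAPI as serving only after the join succeeded.
	hs.SetServingStatus("ControlAPI", true)
	return hs.Check("ControlAPI"), nil
}

// joinLooksReady reports whether the combined "ready" condition is met,
// i.e. both agentReady and managerReady have been closed.
func joinLooksReady(markEarly bool) bool {
	agentReady := make(chan struct{})
	managerReady := make(chan struct{})
	close(agentReady) // assume the agent started without problems

	if controlServing, _ := runManager(markEarly); controlServing {
		close(managerReady) // the health check passed, so the channel is closed
	}

	// Non-blocking checks: a closed channel is always ready to receive from.
	select {
	case <-agentReady:
	default:
		return false
	}
	select {
	case <-managerReady:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println("ControlAPI marked before raft join: ready =", joinLooksReady(true))
	fmt.Println("ControlAPI marked after raft join:  ready =", joinLooksReady(false))
}
```

With the early ordering the simulated join looks ready even though raft never joined; with the late ordering it does not, which leaves room for the caller to surface the join error.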
Not yet.
I was able to go ahead and create the pull request https://github.com/docker/swarmkit/pull/1555