moby: docker swarm join swallows errors

When joining as a manager and the join fails (for example, when an incorrect --advertise-addr value is specified), docker swarm join does not report an error. Instead, it runs for a while and then exits silently, leaving the node in a bad state.

Example:

        root@ip-172-31-12-164:~# docker swarm join --token <token> --advertise-addr 1.2.3.4 <manager addr>

        root@ip-172-31-12-164:~# docker node ls
        Error response from daemon: rpc error: code = 2 desc = raft: no elected cluster leader

cc @tonistiigi

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 2
  • Comments: 23 (23 by maintainers)

Most upvoted comments

It looks like this code path is followed in Join (daemon/cluster/cluster.go):

        case <-n.Ready():
                go c.reconnectOnFailure(n)
                return nil

So the channel returned by Ready() gets closed, but that probably shouldn't happen here, because the new node never joins raft successfully.
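To make the failure mode concrete, here is a minimal, self-contained sketch of why a prematurely closed Ready() channel makes the join look successful. The node type, durations, and error text are simplified stand-ins for illustration, not the actual moby/swarmkit code:

        package main

        import (
                "errors"
                "fmt"
                "time"
        )

        // node is a simplified stand-in for swarmkit's node; only the Ready channel matters here.
        type node struct{ ready chan struct{} }

        func (n *node) Ready() <-chan struct{} { return n.ready }

        // join mirrors the shape of the select above: a closed Ready() channel is taken as
        // success, so if Ready() closes prematurely the caller never sees an error.
        func join(n *node, timeout time.Duration) error {
                select {
                case <-n.Ready():
                        // the real code also starts reconnectOnFailure(n) here
                        return nil
                case <-time.After(timeout):
                        return errors.New("timeout was reached before node was joined")
                }
        }

        func main() {
                n := &node{ready: make(chan struct{})}
                close(n.ready)                    // simulate Ready() closing before raft has actually been joined
                fmt.Println(join(n, time.Second)) // prints <nil>: the failure is swallowed
        }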

Looking at what causes this channel to be closed, it's triggered when both the agentReady and managerReady channels are closed. In particular, managerReady is being closed prematurely. The runManager function in swarmkit's agent/node.go passes that channel to initManagerConnection.

initManagerConnection performs a health check and closes the channel once the health check succeeds.
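As a rough illustration of that mechanism (waitForManager, status, and the check callback are invented names; the real initManagerConnection polls the manager over swarmkit's gRPC health API), the pattern looks roughly like this:

        package main

        import (
                "context"
                "fmt"
                "time"
        )

        // status mirrors the SERVING / NOT_SERVING values of the gRPC health-check
        // protocol; the real code uses swarmkit's generated api package.
        type status int

        const (
                notServing status = iota
                serving
        )

        // waitForManager is a simplified stand-in for initManagerConnection: it polls a
        // health check and closes managerReady as soon as the check reports SERVING.
        func waitForManager(ctx context.Context, check func() status, managerReady chan<- struct{}) {
                ticker := time.NewTicker(100 * time.Millisecond)
                defer ticker.Stop()
                for {
                        if check() == serving {
                                close(managerReady)
                                return
                        }
                        select {
                        case <-ctx.Done():
                                return
                        case <-ticker.C:
                        }
                }
        }

        func main() {
                managerReady := make(chan struct{})

                // Simulate a server that reports SERVING immediately, before its raft node
                // has joined the cluster; readiness is then signalled too early.
                go waitForManager(context.Background(), func() status { return serving }, managerReady)

                <-managerReady
                fmt.Println("managerReady closed, so node.Ready() fires even though raft never joined")
        }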

Next, I look at the server-side code for the health check, in (*Manager).Run:

        // Set the raft server as serving for the health server
        healthServer.SetServingStatus("Raft", api.HealthCheckResponse_SERVING)
        localHealthServer.SetServingStatus("ControlAPI", api.HealthCheckResponse_SERVING)

        defer func() {
                m.server.Stop()
                m.localserver.Stop()
        }()

        if err := m.RaftNode.JoinAndStart(); err != nil {
                return fmt.Errorf("can't initialize raft node: %v", err)
        }

The ControlAPI health check is set to SERVING status before raft is started, which looks wrong to me. This seems to be the source of the problem. It lets the code in agent/node.go believe the manager has initialized successfully when it has not even joined raft yet.

If I move this localHealthServer.SetServingStatus call to just after JoinAndStart, it seems to fix the immediate problem. Instead of getting no error, I get:

Error response from daemon: Timeout was reached before node was joined. The attempt to join the swarm will continue in the background. Use the "docker info" command to see the current swarm status of your node.

So I think a PR to https://github.com/docker/swarmkit that moves the initialization of the ControlAPI health check later should fix this bug. Note that the Raft health check initialization needs to stay where it is, because that’s necessary for joining the raft cluster.
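For illustration, the reordered version of the quoted snippet might look roughly like this (the actual change in swarmkit may differ in detail):

        // Set the raft server as serving for the health server
        healthServer.SetServingStatus("Raft", api.HealthCheckResponse_SERVING)

        defer func() {
                m.server.Stop()
                m.localserver.Stop()
        }()

        if err := m.RaftNode.JoinAndStart(); err != nil {
                return fmt.Errorf("can't initialize raft node: %v", err)
        }

        // Only report ControlAPI as healthy once raft has started, so that
        // initManagerConnection (and therefore node.Ready()) cannot fire early.
        localHealthServer.SetServingStatus("ControlAPI", api.HealthCheckResponse_SERVING)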

Let me know if you want to submit such a PR; or, if I've spoiled all the fun, I can do it myself.

Thanks @srodman7689! 👍

@aaronlehmann Can we close this issue? I’m not sure the fix made it into a SwarmKit vendoring yet.

Not yet.

I went ahead and created the pull request: https://github.com/docker/swarmkit/pull/1555