moby: single manager docker swarm stuck in Down state after reboot

Description

A single-manager swarm cluster gets into the following state after an abrupt reboot:

# docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
ux3enftr6krhlrp8bbodo2q1l *   wsdocker6           Down                Active              Leader

The manager can’t leave the swarm, as the docker swarm leave --force command times out.

The above state shouldn't be possible. To print that output at all, the node must be acting as a manager, and since there is only one manager, the node can infer that it is that manager itself, so it cannot actually be Down.
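
For reference, the node's own view of its status can be queried directly; a minimal sketch, using the node ID from the docker node ls output above:

# sketch: ask the manager what it thinks its own status is
docker node inspect ux3enftr6krhlrp8bbodo2q1l --format '{{ .Status.State }}: {{ .Status.Message }}'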

Steps to reproduce the issue (see the command sketch after the list):

  1. Create a swarm with 1 manager
  2. Abruptly reboot
  3. Profit
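
A rough command sketch of those steps, assuming a disposable test host (reboot -f stands in for the abrupt reboot; a hard power cycle works just as well):

docker swarm init      # 1. create a single-manager swarm
reboot -f              # 2. reboot without a clean shutdown
# after the host is back up:
docker node ls         # 3. the lone manager may now show STATUS "Down"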


Output of docker version:

 docker version
Client:
 Version:      17.06.2-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   cec0b72
 Built:        Tue Sep  5 20:00:17 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.2-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   cec0b72
 Built:        Tue Sep  5 19:59:11 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info:

root@wsdocker6:/etc/systemd/system# docker info
Containers: 83
 Running: 1
 Paused: 0
 Stopped: 82
Images: 103
Server Version: 17.06.2-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: pending
 NodeID: ux3enftr6krhlrp8bbodo2q1l
 Is Manager: true
 ClusterID: nhkjyh6xo2jdu2txkjt95xfua
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Root Rotation In Progress: false
 Node Address: x.x.x.x
 Manager Addresses:
  x.x.x.x:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-93-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.63GiB
Name: wsdocker6
ID: KRY5:PQU2:2KPP:7QR6:4EXD:QFXT:KKVB:CZVN:OQUU:S5A3:Z5ZT:XQGN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: xxx
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support


About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 6
  • Comments: 53 (13 by maintainers)

Most upvoted comments

For me, I had this problem with a swarm worker (on CentOS 7 - Docker version: 18.06.0-ce). The only error I could find was:

swarm component could not be started before timeout was reached

Looking around in /var/lib/docker, I found one huge file (besides the normal volumes etc.): tasks.db was 15 GB!

[root@swarmwork-02 worker]# ll /var/lib/docker/swarm/worker/tasks.db 
-rw-r--r--. 1 root root 15726108672 Jan 25 13:32 /var/lib/docker/swarm/worker/tasks.db

I stopped docker, deleted the file and everything was fine. The manager sees the worker again. Hopefully this helps others that come across this issue.
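
For anyone wanting the exact sequence, a sketch of that workaround (assuming systemd; moving the file instead of deleting it keeps a backup):

# sketch of the workaround described above
systemctl stop docker
mv /var/lib/docker/swarm/worker/tasks.db /root/tasks.db.bak   # or rm, as in the comment above
systemctl start docker
docker node ls    # the node should report Ready again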

We’ve made a patch to help keep the tasks.db size down. It does not fix issues with already-large tasks.db files. This should be in the next release.

Inspecting the node ID, I get, among other things:

        },
        "Status": {
            "State": "down",
            "Message": "heartbeat failure",
            "Addr": "a.b.c.d"
        },

Closing since this should be resolved in 19.03.9

Thanks for reporting back!

There’s a WIP PR in SwarmKit: https://github.com/docker/swarmkit/pull/2917


Removing tasks.db worked for me too. Thanks @ryandaniels, you’re awesome. Docker version 19.03.5, build 633a0ea

Removing tasks.db (on the node that will not rejoin) immediately resolved the problem for me as well.

The solution of removing tasks.db worked for me as well 👍

docker -v: Docker version 19.03.5, build 633a0ea838. One-node swarm, same issue.

Seeing the same problem on 18.05.0-ce. I stopped the daemon for about 5 minutes using service docker stop. On service docker start I see:

ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
nj4oo60bh6uvovtmehgucn34g     ions1               Ready               Active              Reachable           18.05.0-ce
k2quh59urx244fo487og0z669 *   ions2               Down                Active              Reachable           18.05.0-ce
s97wdji24ymbnuxs08snaedbq     ions3               Ready               Active              Leader              18.05.0-ce

    "Status": {
        "State": "down",
        "Message": "heartbeat failure for node in \"unknown\" state",
        "Addr": "0.0.0.0"
    },

I’ll let it sit & see if it fixes itself.

Logs: https://gist.github.com/drnybble/cd1d4deb72282cb23803373dec1a327c

Turned on debug, all I see is:

Aug 10 07:39:17 ions2 systemd[1]: Starting Docker Application Container Engine…
Aug 10 07:39:39 ions2 dockerd[18588]: time="2018-08-10T07:39:39.057410260-07:00" level=error msg="swarm component could not be started before timeout was reached"
Aug 10 07:39:39 ions2 dockerd[18588]: time="2018-08-10T07:39:39.057532327-07:00" level=info msg="Daemon has completed initialization"

I went to bed, and when I woke up the host was healthy. Looking at the logs, it seems the host became healthy after about an hour. Again, I'm guessing that either some timeout elapsed or some raft log processing took a long time.

I found this issue after also having trouble bringing up my single-node docker swarm after reboots. For me, deleting /var/lib/docker/swarm/worker/tasks.db worked as a workaround in the sense that I could successfully run docker stack deploy again (which otherwise failed because the swarm wasn't running). Not only did it feel wrong and dangerous, but after automating that deletion before every docker daemon start I ran into another problem: the swarm no longer started automatically after reboots (I guess that state is stored inside tasks.db somehow).
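
For illustration only, such pre-start automation would typically look like a systemd drop-in along these lines (a hypothetical sketch, not a recommendation, since as described above it broke swarm auto-start):

# hypothetical drop-in, e.g. /etc/systemd/system/docker.service.d/wipe-tasksdb.conf
[Service]
ExecStartPre=-/usr/bin/rm -f /var/lib/docker/swarm/worker/tasks.db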

So I wanted to document my journey and learnings here. If this is the wrong place, or there is a better place to document this, please let me know.

I’m using a Raspberry Pi 4B with Fedora 35 IoT Edition (aarch64).

I found that the problem is that the Raspberry Pi is extremely bad at keeping time. My hardware clock was over a month behind. As I'm running Fedora, I don't know how (or if) I can sync the hardware clock on the Raspberry Pi from Fedora (I assume that only works in Raspbian?).
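
A quick way to compare the system clock against the hardware clock (a sketch; both tools are standard on Fedora):

timedatectl            # shows local time, RTC time and whether the clock is synchronized
sudo hwclock --show    # reads the hardware clock directly (if the board has one)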

Anyway, I found out about the time drift by checking journalctl:

Nov 15 00:00:31 rpi.fritz.box dockerd[815]: time="2021-11-15T00:00:31.869376273Z" level=error msg="cluster exited with error: error while validating Root CA Certificate: x509: certificate has expired or is not yet valid: current time 2021-11-15T00:00:31Z is before 2021-12-24T12:18:00Z"
Nov 15 00:00:31 rpi.fritz.box dockerd[815]: time="2021-11-15T00:00:31.869607826Z" level=error msg="swarm component could not be started" error="error while validating Root CA Certificate: x509: certificate has expired or is not yet valid: current time 2021-11-15T00:00:31Z is before 2021-12-24T12:18:00Z"

Notice that the logs are dated 15 Nov while today is 24 Dec, the day I created the swarm. So the issue is that the certificates Docker creates internally to manage the swarm nodes are treated as invalid: when the Raspberry Pi boots it thinks it is 15 Nov, but the certificate only becomes valid on 24 Dec, so it is not yet valid.
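
The validity window of those certificates can be checked directly; a sketch, assuming the default location under /var/lib/docker/swarm/certificates:

# sketch: print the notBefore/notAfter dates of the swarm certificates
openssl x509 -noout -dates -in /var/lib/docker/swarm/certificates/swarm-node.crt
openssl x509 -noout -dates -in /var/lib/docker/swarm/certificates/swarm-root-ca.crt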

Then I checked journalctl -b and saw that a lot of log lines were from the past, even though the (re)boot had happened just 5 minutes earlier.

chronyd seems to be the daemon Fedora uses by default to keep the system time correct. I disabled it and switched to the systemd-native one: systemctl stop chronyd && systemctl disable chronyd && systemctl enable systemd-timesyncd && systemctl start systemd-timesyncd
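
To verify that the clock is actually being synchronized after the switch, something like this should do (a sketch):

timedatectl status            # "System clock synchronized: yes" once timesyncd has caught up
timedatectl timesync-status   # details about the NTP server systemd-timesyncd is using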

Now I ensure the docker service waits for the system time to be set by adding time-set.target to the After= line of the docker unit file:

[Unit]
After=network-online.target firewalld.service containerd.service time-set.target

via systemctl edit docker.
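
Since After= accumulates across drop-ins, an override containing only the extra target should also work; a minimal sketch of such a drop-in (if you create the file by hand instead of using systemctl edit, run systemctl daemon-reload afterwards):

# /etc/systemd/system/docker.service.d/override.conf
[Unit]
After=time-set.target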

I rebooted. All Docker log entries now showed the correct time. However, my swarm still didn't persist between reboots. The logs showed:

Dec 24 13:41:01 rpi.fritz.box dockerd[820]: time="2021-12-24T13:41:01.720958879Z" level=error msg="error creating cluster object" error="name conflicts with an existing object" module=node node.id=uj13vzai54eeuyne7t2v1obcc

I solved that by completely tearing down my (single-node) swarm and recreating it: docker swarm leave --force followed by a docker swarm init [...] command.

To my surprise, the data in my containers was still there - but please be sure you know what you are doing with your swarm config, and don't blame me for data loss.

Now, with docker enabled (systemctl enable docker), my swarm automatically comes back up when my machine reboots - as it should.

P.S.: I also created pull request #43107, so this might work out of the box in the future.

I can’t believe that worked. My tasks.db was only 8.0 MB. sudo rm /var/lib/docker/swarm/worker/tasks.db restored all services on my node without side effects.

So far tasks.db looks good! No growth since upgrading to 19.03.9 https://docs.docker.com/engine/release-notes/#19039

I experienced this same issue. Couldn’t get it fixed, or at least not quickly enough, and decided to ditch Swarm for good old docker-compose, since I’m not actually using cluster features.

Was running on 17.12.0, so maybe this behaviour is fixed in newer versions. However, I’ve now learned to appreciate compose’s simplicity.

@drnybble From the logs:

swarm component could not be started before timeout was reached

There seems to be something preventing Swarm from starting on that manager node you stopped. The logical follow-up is that when typing docker node ls, we then see that the node cannot be reached because Swarm is not running (the heartbeat failure error).

It does not seem to be the raft subsystem, as it rightfully puts itself in the follower state and can process heartbeats (although it could be restoring state from a snapshot from the current leader, in which case this could take some time and eat into the time allocated for startup).

The quorum isn’t lost either, as we have two manager nodes in the Ready state.

By any chance, were you running heavy workloads on the remaining manager nodes when this happened, or were you issuing a lot of docker commands (which would build up the state quickly)?

In your case I would attempt a restart of the manager and see if it finally starts successfully.

/cc @cyli @dperny You may have a much better idea of what is going on here 😄

@abronan sorry, the manager problem was genuine, but the other nodes had a different problem (the rexray volume plugin I had recently uninstalled showed up in the logs again and blocked the swarm subsystem's startup), so I had already tinkered heavily. When it's appropriate I'll intentionally break our lab again and write up the results; I'll gladly help with resolving this.

Got into this state again after the last upgrade to Docker 18.05:

$ docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
s6akodm9kto0uu38v0zj4xdu8     training-server     Down                Active                                  18.05.0-ce
ux3enftr6krhlrp8bbodo2q1l *   wsdocker6           Down                Active              Leader              18.05.0-ce