moby: single manager docker swarm stuck in Down state after reboot
Description
A single-manager swarm cluster gets into the following state after an abrupt reboot:
# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
ux3enftr6krhlrp8bbodo2q1l * wsdocker6 Down Active Leader
The manager can’t leave the swarm, as the docker swarm leave --force
command times out.
This state should not be possible. Since the node can print this output at all, it must be a manager, and since there is only one manager, it must itself be that manager, so it cannot be Down.
Steps to reproduce the issue:
- Create a swarm with 1 manager
- Abruptly reboot
- Profit (the node comes back in the Down state shown above; a rough reproduction sketch follows)
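Roughly, a reproduction sketch (my own paraphrase of the steps above; the sysrq trick only simulates a power loss and assumes sysrq is available on the host):
# create a single-manager swarm
docker swarm init
# simulate an abrupt reboot: reboot immediately without syncing or unmounting
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger
# after boot, `docker node ls` shows the node as Down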
Output of docker version:
docker version
Client:
Version: 17.06.2-ce
API version: 1.30
Go version: go1.8.3
Git commit: cec0b72
Built: Tue Sep 5 20:00:17 2017
OS/Arch: linux/amd64
Server:
Version: 17.06.2-ce
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: cec0b72
Built: Tue Sep 5 19:59:11 2017
OS/Arch: linux/amd64
Experimental: true
Output of docker info:
root@wsdocker6:/etc/systemd/system# docker info
Containers: 83
Running: 1
Paused: 0
Stopped: 82
Images: 103
Server Version: 17.06.2-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: pending
NodeID: ux3enftr6krhlrp8bbodo2q1l
Is Manager: true
ClusterID: nhkjyh6xo2jdu2txkjt95xfua
Managers: 1
Nodes: 1
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Root Rotation In Progress: false
Node Address: x.x.x.x
Manager Addresses:
x.x.x.x:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-93-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.63GiB
Name: wsdocker6
ID: KRY5:PQU2:2KPP:7QR6:4EXD:QFXT:KKVB:CZVN:OQUU:S5A3:Z5ZT:XQGN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: xxx
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
About this issue
- State: closed
- Created 7 years ago
- Reactions: 6
- Comments: 53 (13 by maintainers)
For me, I had this problem with a swarm worker (on CentOS 7 - Docker version: 18.06.0-ce). The only error I could find was:
Looking around in /var/lib/docker, there was one huge file (besides the normal volumes, etc.): tasks.db was 15GB!
I stopped docker, deleted the file and everything was fine. The manager sees the worker again. Hopefully this helps others that come across this issue.
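For anyone wanting the exact steps, this is roughly what I did (a sketch only; the path is the one mentioned in later comments and assumes the default /var/lib/docker data root):
systemctl stop docker                              # or: service docker stop
ls -lh /var/lib/docker/swarm/worker/tasks.db       # confirm the file is abnormally large
rm /var/lib/docker/swarm/worker/tasks.db
systemctl start docker
# from a manager, `docker node ls` should show the node as Ready again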
We’ve made a patch to help keep the tasks.db size down. It does not fix issues with already-large tasks.db files. This should be in the next release.
Inspecting the node ID, I get, among other things:
Closing since this should be resolved in 19.03.9
Thanks for reporting back!
There’s a WIP PR in SwarmKit: https://github.com/docker/swarmkit/pull/2917
This worked for me. Thanks @ryandaniels, you’re awesome.
Docker version 19.03.5, build 633a0ea
Removing tasks.db (on the node that will not rejoin) immediately resolved the problem for me as well.
The solution of removing tasks.db worked for me as well 👍
docker -v: Docker version 19.03.5, build 633a0ea838. One-node swarm, same issue.
Seeing the same problem on 18.05-ce. I stopped the daemon for about 5 minutes using service docker stop. On service docker start I see:
ID                          HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS  ENGINE VERSION
nj4oo60bh6uvovtmehgucn34g   ions1     Ready   Active        Reachable       18.05.0-ce
k2quh59urx244fo487og0z669 * ions2     Down    Active        Reachable       18.05.0-ce
s97wdji24ymbnuxs08snaedbq   ions3     Ready   Active        Leader          18.05.0-ce
I’ll let it sit & see if it fixes itself.
Logs: https://gist.github.com/drnybble/cd1d4deb72282cb23803373dec1a327c
Turned on debug, all I see is:
Aug 10 07:39:17 ions2 systemd[1]: Starting Docker Application Container Engine...
Aug 10 07:39:39 ions2 dockerd[18588]: time="2018-08-10T07:39:39.057410260-07:00" level=error msg="swarm component could not be started before timeout was reached"
Aug 10 07:39:39 ions2 dockerd[18588]: time="2018-08-10T07:39:39.057532327-07:00" level=info msg="Daemon has completed initialization"
I went to bed, and when I woke up the host was healthy. Looking at the logs, it seems the host became healthy after about an hour. Again, I’m guessing that either some timeout was hit or some Raft log processing took a long time.
I found this issue after also having trouble bringing up my single-node Docker swarm after reboots. For me, deleting
/var/lib/docker/swarm/worker/tasks.db
worked as a workaround in the sense that I could successfully run docker stack deploy
again (which otherwise didn’t work because the swarm wasn’t running). It not only felt wrong and dangerous, but after automating the deletion before every Docker daemon start I had another problem: the swarm no longer started automatically after reboots (I guess that state is stored inside tasks.db somehow). So I want to document my journey and learnings here. If this is the wrong place or there is a better place to document this, please let me know.
I’m using a Raspberry Pi 4B with Fedora 35 IoT Edition (aarch64).
I found that the problem is that the Raspberry Pi is extremely bad at keeping time. My hardware clock was over a month behind. As I’m running Fedora, I don’t know how, or whether, I can sync the hardware clock on the Raspberry Pi from Fedora (I assume that only works in Raspbian?).
Anyway, I found out about the time drift by checking journalctl:
Notice how the logs are dated 15 Nov (today is 24 Dec), while I created the swarm today. So the issue is that the certificates Docker creates internally to manage the swarm nodes are not usable: when the Raspberry Pi boots it thinks it is 15 Nov, while the certificate only becomes valid on 24 Dec, so it is not yet valid.
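A quick way to confirm this kind of clock/certificate mismatch (a sketch; the certificate path assumes the default Docker data root and may differ on other installs):
date
openssl x509 -in /var/lib/docker/swarm/certificates/swarm-node.crt -noout -dates
# if notBefore is later than what `date` reports, the node certificate is not yet valid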
Then I checked
journalctl -b
and saw that a lot of log lines were from the past, even though the (re)boot had happened just 5 minutes earlier. chronyd seems to be the daemon that Fedora uses by default to keep the system time correct. I disabled it and enabled the systemd-native one instead:
systemctl stop chronyd && systemctl disable chronyd && systemctl enable systemd-timesyncd && systemctl start systemd-timesyncd
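To verify the switch actually took effect (a quick sanity check, not part of the original steps; the exact wording of the output depends on your systemd version):
systemctl status systemd-timesyncd    # should be active (running)
timedatectl                           # should report "System clock synchronized: yes" once NTP has synced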
Now I ensure the Docker service waits for the system time to be set by adding time-set.target to the
After=
section of the Docker unit file via
systemctl edit docker
(a sketch of the resulting drop-in follows below).
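For reference, the drop-in that systemctl edit docker creates ends up looking roughly like this (a sketch that recreates it by hand; systemd merges this After= with the one in the stock unit file):
mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/override.conf <<'EOF'
[Unit]
After=time-set.target
EOF
systemctl daemon-reload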
I rebooted. All Docker log entries now showed the correct time. However, my swarm still didn’t persist between reboots. It showed:
Which I solved by completely tearing down my (single-node) swarm and recreating it:
docker swarm leave --force
followed by a
docker swarm init [...]
command (see the sketch below). To my surprise the data in my containers was still kept, but please be sure you know what you are doing with your swarm config and don’t blame me for data loss.
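The sequence was roughly (a sketch; the init flag and stack names are placeholders, and any stacks have to be redeployed afterwards):
docker swarm leave --force
docker swarm init --advertise-addr <node-ip>              # placeholder flag; adjust to your setup
docker stack deploy -c <your-compose-file> <stack-name>   # redeploy whatever was running before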
Now, with Docker enabled (
systemctl enable docker
), my swarm automatically comes back up with my machine, as it should.
P.S.: I also created pull request #43107 so this might work out of the box in the future.
I can’t believe that worked. My tasks.db was only 8.0 MB. Restored all services on my node without side effects.
sudo rm /var/lib/docker/swarm/worker/tasks.db
So far tasks.db looks good! No growth since upgrading to 19.03.9 https://docs.docker.com/engine/release-notes/#19039
I experienced this same issue. Couldn’t get it fixed, or at least not quickly enough, and decided to ditch Swarm for good old
docker-compose
, since I’m not actually using cluster features. I was running on 17.12.0, so maybe this behaviour is fixed in newer versions. However, I’ve now learned to appreciate
compose
’s simplicity.
@drnybble From the logs:
There seems to be something preventing Swarm from starting on that manager node you stopped. The logical follow-up is that when typing
docker node ls
, we then see that the node could not be reached because Swarm is not running (the
heartbeat failure
error). It does not seem to be the Raft subsystem, as it rightfully puts itself in the follower state and can process heartbeats (although it could be restoring state from a snapshot from the current leader, in which case this could take some time and eat into the time allocated for startup).
The quorum isn’t lost either as we have two Manager nodes at the
Ready
state. By any chance, were you running heavy workloads on the remaining manager nodes when this happened, or were you issuing a lot of docker commands (which would build up state quickly)?
In your case I would attempt a restart of the manager and see if it could finally start successfully.
/cc @cyli @dperny You may have a much better idea of what is going on here 😄
@abronan sorry, the manager problem was authentic, but the nodes had a different problem (the rexray volume plugin I had recently uninstalled appeared in the logs again and blocked the swarm subsystem’s startup), so I had already tinkered heavily. When it’s appropriate I’ll intentionally break our lab again and write up the results; I’ll gladly help with resolving this.
Got into this state again after the last upgrade, to Docker 18.05.