moby: [swarm] corrupted manager not able to leave a cluster with --force

Output of docker version:

Client:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 23:54:00 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 23:54:00 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 1.12.0
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 15
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge overlay host null
Swarm: active
 NodeID: 2jml8zh2ap8gnw3ghchc03g09
 Error: rpc error: code = 2 desc = raft: no elected cluster leader
 Is Manager: true
 ClusterID:
 Managers: 0
 Nodes: 0
 Orchestration:
  Task History Retention Limit: 0
 Raft:
  Snapshot interval: 0
  Heartbeat tick: 0
  Election tick: 0
 Dispatcher:
  Heartbeat period: Less than a second
 CA configuration:
  Expiry duration: Less than a second
 Node Address: 192.168.99.100
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 4.4.16-boot2docker
Operating System: Boot2Docker 1.12.0 (TCL 7.2); HEAD : e030bab - Fri Jul 29 00:29:14 UTC 2016
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 995.9 MiB
Name: master1
ID: XFBC:QIPK:ZJH5:MGSZ:D6IA:32XG:TUHL:6E43:HXOQ:FVLW:OY64:HWD4
Docker Root Dir: /mnt/sda1/var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 72
 Goroutines: 218
 System Time: 2016-08-05T09:22:12.607462985Z
 EventsListeners: 4
Registry: https://index.docker.io/v1/
Labels:
 provider=virtualbox
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.): a cluster of 5 VirtualBox VMs

Steps to reproduce the issue:

  1. Create a cluster with 2 managers and 3 workers (see the sketch below this list)
  2. Turn off the laptop
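
For reference, the cluster in step 1 was built roughly like this (a sketch only; the per-node commands and the VM layout are assumptions, and 192.168.99.100 is the manager address reported in docker info above):

# on the first manager (master1)
docker swarm init --advertise-addr 192.168.99.100
# print the join commands for the other nodes
docker swarm join-token manager
docker swarm join-token worker
# on the second manager
docker swarm join --token <manager-join-token> 192.168.99.100:2377
# on each of the three workers
docker swarm join --token <worker-join-token> 192.168.99.100:2377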

Describe the results you received:

docker@master1:~$ docker swarm leave
Error response from daemon: You are attempting to leave cluster on a node that is participating as a manager. The only way to restore a cluster that has lost consensus is to reinitialize it with `--force-new-cluster`. Use `--force` to ignore this message.
docker@master1:~$ docker swarm leave --force
Error response from daemon: context deadline exceeded
docker@master1:~$

manager logs – logs1.txt

It might be related to https://github.com/docker/docker/issues/25395#issuecomment-237718994

Describe the results you expected:

docker@master1:~$ docker swarm leave --force
Node left the swarm.
docker@master1:~$

Additional information you deem important (e.g. issue happens only occasionally):

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 10
  • Comments: 78 (24 by maintainers)

Most upvoted comments

Follow up: ended up removing everything from: /var/lib/docker/swarm/* and just restarting docker. On both systems. That seems to have unjoined it. Not sure if this is the “correct” way (if there even is one at this point for this bug? 😃)
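
In other words, on each of the two nodes, roughly the following (a sketch assuming a systemd-based host; this wipes all swarm state on the node, including any swarm secrets):

sudo systemctl stop docker
sudo rm -rf /var/lib/docker/swarm/*
sudo systemctl start docker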

I have the same issue.

After restarting my docker manager node, which exited abnormally because of a full disk, docker swarm leave, docker swarm leave --force, and docker swarm init --force-new-cluster all fail with the Error response from daemon: context deadline exceeded error.

The only solution is removing /var/lib/docker and restarting the docker daemon.
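
For reference, that more drastic cleanup looks roughly like this (a sketch, systemd assumed; unlike removing only the swarm directory, this deletes all local images, containers, and volumes as well):

sudo systemctl stop docker
sudo rm -rf /var/lib/docker
sudo systemctl start docker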

I found the following solution, a manual cleanup:

  1. sudo service docker stop
  2. sudo rm -Rf /var/lib/docker/swarm
  3. sudo service docker start

@thaJeztah having the same problem, but in a weird position:

In my case, there were only 2 docker hosts (node01 and node02), and the error propagated to both. I decided to re-init the cluster, and on node02 I am hitting the problem above: I can't join an existing cluster, and I can't leave or force-leave. The issue is that node01 has already left the cluster…

So there's no healthy node left to kick out the other one. How can I manually clean up node02 so that I can re-join it to a new cluster?

On node02 getting:

# docker swarm leave
Error response from daemon: context deadline exceeded

# docker swarm leave --force
Error response from daemon: context deadline exceeded

and on node01 – there is no cluster.

Thanks @Fabryprog

I'm posting this here just so people can see that more people are hitting this error.

Same situation here: one manager, one worker. Server failure. Both servers come up again. Both nodes are “active” but “down”. Multiple reboots, some waiting too. Still down. No way to leave the current swarm because of “context deadline exceeded”. The same for “leave --force”. No way to init a new swarm because there is still the old one.

I ended up doing the following:

sudo service docker stop
sudo rm -rf /var/lib/docker/swarm
sudo service docker start
docker swarm init

This just happened on 1.12.5:

# docker swarm leave
Error response from daemon: context deadline exceeded

It looks like the issue still exists somewhere

In the allocator, a.netCtx is initialized too late, so if taskCreateNetworkAttachments is called as part of allocator initialization, it dereferences a nil pointer. This panic causes a deadlock because the wg.Done call in allocator.go runs after initialization instead of being deferred. I will open a PR against swarmkit to fix both problems.

cc @mrjana

This is happening in a reverse proxy setup

root@ip-172-31-7-30:/home/ubuntu# docker swarm leave
Error response from daemon: You are attempting to leave the swarm on a node that is participating as a manager. Removing the last manager erases all current state of the swarm. Use --force to ignore this message.

root@ip-172-31-7-30:/home/ubuntu# docker swarm leave --force
Node left the swarm.

Still inside the swarm

can’t leave with --force on 18.06.0-ce

Happened again on a “single-node Swarm”, version: 18.03.1-ce.

I can post any logs from the system if useful, but unfortunately I can’t reproduce this on purpose.

Context

  • The whole machine hung and had to be rebooted (we're investigating the cause; it may be totally unrelated to Docker).
  • After restarting it, docker stats worked, but no containers were being created, with the error: no suitable node (1 node not available for new tasks).
  • We had to execute docker swarm leave -f multiple times for it to work.
  • We use “restrictions” and “limits”.
  • We execute docker service update --image every few minutes on all the services. Since some services don’t yet have images available, our sudo journalctl -fu docker.service is full of this stuff: level=error msg="fatal task error" error="No such image:.

System

Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:17:38 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:15:45 2018
  OS/Arch:      linux/amd64
  Experimental: false

docker-compose version 1.21.0-rc1, build 1d32980
4.13.0-39-generic

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 17.10
Release:	17.10
Codename:	artful

Docker info

Containers: 387
 Running: 20
 Paused: 0
 Stopped: 367
Images: 46
Server Version: 18.03.1-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 992
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: jejb6my7n50ulnilktgd2fxof
 Is Manager: true
 ClusterID: 0yus20uq607uzugzqbv9a1vzr
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 192.168.2.102
 Manager Addresses:
  192.168.2.102:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.13.0-39-generic
Operating System: Ubuntu 17.10
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 23.54GiB
Name: linuxcompany
ID: FLM4:OCTS:BWQR:ZLRQ:HVGG:VBSW:O5NE:IA2W:4Z6T:SS47:4BSE:A6ZT
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: daybreakhotels
Registry: https://index.docker.io/v1/
Labels:
 provider=generic
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Other logs

The two services in the logs below are configured to have 2 replicas each; they are visible in docker service ls but are never created because we currently don't have images for them. Probably unrelated, but included just in case.

-- Logs begin at Mon 2018-04-16 06:59:12 CEST. --
May 07 00:09:24 linuxcompany dockerd[4572]: time="2018-05-07T00:09:24.279477606+02:00" level=warning msg="failed to deactivate service binding for container ourwebapp_webtest_alpha2.zzzzcqo6e92w9tgiedonjphg1" error="No such container: ourwebapp_webtest_alpha2.zzzzcqo6e92w9tgiedonjphg1" module=node/agent node.id=j5rdwstrbw2534pfiq6xomt0w
May 07 00:09:24 linuxcompany dockerd[4572]: time="2018-05-07T00:09:24.279481142+02:00" level=warning msg="failed to deactivate service binding for container ourwebapp_webtest_beta.2.zzy7upc4qtbwls8ni53c086tx" error="No such container: ourwebapp_webtest_beta.2.zzy7upc4qtbwls8ni53c086tx" module=node/agent node.id=j5rdwstrbw2534pfiq6xomt0w
May 07 00:09:24 linuxcompany dockerd[4572]: time="2018-05-07T00:09:24.279496756+02:00" level=warning msg="failed to deactivate service binding for container ourwebapp_webtest_alpha1.ja6p9qhvuemnkusgqa9sxdtdr" error="No such container: ourwebapp_webtest_alpha1.ja6p9qhvuemnkusgqa9sxdtdr" module=node/agent node.id=j5rdwstrbw2534pfiq6xomt0w
May 07 00:09:24 linuxcompany dockerd[4572]: time="2018-05-07T00:09:24.279503004+02:00" level=warning msg="failed to deactivate service binding for container ourwebapp_webtest_alpha2.zzzsnzby1ew5b1zrz632158q7" error="No such container: ourwebapp_webtest_alpha2.zzzsnzby1ew5b1zrz632158q7" module=node/agent node.id=j5rdwstrbw2534pfiq6xomt0w
May 07 00:09:24 linuxcompany dockerd[4572]: time="2018-05-07T00:09:24.279551691+02:00" level=warning msg="failed to deactivate service binding for container ourwebapp_webtest_alpha2.g38yg4rd0ftz464420ehj32cf" error="No such container: ourwebapp_webtest_alpha2.g38yg4rd0ftz464420ehj32cf" module=node/agent node.id=j5rdwstrbw2534pfiq6xomt0w
May 07 00:09:24 linuxcompany dockerd[4572]: time="2018-05-07T00:09:24.279562380+02:00" level=warning msg="failed to deactivate service binding for container ourwebapp_webtest_beta.2.zzzxuf7vqy4n1b9ni32h13tdi" error="No such container: ourwebapp_webtest_beta.2.zzzxuf7vqy4n1b9ni32h13tdi" module=node/agent node.id=j5rdwstrbw2534pfiq6xomt0w

I removed a worker node from the manager before leaving the swarm on the worker; then I got this when I tried docker swarm leave -f on the worker node.
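
Roughly the sequence that triggered it (a sketch; the node name is illustrative, and I'm assuming the removal was done with docker node rm, forced because the worker was still up):

# on the manager
docker node rm --force worker1
# then on the worker, which had not yet left the swarm
docker swarm leave -f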

@Fabryprog that solution will wipe any secrets stored in the swarm along with some other stuff. Provided the swarm isn't locked, a more precise cleanup is:

# using systemd
sudo systemctl stop docker
# make sure to make a backup if you delete something wrong
sudo cp -ar /var/lib/docker/swarm/ /tmp/swarm.bak
sudo nano /var/lib/docker/swarm/state.json

state.json will look something like this:

[{"node_id":"nodeidofhealthynode","addr":"123.123.123.123:2377"},
{"node_id":"nodeidofunhealthynode","addr":"123.123.123.124:2377"}]

You want to delete any entries for unhealthy nodes, so that just the one healthy manager node is left:

[{"node_id":"nodeidofhealthynode","addr":"123.123.123.123:2377"}]

Lastly, restart docker:

sudo systemctl start docker

I have the same problem. The manager is not running (but the swarm is), and the worker is still inside the swarm!

docker swarm leave --force
Error response from daemon: context deadline exceeded

Server Version: 17.05.0-ce

@ventz Funny, I encountered the bug again after I switched to a new router. I have three managers this time, but the second node will neither join nor leave.

Removing the files in /var/lib/docker/swarm and restarting Docker did the trick. Before doing that I tried just clearing state.json and docker-state.json from the directory, since they had the IPs of the old nodes (from before I switched to the new router). That didn't work; I had to take out the whole contents of the swarm directory.

@thaJeztah thanks for the links. They are informative.

The only reason I had two managers was that I accidentally joined the other node as a manager instead of a worker (I was playing with my Pis). So it was a big surprise when this issue bit me; I never expected it would be a big deal.

In hindsight, before I encountered the bug I had been setting up a ZooKeeper ensemble and wondered why Docker didn't also recommend an odd number of managers. It turns out I hadn't read enough.

@alias1 swarm leave --force timing out is not expected behavior. If #27594 didn't work for you, there have been some more recent fixes: https://github.com/docker/swarmkit/pull/1857

@rogaha sorry for the late response. No, this was on 1.12.1.

I have the same issue.

Restarting the manager results in this:

docker node ls
Error response from daemon: rpc error: code = 2 desc = raft: no elected cluster leader

I am facing the same issue on Amazon EC2 Ubuntu AMI.


ubuntu@vm-swarm-manager:~$ sudo docker swarm leave --force
Error response from daemon: context deadline exceeded