moby: [swarm] corrupted manager not able to leave a cluster with --force
Output of `docker version`:
Client:
Version: 1.12.0
API version: 1.24
Go version: go1.6.3
Git commit: 8eab29e
Built: Thu Jul 28 23:54:00 2016
OS/Arch: linux/amd64
Server:
Version: 1.12.0
API version: 1.24
Go version: go1.6.3
Git commit: 8eab29e
Built: Thu Jul 28 23:54:00 2016
OS/Arch: linux/amd64
Output of `docker info`:
Containers: 4
Running: 4
Paused: 0
Stopped: 0
Images: 1
Server Version: 1.12.0
Storage Driver: aufs
Root Dir: /mnt/sda1/var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 15
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge overlay host null
Swarm: active
NodeID: 2jml8zh2ap8gnw3ghchc03g09
Error: rpc error: code = 2 desc = raft: no elected cluster leader
Is Manager: true
ClusterID:
Managers: 0
Nodes: 0
Orchestration:
Task History Retention Limit: 0
Raft:
Snapshot interval: 0
Heartbeat tick: 0
Election tick: 0
Dispatcher:
Heartbeat period: Less than a second
CA configuration:
Expiry duration: Less than a second
Node Address: 192.168.99.100
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 4.4.16-boot2docker
Operating System: Boot2Docker 1.12.0 (TCL 7.2); HEAD : e030bab - Fri Jul 29 00:29:14 UTC 2016
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 995.9 MiB
Name: master1
ID: XFBC:QIPK:ZJH5:MGSZ:D6IA:32XG:TUHL:6E43:HXOQ:FVLW:OY64:HWD4
Docker Root Dir: /mnt/sda1/var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 72
Goroutines: 218
System Time: 2016-08-05T09:22:12.607462985Z
EventsListeners: 4
Registry: https://index.docker.io/v1/
Labels:
provider=virtualbox
Insecure Registries:
127.0.0.0/8
Additional environment details (AWS, VirtualBox, physical, etc.): a cluster of 5 VirtualBox VMs
Steps to reproduce the issue:
- Create a cluster with 2 managers and 3 workers
- Turn off the laptop
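A sketch of that setup (the manager name and address follow the `docker info` output above; the join tokens come from your own `swarm init` output):

```
# on the first manager VM (master1 in this report)
docker swarm init --advertise-addr 192.168.99.100

# print the join command for a second manager, then run it on that VM
docker swarm join-token manager

# print the join command for workers, then run it on the 3 worker VMs
docker swarm join-token worker

# finally, power the laptop off with all five VMs still running
```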
Describe the results you received:
docker@master1:~$ docker swarm leave
Error response from daemon: You are attempting to leave cluster on a node that is participating as a manager. The only way to restore a cluster that has lost consensus is to reinitialize it with `--force-new-cluster`. Use `--force` to ignore this message.
docker@master1:~$ docker swarm leave --force
Error response from daemon: context deadline exceeded
docker@master1:~$
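For reference, the recovery path the first error message points to is re-seeding a single-manager cluster from this node's local state; in this bug it times out the same way, as comments below report:

```
# documented lost-quorum recovery, run on the stuck manager
# (here it also fails with "context deadline exceeded")
docker swarm init --force-new-cluster --advertise-addr 192.168.99.100
```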
manager logs – logs1.txt
It might be related to https://github.com/docker/docker/issues/25395#issuecomment-237718994
Describe the results you expected:
docker@master1:~$ docker swarm leave --force
Node left the swarm.
docker@master1:~$
Additional information you deem important (e.g. issue happens only occasionally):
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 10
- Comments: 78 (24 by maintainers)
Follow up: ended up removing everything from: /var/lib/docker/swarm/* and just restarting docker. On both systems. That seems to have unjoined it. Not sure if this is the “correct” way (if there even is one at this point for this bug? 😃)
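A sketch of that cleanup, assuming a systemd host (on Boot2Docker, use `sudo /etc/init.d/docker stop`/`start` instead); note that this destroys all swarm state on the node, including secrets:

```
# run on each stuck node
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/swarm/*
sudo systemctl start docker
```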
I have the same issue.
After restarting my Docker manager node, which had exited abnormally because the disk was full, `docker swarm leave`, `docker swarm leave --force`, and `docker swarm init --force-new-cluster` all fail with the `Error response from daemon: context deadline exceeded` error. The only solution is removing `/var/lib/docker` and restarting the Docker daemon. That “manual cleanup” is the only fix I found!
@thaJeztah having the same problem, but in a weird position:
In my case, there were only 2 Docker hosts (node01 and node02), and the problem is that the error propagated to both. I decided to re-init the cluster, and so on node02 I am having the problem above. I can’t join an existing cluster, and I can’t leave or force-leave. But the issue is that node01 has already left the cluster…
So there’s no healthy node left to kick out the other one. How can I manually clean up node02 so that I can re-join it to a new cluster?
On node02 I’m getting:
and on node01 there is no cluster.
Thanks @Fabryprog
I’m posting this here just so people can see that more people are hitting this bug.
Same situation here: one manager, one worker. Server failure. Both servers come up again. Both nodes are “active” but “down”. Multiple reboots, some waiting too. Still down. No way to leave the current swarm because of “context deadline exceeded”. The same for “leave --force”. No way to init a new swarm because there is still the old one.
I ended up doing the following:
This just happened on 1.12.5. It looks like the issue still exists somewhere.
In the allocator, `a.netCtx` is initialized too late, so if `taskCreateNetworkAttachments` is called as part of allocator initialization, it will dereference a nil pointer. This panic causes a deadlock because the `wg.Done` in `allocator.go` is called after initialization, instead of being deferred. I will open a PR against swarmkit to fix both problems. cc @mrjana
This is happening in a reverse proxy setup
root@ip-172-31-7-30:/home/ubuntu# docker swarm leave
Error response from daemon: You are attempting to leave the swarm on a node that is participating as a manager. Removing the last manager erases all current state of the swarm. Use `--force` to ignore this message.
root@ip-172-31-7-30:/home/ubuntu# docker swarm leave --force
Node left the swarm.
Still inside the swarm
Can’t leave with `--force` on 18.06.0-ce.
Happened again on a “single-node Swarm”, version 18.03.1-ce. I can post any logs from the system if useful, but unfortunately I can’t reproduce this on purpose.
Context
- `docker stats` worked, but no containers were being created; tasks failed with the error `no suitable node (1 node not available for new tasks)`.
- Had to run `docker swarm leave -f` multiple times for it to work.
- We run `docker service update --image` every few minutes on all the services. Since some services don’t yet have images available, our `sudo journalctl -fu docker.service` is full of this stuff: `level=error msg="fatal task error" error="No such image:`
.System
Docker info
### Other logs
The two services in the logs below are configured to have 2 replicas each; they are visible in `docker service ls` but are never created because we currently don’t have images for them. Probably unrelated, but just in case.
I removed a worker node from the manager before leaving the swarm on the worker, and then I got this when I tried `docker swarm leave -f` on the worker node.
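For comparison, the removal order that avoids this state is leave-then-remove, sketched here with a placeholder node name:

```
# on the worker: leave the swarm first
docker swarm leave

# then on a manager: remove the now-down node from the node list
docker node rm <worker-node>
# if the node never left cleanly, force the removal
docker node rm --force <worker-node>
```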
on the worker node.@Fabryprog that solution will wipe any secrets stored in the swarm along with some other stuff. Provided the swarm isn’t locked, then a more precise clean up is:
state.json
will look something like this:You want to delete any entries to unhealthy nodes. so just one healthy manager node is left
lastly restart docker
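A sketch of that more precise cleanup, assuming a systemd host and the default data root (the thread places `state.json` under `/var/lib/docker/swarm/`):

```
# stop the daemon before touching raft state
sudo systemctl stop docker

# delete the entries for unhealthy nodes from the peer list,
# leaving only the one healthy manager
sudoedit /var/lib/docker/swarm/state.json

# lastly, restart docker
sudo systemctl start docker
```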
I have the same problem. The manager is not running (but the swarm is), and the worker is still inside the swarm!
Server Version: 17.05.0-ce
@ventz Funny, I encountered the bug again after I switched to a new router. I have three managers this time, but the second node will neither join nor leave.
Removing the files in /var/lib/docker/swarm and restarting Docker did the trick. Before doing that, I tried just clearing state.json and docker-state.json from the directory; they contain the IPs of the old nodes (from before I switched to the new router). That didn’t work; I had to take out the whole contents of the swarm directory.
@thaJeztah thanks for the links. They are informative.
The only reason I had two managers was that I accidentally joined the other node as a manager rather than a worker (I was playing with my Pis). So it was a big surprise when this issue bit me; I never expected it would be a big deal.
In hindsight, earlier, before I encountered the bug, I was setting up a ZooKeeper ensemble and wondered why Docker didn’t recommend an odd number of managers as well. It turns out I just hadn’t read enough.
@alias1 `swarm leave --force` timing out is not expected behavior. If #27594 didn’t work for you, then there have been some more recent fixes: https://github.com/docker/swarmkit/pull/1857
@rogaha sorry for the late response. No, this was on 1.12.1.
I have the same issue.
Restarting the manager results in this:
I am facing the same issue on Amazon EC2 Ubuntu AMI.