moby: Docker system prune stuck in locked mode
I ran docker system prune yesterday; it took some time and then my SSH session was disconnected for an unrelated reason.
Unfortunately, I now get:
Error response from daemon: a prune operation is already running.
Obviously there is a lock, and the prune command is not running anymore.
Steps to reproduce the issue:
- docker system prune
Describe the results you received: “a prune operation is already running.”
Describe the results you expected: Automatic unlock after a certain amount of time, self-healing, or a possibility to unlock
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:11:19 2017
OS/Arch: linux/amd64
Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:09:53 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
Containers: 40
Running: 25
Paused: 0
Stopped: 15
Images: 261
Server Version: 17.12.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: bu0xu0nf5r9g1ydlblnjf4rdi
Is Manager: true
ClusterID: zkqa5nrgqn042xedq172mwz5v
Managers: 3
Nodes: 3
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 2
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 10.10.10.3
Manager Addresses:
10.10.10.1:2377
10.10.10.2:2377
10.10.10.3:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 89623f28b87a6004d4b785663257362d1658a729
runc version: b2567b37d7b75eb4cf325b77297b140ea686ce8f
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-112-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 62.88GiB
Name: xxxxx
ID: CW4M:4OEM:N3QG:UHYR:NF64:SZVT:IDGC:7O6L:LILC:UPYG:S6TG:5URD
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Additional environment details (AWS, VirtualBox, physical, etc.): Physical, private cluster
About this issue
- State: open
- Created 6 years ago
- Reactions: 22
- Comments: 47 (16 by maintainers)
I can confirm that prune got stuck because of a non-responding container. The prune went through after I first killed the container with
kill -9 PROCESS_ID
where I got the process ID from ps aux | grep 'docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/CONTAINER_ID'.
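Putting those two steps together, a minimal sketch of that workaround (my own wording, assuming CONTAINER_ID holds the full ID of the unresponsive container and that the shim binary is still named docker-containerd-shim, as on 17.12):

# CONTAINER_ID is assumed to be set to the full ID of the stuck container.
# Find the docker-containerd-shim process that references it and kill it.
SHIM_PID=$(ps aux | grep "[d]ocker-containerd-shim" | grep "$CONTAINER_ID" | awk '{print $2}')
if [ -n "$SHIM_PID" ]; then
  kill -9 "$SHIM_PID"
fi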
The problem is that you first need to know that there is a container that does not respond to Docker 😕 The container itself works (i.e. node.js runs fine), but Docker is not even able to inspect it.
Btw, this container should not even be there, because we run
docker service update...
with the :latest image. Docker created another container and this one was not killed, so there were two running containers with two different versions.

The solution described in https://github.com/moby/moby/issues/36447#issuecomment-373273071 worked for me as well. Thanks very much for that detailed report!
Finding defective containers, by the way, is very easy with
for i in $(docker ps -q) ; do echo $i && docker inspect $i ; done
If the last line of output is a bare container hash instead of the shell prompt returning, that is the defective container (inspect is hanging on it).
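A variant of that loop (my own sketch, assuming the coreutils timeout command is available) that flags hanging containers automatically instead of requiring you to watch where the output stops:

# Print the ID of any container whose docker inspect does not return within 5 seconds.
for i in $(docker ps -q); do
  timeout 5 docker inspect "$i" > /dev/null 2>&1 || echo "possibly stuck: $i"
done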
Hi! I have the same problem and the exact same version. Docker is working in swarm mode. The problem is that there is no /var/lib/docker/ directory, and I can't kill the process. Any ideas?

@cgoeller Thanks. Have you tried with 17.12.1 yet? It might be fixed there, but no guarantees, as this deadlock can happen in a few places, only some of which were fixed in 17.12.1. There are several more patches coming in 18.03.1.
Looks like the right commits, yes.
I can confirm that prune hangs because of an unresponsive container that got stuck during exec. We are running Kubernetes and Docker 17.12.0-ce and sometimes hit the same issue because of a daily cron job that prunes the system.
We can see the hung container with docker ps -a, stuck in the Created status:
17492dcb9f49 gcr.io/google_containers/pause-amd64:3.0 "/pause" 2 days ago Created k8s_POD_...
and also the corresponding containerd-shim and docker-runc processes:
root 12651 0.0 0.0 7512 3632 ? Sl Mar16 0:00 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/17492dcb9f4989953cac832453b0132d28bafa622ff93bac2824ce0bebf698f9 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
root 13405 0.0 0.0 122168 9520 ? Sl Mar16 0:00 docker-runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/17492dcb9f4989953cac832453b0132d28bafa622ff93bac2824ce0bebf698f9/log.json --log-format json start 17492dcb9f4989953cac832453b0132d28bafa622ff93bac2824ce0bebf698f9
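For anyone hunting for the same pattern, a small sketch of my own (assuming the process names above and that the status filter behaves as on this Docker version) to list containers stuck in Created together with their leftover shim processes:

# List containers stuck in the Created state.
docker ps -a --filter status=created --format '{{.ID}} {{.Image}} {{.Status}}'
# For each of them, show any docker-containerd-shim process that still references it.
for id in $(docker ps -aq --filter status=created); do
  ps aux | grep "[d]ocker-containerd-shim" | grep "$id"
done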
Dockerd stack trace: goroutine-stacks-2018-03-19T032810-0500.log
Containerd stack trace: containerd-trace.log
Killing the containerd-shim process removes the unresponsive container. Then, after terminating the hung prune command, it can be executed again without the
a prune operation is already running
message. Unfortunately, I cannot find a way to reproduce the issue, so I cannot tell whether bumping the Docker version will help, but similar issues with hung containers apparently exist even on 18.03.
It would be very helpful if somebody could confirm that prune does not get stuck on >=17.12.1.
Update: I had to restart the server;
service docker restart
did not work and the command got stuck, so I killed it after 10 minutes.
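As a defensive measure for the daily prune cron mentioned above, one sketch of my own (assuming coreutils timeout is available); note that this only keeps the client from hanging forever, it does not release the daemon-side prune lock, which is the actual bug here:

# Nightly prune with a client-side time limit so hung runs do not pile up.
timeout 30m docker system prune -f || echo "docker system prune timed out or failed" >&2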