moby: Can't stop docker container

Description

Can’t stop container.

I’m starting and removing containers concurrently using docker-compose. Sometimes it fails to remove the containers.
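
A minimal sketch of the kind of repeated up/down cycling described above (not the actual Jenkins job; it assumes a docker-compose.yml in the current directory):

# Cycle a compose project up and down; when the bug triggers,
# docker-compose down (or the underlying docker stop) simply never returns.
while true; do
    docker-compose up -d || break
    docker-compose down || break
done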

I confirmed that I can't docker stop the container. The command hangs, and after switching the docker daemon to debug mode I only see this line when I run the command:

dockerd[101922]: time="2018-01-04T15:54:07.406980654Z" level=debug msg="Calling POST /v1.35/containers/4c2b5e7f466c/stop"
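
For anyone trying to gather the same debug output, a sketch of one way to switch the daemon into debug mode on a systemd host (it assumes there is no existing /etc/docker/daemon.json that needs to be preserved):

echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json
sudo kill -SIGHUP $(pidof dockerd)      # dockerd re-reads its config, including "debug", on SIGHUP
sudo journalctl -u docker.service -f    # follow the debug-level daemon logs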

Steps to reproduce the issue:

  1. Run tests in Jenkins (which start and remove containers concurrently via docker-compose).
  2. Eventually it fails to remove containers.

Describe the results you received:

Can’t stop container.

Describe the results you expected:

The container should have been stopped and then removed.

Additional information you deem important (e.g. issue happens only occasionally):

Issue happens only occasionally

Output of docker version:

Client:
 Version:	17.12.0-ce
 API version:	1.35
 Go version:	go1.9.2
 Git commit:	c97c6d6
 Built:	Wed Dec 27 20:10:14 2017
 OS/Arch:	linux/amd64

Server:
 Engine:
  Version:	17.12.0-ce
  API version:	1.35 (minimum version 1.12)
  Go version:	go1.9.2
  Git commit:	c97c6d6
  Built:	Wed Dec 27 20:12:46 2017
  OS/Arch:	linux/amd64
  Experimental:	false

Output of docker info:

Containers: 6
 Running: 1
 Paused: 0
 Stopped: 5
Images: 75
Server Version: 17.12.0-ce
Storage Driver: devicemapper
 Pool Name: docker-253:0-33643212-pool
 Pool Blocksize: 65.54kB
 Base Device Size: 10.74GB
 Backing Filesystem: xfs
 Udev Sync Supported: true
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Data Space Used: 31.43GB
 Data Space Total: 107.4GB
 Data Space Available: 75.95GB
 Metadata Space Used: 35.81MB
 Metadata Space Total: 2.147GB
 Metadata Space Available: 2.112GB
 Thin Pool Minimum Free Space: 10.74GB
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 1
 Library Version: 1.02.140-RHEL7 (2017-05-03)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 89623f28b87a6004d4b785663257362d1658a729
runc version: b2567b37d7b75eb4cf325b77297b140ea686ce8f
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-693.11.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 36
Total Memory: 117.9GiB
Name: jenkins-node.com
ID: 5M6L:G2KF:732H:Y7RF:QHNO:3XM4:U6RV:U5QR:ANPA:7XRZ:M3S4:GUZC
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 37
 Goroutines: 51
 System Time: 2018-01-04T16:02:36.54459153Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: devicemapper: usage of loopback devices is strongly discouraged for production use.
         Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
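
For reference, a sketch of the kind of dockerd invocation the warning above is pointing at (the thin-pool device name is the example used in the Docker docs, not a device from this host):

# Point devicemapper at a real LVM thin pool instead of loopback files.
dockerd --storage-driver devicemapper \
        --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool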

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 35
  • Comments: 146 (48 by maintainers)

Most upvoted comments

I get the same issue, though without using docker-compose. I'm using docker swarm. Same thing though: I occasionally get containers that neither docker swarm nor I, via the docker CLI, can stop. This causes docker swarm to end up accumulating more replicas than desired that it can't scale down. Sometimes these replicas can still service requests and receive traffic. The only way to remove the containers is to restart docker on the affected node.

The "stable" channel 17.12.0 version still has this bug; if it's fixed, could that PR be back-ported to a patch release 17.12.1? The stable channel is pretty unstable if people are having to revert all the way to 17.09 or resort to an edge release.

Thank you all for confirming; I’ll go ahead and close this issue.

If you still run into this on Docker 18.03.1 or above, please open a new issue with details.

It’s being worked on. Thanks!

Sorry to warm up this thread - it looks like the fix is coming 😉 - but I have a quick question: we're seeing the exact same issue on docker-ce-17.12 since we added HEALTHCHECKs to our Dockerfiles. The containers without a HEALTHCHECK specified in their Dockerfiles stop just fine. Could this be related to the HEALTHCHECKs, or is this just a coincidence?

Cheers Phil
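
For context, a minimal sketch of the kind of Dockerfile HEALTHCHECK being discussed (the command, endpoint, and intervals are placeholders, not taken from Phil's Dockerfiles):

# Dockerfile fragment: the periodic health probe that, per the reports in this
# thread, seems to correlate with containers becoming unstoppable on 17.12.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1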

@timdau We are still on 17.09 in production because this is the most stable version for us due to these “unstoppable containers”

We have stopped using 17.12 completely and rolled back to 17.09 because of this problem on 17.12 (macOS and apparently Linux as well).

This is a critical, persistent problem.

And unfortunately I have not found a way to recreate it except by using docker a lot.

I have the same issue with docker swarm. I remove one of multiple docker stacks, but only some of the containers in the stack are removed, while some containers hang around. Running docker inspect or docker rm on the hung containers just hangs on the command line until I Ctrl-C. I need to reboot to get the containers removed. I did not have the issue in 17.09, only after upgrading to 17.12.0-ce (I also had the problem on 17.12.0-ce-rc4).

I have the issue on an Azure VM: docker info

 Running: 83
 Paused: 0
 Stopped: 12
Images: 579
Server Version: 17.12.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: hy0kx44q5m9jg0lc1n5ylxkw6
 Is Manager: true
 ClusterID: ordhsz694y98k3r4604ksc937
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 2
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.0.0.10
 Manager Addresses:
  10.0.0.10:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 89623f28b87a6004d4b785663257362d1658a729
runc version: b2567b37d7b75eb4cf325b77297b140ea686ce8f
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-104-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 27.47GiB
Name: build-agent-vm001
ID: S7WY:RCKF:G3P7:TI3H:MJ2F:UXZ3:C5DS:YQG3:OPF4:V4RS:5EQ7:AWG4
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

I also have the same issue on Docker for Mac (Edge: 17.12): docker info

 Running: 65
 Paused: 0
 Stopped: 45
Images: 607
Server Version: 17.12.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: qfzh0tqkchl2m42uhju7k3ml4
 Is Manager: true
 ClusterID: q14zy6epqkpx0w112wusdtd3u
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 2
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 192.168.65.3
 Manager Addresses:
  192.168.65.3:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 89623f28b87a6004d4b785663257362d1658a729
runc version: b2567b37d7b75eb4cf325b77297b140ea686ce8f
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.60-linuxkit-aufs
Operating System: Docker for Mac
OSType: linux
Architecture: x86_64
CPUs: 6
Total Memory: 5.817GiB
Name: linuxkit-025000000001
ID: DSXX:YVTO:DLFW:MN3X:MTJC:3EGK:MUYT:6JMN:C2NC:TQMW:BE44:3P6H
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 260
 Goroutines: 491
 System Time: 2018-01-09T00:13:09.053688513Z
 EventsListeners: 28
HTTP Proxy: docker.for.mac.http.internal:3128
HTTPS Proxy: docker.for.mac.http.internal:3128
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

@mavogel I had the same problem with freezing docker containers. The solution for me was to move logging from /dev/stderr to an internal file inside the docker container; then the problem was gone. There is probably some disk issue when a container logs to /dev/stderr, and that is probably the cause of most of these problems.
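
A rough sketch of that workaround, assuming the container's entrypoint is a shell script you control (the paths and the application name are hypothetical):

#!/bin/sh
# entrypoint.sh: write application output to a file inside the container
# instead of /dev/stderr, as suggested in the comment above.
exec /usr/local/bin/myapp >> /var/log/myapp.log 2>&1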

I’m experiencing the same issue in multiple servers using 17.12. As @rfay said, it didn’t happen on 17.09.

Checking the changelog, a major difference between 17.12 and 17.09 is that, since 17.11, Docker is based on containerd. So, as the evidence seems to indicate this is an issue in the runtime, it may be worth investigating down this path.

We also have the same issue on 17.12.1-ce

Over time, containers enter a state where docker ps and docker inspect hang. Forcing the swarm to redeploy the service makes the container enter a zombie state (Desired state: Shutdown, Current status: Running).

docker kill does not work. One way to kill the container is to run ps aux | grep [container_id] and then kill [process_id].

Is there any information needed that I can provide?
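
A sketch of the manual kill workaround mentioned above (the container ID is a placeholder; use with care, since it bypasses Docker entirely):

ps aux | grep <container_id>      # locate the container's main process (or its docker-containerd-shim)
sudo kill <process_id>            # try a plain SIGTERM first
sudo kill -9 <process_id>         # last resort if the process ignores SIGTERM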

+1 for a patch release 17.12.1

The 18.03.1 version seems to have fixed this issue (or mitigated it, as @cpuguy83 said).

I tested in 4 clusters.

There are multiple issues at play here and there have been multiple fixes that take care of different areas.

The same happens to me in docker-CE 17.12.0

17.12.1 has been out for some time now. It doesn’t fix all issues but it does fix some. Please update. There are other fixes available in 18.03.0, but it may be worth waiting for 18.03.1 which should be out soon.

This issue is still open because we understand it’s not fixed and it is being worked on. If you want to help there are a number of ways to contribute outside of narrowing down cases… e.g. specific/consistent repro steps, stack traces from an updated docker instance (and containerd and a containerd-shim also helpful), etc.

Coming on here and making false claims and silly posturing is not helpful at all.

The same happens to me in docker-CE 17.12.0 (in 3 clusters); I'm rolling back to 17.09. It's incredible that Docker now has this kind of critical bug in two LTS versions and doesn't fix it… I understand that it may be difficult to reproduce, but this is happening to a lot of people…

Is it because there is now an EE version, and their efforts now go into that EE 2.2.x version (Docker 17.06.x)?

@jcberthon I'm really curious about the result, because I'm seeing people who have problems with 18.03.0 as well. @JnMik We decided to downgrade to 17.09.1 until this issue is resolved, since it was happening way too often on 17.12 and 18.02.

@cpuguy83 Is there a list somewhere of all of the issues related to this problem? That way we can know for sure when this issue is resolved and it's safe to upgrade.

18.03.1 is not on the release page yet: https://docs.docker.com/release-notes/docker-ce/ - or am I blind?

Those docs are out of date; you can see here: https://github.com/docker/docker-ce/releases/tag/v18.03.1-ce - released 11 days ago.

I’ve made some changes to my infrastructure to afford myself the luxury of being able to spend some time collecting logs/information the next time that this happens on my production systems.

I’m currently on Ubuntu 16.04.4 LTS running docker-ce 18.03.1 and Linux Kernel 4.13.0-39-generic x86_64.

Can someone confirm that this is all of the information that would need to be collected in order to provide enough information to help troubleshoot this issue?

  1. docker inspect {container-id} > docker-inspect-container.log
  2. ps aux | grep {container-id} to get the docker-containerd-shim pid
  3. To get a stack dump from docker-containerd-shim do kill -s SIGUSR1 {docker-containerd-shim-pid}. This should generate a stack trace in the logs for dockerd.
  4. sudo journalctl -u docker.service --since today > docker-service-log.txt
  5. docker info
  6. docker version
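
A sketch that bundles the checklist above into a single collection script (the file names are arbitrary and <container-id> is a placeholder; it assumes the shim's command line includes the container ID, which is typically the case since the shim is started with the container's bundle path):

#!/bin/sh
# Collect the diagnostics listed above for one stuck container.
CID=<container-id>

docker inspect "$CID" > docker-inspect-container.log &       # backgrounded in case inspect hangs too
SHIM_PID=$(ps aux | grep '[d]ocker-containerd-shim' | grep "$CID" | awk '{print $2}')
kill -s SIGUSR1 "$SHIM_PID"                                   # stack trace ends up in the dockerd logs
sudo journalctl -u docker.service --since today > docker-service-log.txt
docker info    > docker-info.txt
docker version > docker-version.txt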

@victorvarza see the earlier comments: https://github.com/moby/moby/issues/35933#issuecomment-378957035 - if you’re on 17.12; at least upgrade to 17.12.1, but given that 17.12 reached EOL, consider 18.03 (but you may want to wait for 18.03.1, which will have some fixes)

We are also sticking to 17.09.1 because newer versions are not working for us.

I think I have a related problem

I upgraded our dev environment to the latest 17.12.1-ce, build 7390fc6, last week, and it's the first time I've seen this error.

A developer tried to update an application, and swarm is unable to delete an old container of the previous version on a specific node in the cluster. I found out because developers started complaining about an intermittent white-page syndrome.

When I do a docker service ps on the service, here's what I see: https://www.screencast.com/t/LXAfmddRDp - the old container is running but in shutdown state.

On the node, I see the container as if it were running in a healthy way: https://www.screencast.com/t/ABKVYxNUQ

And from docker service ls, I have more containers than expected: https://www.screencast.com/t/0Po8Sqs0Jr

I tried running docker kill and docker inspect on the container from the node, but it's not working. I do not see any specific message in dmesg.

That's all I can tell for now. I'll remove the stack and launch it again so developers are able to continue their work.

Hope it helps

EDIT:

  • Stack rm did not fix the issue; the zombie container was still on the node
  • Setting the node availability to drain did NOT fix the issue; the node was left with only the zombie container on it
  • service docker restart does not respond
  • Finally, rebooting the node removed all the containers.

I saw some error like this on the node during the process

Mar 13 10:04:10 server-name dockerd: time="2018-03-13T10:04:10.406196465-04:00" level=error msg="Failed to load container f5d6bb74d6b37871b72b5f27d46f8705a6b66cba7afb50706bbf68b764facb24: open /var/lib/docker/containers/f5d6bb74d6b37871b72b5f27d46f8705a6b66cba7afb50706bbf68b764facb24/config.v2.json: no such file or directory"
Mar 13 10:04:10 server-name dockerd: time="2018-03-13T10:04:10.408039262-04:00" level=error msg="Failed to load container fd5ac869991b263a28c36bddf9b2847a8a26e2b7d59fa033f85e9616b0b7cb7a: open /var/lib/docker/containers/fd5ac869991b263a28c36bddf9b2847a8a26e2b7d59fa033f85e9616b0b7cb7a/config.v2.json: no such file or directory"

EDIT2: Found somebody else with the same issue : https://github.com/moby/moby/issues/36553

I experience the same bug. It is not consistent though. I don’t see a pattern yet but it does happen.

I am running Docker for Mac Version 17.12.0-ce-mac46 (21698). I am not running Docker in Docker.

Container is created by docker-compose up.

Yes, I can see that the container is still running, but stop or kill just hangs and does nothing.

10:13:13 Alexei-Workstation /Users/alexei.chekulaev/Projects/SBD-MASTER
$ docker ps
CONTAINER ID        IMAGE                     COMMAND                  CREATED             STATUS                    PORTS                                                    NAMES
f0e36d3589d3        docksal/cli:1.3-php7      "/opt/startup.sh sup…"   44 hours ago        Up 28 minutes (healthy)   22/tcp, 9000/tcp                                         sbdmaster_cli_1
b93c84c9a3a3        docksal/ssh-agent:1.0     "/run.sh ssh-agent"      44 hours ago        Up 29 minutes                                                                      docksal-ssh-agent
91ce00eb35fa        docksal/dns:1.0           "/opt/entrypoint.sh …"   44 hours ago        Up 29 minutes             192.168.64.100:53->53/udp                                docksal-dns
ae867cca0f21        docksal/vhost-proxy:1.1   "docker-entrypoint.s…"   44 hours ago        Up 29 minutes             192.168.64.100:80->80/tcp, 192.168.64.100:443->443/tcp   docksal-vhost-proxy
10:13:17 Alexei-Workstation /Users/alexei.chekulaev/Projects/SBD-MASTER
$ docker stop f0e36d3589d3
^C
10:16:03 Alexei-Workstation /Users/alexei.chekulaev/Projects/SBD-MASTER
$ docker kill f0e36d3589d3
^C
10:30:51 Alexei-Workstation /Users/alexei.chekulaev/Projects/SBD-MASTER

(You can see that minutes passed before I pressed Ctrl-C)

In another Terminal I tried to start another docker-compose project; this is what I saw in the output the first time:

$ docker-compose up
rm: can't remove '/.ssh/id_rsa.pub': Stale file handle
rm: can't remove '/.ssh/authorized_keys': Stale file handle
rm: can't remove '/.ssh/id_rsa2.pub': Stale file handle
rm: can't remove '/.ssh/known_hosts': Stale file handle
rm: can't remove '/.ssh/id_test': Stale file handle
rm: can't remove '/.ssh/id_test.pub': Stale file handle
rm: can't remove '/.ssh/id_rsa2': Stale file handle
rm: can't remove '/.ssh/id_dsa': Stale file handle
rm: can't remove '/.ssh/id_boot2docker': Stale file handle
rm: can't remove '/.ssh/id_sbd.pub': Stale file handle
rm: can't remove '/.ssh/id_sbd': Stale file handle
rm: can't remove '/.ssh/id_rsa': Stale file handle
rm: can't remove '/.ssh/id_boot2docker.pub': Stale file handle
rm: can't remove '/.ssh': Directory not empty
Starting services...
Creating network "demonodb_default" with the default driver
Creating demonodb_cli_1 ... done
Creating demonodb_cli_1 ... 
Creating demonodb_web_1 ... done

The other project started fine, but with the errors about stale file handles shown above. Subsequent stops and starts of that project did not throw any errors and worked fine.

These files are on a named volume. The volume is mounted as ro in docker-compose, so I'm not sure why there are "can't remove" messages.

Restarting the Docker daemon solves the issue… temporarily. I forgot to run docker inspect and have already restarted the daemon, but I think inspect would just hang like stop and kill do.

UPDATE: I wanted to note that the container with issues has a healthcheck on it. It looks like this might be the culprit.

We are also experiencing a non-responsive docker daemon on some commands.

Currently I cannot run:

  • docker rmi
  • docker system prune -f
  • docker exec
  • docker logs

This happens on multiple engines, all running 17.12.

Seems related to https://github.com/moby/moby/issues/35408

It seems like 18.03.1 has fixed the issue for me. I have been using it locally for a week and did not experience the issue, which was otherwise easily reproducible within a day.

It is interesting because, for my original issue, updating to 18.02 was the solution. Well, at least so far so good.

In the last (almost) 5 days (that is when I upgraded to Docker CE 18.03.0), I did not encounter the issue.

It does not mean it is solved in 18.03.0; it is too early to tell. But at least it is occurring less often. Before, I had the problem at least every 2 or 3 days. 🤞

Hi @cpuguy83, I had to reboot the host (before I saw your message), because restarting docker.service did not work, and killing the processes did not help with restarting the containers afterwards. So I went through a complete reboot cycle rather than fiddling around until I got back to a clean state.

So I need to wait for the next lock-up before I can report the stack dump for docker-containerd-shim. I'm now on 18.03.0 though…

Anyway thanks for getting back quickly to me 😃

@jcberthon Thanks, at first glance this seems like the same issue as above. To get a stack dump from docker-containerd-shim, run kill -s SIGUSR1 <docker-containerd-shim-pid>. This should generate a stack trace in the logs for dockerd.

I'm stuck with that problem again, this time when trying to upgrade from 17.12.1 to 18.03.0. The upgrade process is stuck; most containers are still running (the applications are still up and running, but docker ps hangs).

I’ve done a dump of the docker-containerd socket, here is the gist: https://gist.github.com/jcberthon/143c3e6b7c9e5fc8f18c9204ca1bedf6

I do not know how to do a dump of docker-containerd-shim.

@mhaamann Thanks! Digging deeper…

This looks like it is stuck getting the state of the container from the shim process. Are you able to trigger a stack trace on the shim? kill -SIGUSR1 ${PID_OF_SHIM} should generate a stack trace and propagate it up to the dockerd logs. You should be able to figure out what the pid is, as it is the parent process of the container process.
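
A sketch of how the shim PID described here can be located when docker inspect itself hangs (<container_pid> stands for the container's main process as seen in ps or top; nothing below is specific to this report):

# The docker-containerd-shim is the parent of the container's main process on the host.
CONTAINER_PID=<container_pid>                     # e.g. found via: ps aux | grep <container_command>
SHIM_PID=$(ps -o ppid= -p "$CONTAINER_PID" | tr -d ' ')
sudo kill -SIGUSR1 "$SHIM_PID"                    # stack trace should appear in the dockerd logs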

@mauriceteunissen we have the issue with 17.12.1-ce

Does https://github.com/moby/moby/pull/36097 (added to yesterday’s release) fix this issue?

@PhilPhonic yes, can be triggered by healthchecks

Update - I've set up a new 3-node cluster (same VM template) and manually installed RC 1 of docker-18.02.0-ce (https://download.docker.com/linux/static/test/x86_64/docker-18.02.0-ce-rc1.tgz) and have not been able to reproduce the problem. In addition, thanks to #35891, I no longer see the Unknown container message in my logs, and all my undefined volumes are also getting removed. I'm going to do some more testing to try to isolate which binary (or binaries) has the fix.

@cpuguy83 sorry, I understood you just wanted the log independently of whether or not it was failing at the moment.

The compose file I'm using at the moment has 36 containers. I tried to reproduce the issue by simply running docker-compose up and docker-compose down. The first time it worked fine, but the second time 3 containers remained "up" and all the others remained "exited". Here's the output of the log:

docker_debug.txt

This is the error reported by docker-compose down:

ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information. If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).
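
For completeness, the client-side timeout mentioned in that error can be raised as shown below (the value is arbitrary; it only gives the compose client more patience and does not address the underlying hang):

export COMPOSE_HTTP_TIMEOUT=300    # seconds; default is 60
docker-compose down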

One thing I noticed is that it seems to be just one container blocking the others. In particular, in this case the 3 containers that weren't stopped were postgres, etcd, and a helper to configure etcd. However, it looks like it's postgres blocking the others. For instance, I can run docker inspect etcd and it works, but docker inspect postgres times out.

Note that this is just an example from this specific case. I'm not saying postgres is always the one to blame; maybe next time it happens it will be redis or rabbitmq.

Also, it happens using swarm as well.

The original issue was with 17.12

Regarding the original issue, I reproduced it once again and I cannot docker inspect the container; it just hangs, as do all other commands.