moby: Can't stop docker container
Description
Can’t stop container.
I’m starting and removing containers concurrently using docker-compose. Sometimes it fails to remove the containers.
I checked that I can’t `docker stop` the container. The command hangs, and after switching the docker daemon to debug mode I only see this line when I run the command:
dockerd[101922]: time="2018-01-04T15:54:07.406980654Z" level=debug msg="Calling POST /v1.35/containers/4c2b5e7f466c/stop"
Steps to reproduce the issue:
- Run tests in Jenkins.
- Eventually it fails to remove containers (a rough loop that exercises the same concurrent start/stop pattern is sketched below).
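A minimal sketch of the kind of concurrent docker-compose start/stop load described above, assuming a compose project in the current directory (project names and iteration counts are arbitrary):

```bash
#!/usr/bin/env bash
# Rough repro attempt: run several copies of the compose project up and down
# in parallel to exercise concurrent container creation and removal.
set -u

for i in 1 2 3; do
    (
        for run in $(seq 1 20); do
            docker-compose -p "stress_$i" up -d
            docker-compose -p "stress_$i" down --timeout 10
        done
    ) &
done
wait
```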
Describe the results you received:
Can’t stop container.
Describe the results you expected:
The container should have been stopped, and then removed.
Additional information you deem important (e.g. issue happens only occasionally):
Issue happens only occasionally
Output of `docker version`:
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:10:14 2017
OS/Arch: linux/amd64
Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:12:46 2017
OS/Arch: linux/amd64
Experimental: false
Output of `docker info`:
Containers: 6
Running: 1
Paused: 0
Stopped: 5
Images: 75
Server Version: 17.12.0-ce
Storage Driver: devicemapper
Pool Name: docker-253:0-33643212-pool
Pool Blocksize: 65.54kB
Base Device Size: 10.74GB
Backing Filesystem: xfs
Udev Sync Supported: true
Data file: /dev/loop0
Metadata file: /dev/loop1
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Data Space Used: 31.43GB
Data Space Total: 107.4GB
Data Space Available: 75.95GB
Metadata Space Used: 35.81MB
Metadata Space Total: 2.147GB
Metadata Space Available: 2.112GB
Thin Pool Minimum Free Space: 10.74GB
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 1
Library Version: 1.02.140-RHEL7 (2017-05-03)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 89623f28b87a6004d4b785663257362d1658a729
runc version: b2567b37d7b75eb4cf325b77297b140ea686ce8f
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-693.11.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 36
Total Memory: 117.9GiB
Name: jenkins-node.com
ID: 5M6L:G2KF:732H:Y7RF:QHNO:3XM4:U6RV:U5QR:ANPA:7XRZ:M3S4:GUZC
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 37
Goroutines: 51
System Time: 2018-01-04T16:02:36.54459153Z
EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: devicemapper: usage of loopback devices is strongly discouraged for production use.
Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
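For reference, a sketch of what moving off the loopback setup might look like, assuming an LVM thin pool has already been prepared at /dev/mapper/docker-thinpool (the device name is an assumption; creating the pool itself is not shown, and /var/lib/docker must be recreated after the change):

```bash
# Sketch only: point the devicemapper driver at a dedicated thin pool
# instead of the loopback files warned about above.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.thinpooldev=/dev/mapper/docker-thinpool",
    "dm.use_deferred_removal=true",
    "dm.use_deferred_deletion=true"
  ]
}
EOF
sudo systemctl restart docker
```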
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 35
- Comments: 146 (48 by maintainers)
Commits related to this issue
- Docker: Specified stable docker version - Should prevent issues mentioned here (https://github.com/moby/moby/issues/35933) from happening. - Will update the version to latest once a stable release is... — committed to desimaniac/Cloudbox by desimaniac 6 years ago
- Update to Docker 18.03.1 Update Docker to stable version which contains fix for https://github.com/moby/moby/issues/35933 Docker-ce 18.03.1 includes the commit which fixes this containerd issue, p... — committed to stayclassychicago/teamcity-docker-agent by stayclassychicago 6 years ago
- Docker: Specified stable docker version - Should prevent issues mentioned here (https://github.com/moby/moby/issues/35933) from happening. - Will update the version to latest once a stable release is... — committed to Cloudbox/Cloudbox by desimaniac 6 years ago
- Use Docker 18.03.1 on CI instead of 17.12.1. 17.12.1 has an issue where it will randomly freeze (see https://github.com/moby/moby/issues/35933). If this resolves the issue running the tests on CI, w... — committed to batect/batect by charleskorn 5 years ago
I get the same issue, though without using docker-compose. I’m using docker swarm. Same thing though: I occasionally get containers that neither docker swarm nor I with the docker CLI can stop. This causes docker swarm to end up collecting more replicas than desired that it can’t scale down. Sometimes these replicas can still service requests and receive traffic. The only way to remove the containers is to restart docker on the affected node.
The “stable” channel 17.12.0 version still has this bug; if it’s fixed, could that PR be back-ported to a patch release 17.12.1? The stable channel is pretty unstable, if people are having to revert all the way to 17.09 or resort to an edge release.
Thank you all for confirming; I’ll go ahead and close this issue.
If you still run into this on Docker 18.03.1 or above; please open a new issue with details
It’s being worked on. Thanks!
Sorry to warm up this thread - it looks like the fix is coming 😉 - but I have a quick question: we’re seeing the exact same issue on docker-ce-17.12 since we added HEALTHCHECKs to our Dockerfiles. The containers without a HEALTHCHECK specified in their Dockerfiles stop just fine. Could this be related to the HEALTHCHECKs, or is this just a coincidence?
Cheers, Phil
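For context, a minimal sketch of the kind of health-checked container being discussed, using `docker run` health flags as a stand-in for a Dockerfile HEALTHCHECK (the image, check command, and intervals are arbitrary assumptions):

```bash
# Start a container with a health check roughly equivalent to a Dockerfile
# HEALTHCHECK; several commenters report that only containers configured
# like this end up unstoppable. Image and check command are placeholders.
docker run -d --name healthcheck-demo \
  --health-cmd='wget -qO- http://127.0.0.1/ || exit 1' \
  --health-interval=10s \
  --health-timeout=5s \
  --health-retries=3 \
  nginx:alpine

# Watch the reported health state, then try the stop that reportedly hangs.
docker inspect --format '{{.State.Health.Status}}' healthcheck-demo
docker stop healthcheck-demo
```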
@timdau We are still on 17.09 in production because this is the most stable version for us due to these “unstoppable containers”
We have stopped using 17.12 completely and rolled back to 17.09 because of this problem on 17.12 (macOS and apparently Linux as well).
This is a critical, persistent problem.
And unfortunately I have not found a way to recreate it except by using docker a lot.
I have the same issue with docker swarm. I remove one of multiple docker stacks, but only some of the containers in the stack are removed, while some containers hang around. `docker inspect` or `docker rm` on the hung containers just hangs on the command line until I Ctrl-C. I need to reboot to get the containers removed. I did not have the issue on 17.09, only after upgrading to 17.12.0-ce (I also had the problem on 17.12.0-ce-rc4). I have the issue on an Azure VM:
`docker info`
I also have the same issue on Docker for Mac (Edge: 17.12):
`docker info`
@mavogel I had the same problem with freezing docker containers. The solution for me was that if I move logging from /dev/stderr to an internal file inside the docker container, then the problem goes away. Probably there is some disk issue when a container logs to /dev/stderr, and that is probably the case for most of these problems.
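As a rough illustration of that workaround, one way to send a service’s output to a file inside the container instead of the stdio streams collected by the logging driver (the image, command, and log path are made-up placeholders):

```bash
# Placeholder image/command: wrap the entrypoint so output is appended to a
# file inside the container rather than going to /dev/stdout and /dev/stderr.
docker run -d --name app-file-logging my-app:latest \
  sh -c 'exec /usr/local/bin/my-app >>/var/log/my-app.log 2>&1'
```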
I’m experiencing the same issue in multiple servers using 17.12. As @rfay said, it didn’t happen on 17.09.
Checking the changelog, a major difference between 17.12 and 17.09 is that, since 17.11, Docker is based on containerd. So, as the evidence seems to indicate this is an issue in the runtime, maybe it would be good to investigate down this path.
We also have the same issue on 17.12.1-ce
Over time, containers enter a state where `docker ps` and `docker inspect` hang. Forcing the swarm to redeploy the service makes the container enter a zombie state (Desired state: Shutdown, Current status: Running). `docker kill` does not work. One way to kill the container is `ps aux | grep [container_id]` and then `kill [process_id]` (a rough sketch of this workaround follows below).
Is there any information needed that I can provide?
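A sketch of that last-resort workaround, assuming the stuck container’s ID appears on the shim’s command line; note this kills the process out from under Docker, so a daemon restart may still be needed to clean up the container record:

```bash
#!/usr/bin/env bash
# Last-resort workaround sketch: when docker kill / docker inspect hang,
# locate the container's shim on the host and kill the shim's child
# (the container's main process). CONTAINER_ID is a placeholder.
set -u
CONTAINER_ID="4c2b5e7f466c"

# The docker-containerd-shim process carries the full container ID on its
# command line, so the short ID matches it as a prefix.
SHIM_PID=$(pgrep -f "docker-containerd-shim.*${CONTAINER_ID}" | head -n1)
echo "shim pid: ${SHIM_PID}"

# The container's main process is a child of the shim.
CONTAINER_PID=$(pgrep -P "${SHIM_PID}" | head -n1)
echo "container pid: ${CONTAINER_PID}"

sudo kill "${CONTAINER_PID}"        # try SIGTERM first
# sudo kill -9 "${CONTAINER_PID}"   # escalate if it is ignored
```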
+1 for a patch release 17.12.1
The 18.03.1 version seems to have fixed this issue (or mitigated it, as @cpuguy83 said).
I tested it in 4 clusters.
There are multiple issues at play here and there have been multiple fixes that take care of different areas.
17.12.1 has been out for some time now. It doesn’t fix all issues but it does fix some. Please update. There are other fixes available in 18.03.0, but it may be worth waiting for 18.03.1 which should be out soon.
This issue is still open because we understand it’s not fixed and it is being worked on. If you want to help, there are a number of ways to contribute outside of narrowing down cases… e.g. specific/consistent repro steps, stack traces from an updated docker instance (stack traces from containerd and a containerd-shim are also helpful), etc.
Coming on here and making false claims and silly posturing is not helpful at all.
The same happens to me on docker-CE 17.12.0 (in 3 clusters); I am rolling back to 17.09. It’s incredible that Docker now has this kind of critical bug in two LTS versions and doesn’t fix it… I understand that it may be difficult to reproduce, but this is happening to a lot of people…
Is it because there is now an EE version, and the efforts now go into that EE 2.2.x version (Docker 17.06.x)?
@jcberthon really curious about the result, because I’m seeing people who have problems with 18.03.0 as well. @JnMik We decided to downgrade to 17.09.1 until this issue is resolved, since it was happening way too often on 17.12 and 18.02.
@cpuguy83 Is there a list somewhere of all of the issues related to this problem? That way we can know for sure when this issue is resolved and it’s safe to upgrade.
That doc is out of date; you can see here: https://github.com/docker/docker-ce/releases/tag/v18.03.1-ce, released 11 days ago.
I’ve made some changes to my infrastructure to afford myself the luxury of being able to spend some time collecting logs/information the next time that this happens on my production systems.
I’m currently on Ubuntu 16.04.4 LTS running docker-ce 18.03.1 and Linux Kernel 4.13.0-39-generic x86_64.
Can someone confirm that this is all of the information that would need to be collected in order to provide enough information to help troubleshoot this issue?
- `docker inspect {container-id} > docker-inspect-container.log`
- `ps -aux | grep {container-id}` to get the docker-containerd-shim pid
- `kill -s SIGUSR1 {docker-containerd-shim-pid}` (this should generate a stack trace in the logs for dockerd)
- `sudo journalctl -u docker.service --since today > docker-service-log.txt`
- `docker info`
- `docker version`
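A sketch of those collection steps wrapped into one script, assuming the stuck container’s ID is known and the shim process name matches docker-containerd-shim (the file names, the pgrep shortcut, and the timeout are my own additions):

```bash
#!/usr/bin/env bash
# Sketch: gather the troubleshooting information listed above for one
# stuck container. Usage: ./collect.sh <container-id>
set -u
CONTAINER_ID="$1"

docker version > docker-version.log
docker info    > docker-info.log

# docker inspect may itself hang on a stuck container, so bound it.
timeout 60 docker inspect "$CONTAINER_ID" > docker-inspect-container.log

# Equivalent to `ps -aux | grep {container-id}`: find the shim pid, then
# ask it for a stack trace, which lands in the dockerd logs.
SHIM_PID=$(pgrep -f "docker-containerd-shim.*${CONTAINER_ID}" | head -n1)
echo "docker-containerd-shim pid: ${SHIM_PID}"
sudo kill -s SIGUSR1 "${SHIM_PID}"

# Give the daemon a moment to write the trace, then capture its journal.
sleep 5
sudo journalctl -u docker.service --since today > docker-service-log.txt
```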
@victorvarza see the earlier comments: https://github.com/moby/moby/issues/35933#issuecomment-378957035 - if you’re on 17.12; at least upgrade to 17.12.1, but given that 17.12 reached EOL, consider 18.03 (but you may want to wait for 18.03.1, which will have some fixes)
We are also sticking to 17.09.1 because newer versions are not working for us.
I think I have a related problem.
I upgraded our dev environment to the latest 17.12.1-ce, build 7390fc6, last week, and it’s the first time I have seen this error.
A developer tried to update an application, and swarm is unable to delete an old container of the previous version on a specific node of the cluster. I found out because developers started complaining about an intermittent white-page syndrome.
When I do a `docker service ps` on the service, here’s what I see: https://www.screencast.com/t/LXAfmddRDp The old container is running but in a shutdown state.
On the node, I see the container as if it were running healthily: https://www.screencast.com/t/ABKVYxNUQ
And from `docker service ls`, I have more containers than expected: https://www.screencast.com/t/0Po8Sqs0Jr
I tried running `docker kill` and `docker inspect` on the container from the node, but it’s not working. I do not see any specific message in dmesg.
That’s all I can tell for now; I’ll remove the stack and launch it again so the developers are able to continue their work.
Hope it helps
EDIT:
I saw some errors like this on the node during the process:
Mar 13 10:04:10 server-name dockerd: time="2018-03-13T10:04:10.406196465-04:00" level=error msg="Failed to load container f5d6bb74d6b37871b72b5f27d46f8705a6b66cba7afb50706bbf68b764facb24: open /var/lib/docker/containers/f5d6bb74d6b37871b72b5f27d46f8705a6b66cba7afb50706bbf68b764facb24/config.v2.json: no such file or directory"
Mar 13 10:04:10 server-name dockerd: time="2018-03-13T10:04:10.408039262-04:00" level=error msg="Failed to load container fd5ac869991b263a28c36bddf9b2847a8a26e2b7d59fa033f85e9616b0b7cb7a: open /var/lib/docker/containers/fd5ac869991b263a28c36bddf9b2847a8a26e2b7d59fa033f85e9616b0b7cb7a/config.v2.json: no such file or directory"
EDIT2: Found somebody else with the same issue : https://github.com/moby/moby/issues/36553
I experience the same bug. It is not consistent though. I don’t see a pattern yet but it does happen.
I am running Docker for Mac Version 17.12.0-ce-mac46 (21698). I am not running Docker in Docker.
The container is created by `docker-compose up`. Yes, I can see that the container is still running, but `stop` or `kill` just hangs and does nothing. (You can see that minutes passed before I pressed Ctrl-C.)
In another terminal I tried to start another docker-compose project; this is what I saw in the output the first time:
The other project started fine, but with the errors about stale file names above. Subsequent stops and starts of that project did not throw any errors and worked fine.
These files are on a named volume. The volume is mounted as `ro` in docker-compose, so I’m not sure why there are “can’t remove” messages. Restarting the Docker daemon solves the issue… temporarily. I forgot to do `docker inspect` and have already restarted the daemon, but I think `inspect` would just hang like `stop` and `kill` do.
UPDATE: I wanted to note that the container with issues has a healthcheck on it. Looks like this might be the culprit.
We are also experiencing a non-responsive docker daemon on some commands. Currently I cannot run:
- `docker rmi`
- `docker system prune -f`
- `docker exec`
- `docker logs`
This happens on multiple engines, all running 17.12.
Seems related to https://github.com/moby/moby/issues/35408
Seems like 18.03.1 has fixed the issue for me. I have been using it for a week locally and have not experienced the issue, which was otherwise easily reproducible within a day.
It is interesting, because for my original issue updating to 18.02 was the solution. Well, at least so far so good.
In the last (almost) 5 days (that is, since I upgraded to Docker CE 18.03.0), I have not encountered the issue.
It does not mean it is solved in 18.03.0; it is too early to tell. But at least it is occurring less often. Before, I had the problem at least every 2 or 3 days. 🤞
Hi @cpuguy83, I had to reboot the host (before I saw your message), because restarting docker.service did not work and killing the processes did not help restart the containers afterwards. So I went through a complete reboot cycle rather than fiddling around until I got back to a clean state.
So I need to wait for next lock-up before I can report the stack dump for docker-containerd-shim. I’m now on 18.03.0 though…
Anyway, thanks for getting back to me quickly 😃
@jcberthon Thanks, this seems like the same issue as above on first look. To get a stack dump from docker-containerd-shim, do `kill -s SIGUSR1 <docker-containerd-shim-pid>`. This should generate a stack trace in the logs for dockerd.
I’m again stuck with that problem, this time when trying to upgrade from 17.12.1 to 18.03.0. The upgrade process is stuck; most containers are still running (the applications are still up and running, but `docker ps` is stuck). I’ve done a dump of the docker-containerd socket, here is the gist: https://gist.github.com/jcberthon/143c3e6b7c9e5fc8f18c9204ca1bedf6
I do not know how to do a dump of docker-containerd-shim.
@mhaamann Thanks! Digging deeper…
This looks like it is stuck getting the state of the container from the shim process. Are you able to trigger a stack trace on the shim? `kill -SIGUSR1 ${PID_OF_SHIM}` should generate a stack trace and propagate up to the dockerd logs. You should be able to figure out what the pid is, as it is the parent process of the container process (a sketch of this lookup follows below).
@mauriceteunissen we have the issue with 17.12.1-ce
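A sketch of that PID lookup, assuming `docker inspect` still answers for the container (if it also hangs, the shim can instead be found by grepping the process table for the container ID); CONTAINER_ID is a placeholder:

```bash
#!/usr/bin/env bash
# Sketch: find the docker-containerd-shim PID as the parent of the
# container's main process, then ask it for a stack trace via SIGUSR1.
set -u
CONTAINER_ID="4c2b5e7f466c"   # placeholder

# The container's init PID as recorded by the daemon.
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' "$CONTAINER_ID")

# The shim is the parent of that process.
SHIM_PID=$(ps -o ppid= -p "$CONTAINER_PID" | tr -d ' ')
echo "docker-containerd-shim pid: ${SHIM_PID}"

# The resulting stack trace shows up in the dockerd logs
# (e.g. journalctl -u docker.service).
sudo kill -s SIGUSR1 "${SHIM_PID}"
```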
Does https://github.com/moby/moby/pull/36097 (added to yesterday’s release) fix this issue?
@PhilPhonic yes, it can be triggered by healthchecks.
Update - I’ve set up a new 3-node cluster (same VM template) and manually installed RC 1 of docker-18.02.0-ce (https://download.docker.com/linux/static/test/x86_64/docker-18.02.0-ce-rc1.tgz) and have not been able to reproduce the problem. In addition, thanks to #35891, I no longer see the “Unknown container” message in my logs, and all my undefined volumes are also getting removed. I’m going to do some more testing to try and isolate which binary(ies) has the fix.
@cpuguy83 sorry, I understood you just wanted the log independently of whether or not it was failing at the moment.
The compose I’m using at the moment has 36 containers. I tried to reproduce the issue by simply running `docker-compose up` and `docker-compose down`. The first time went fine, but the second time 3 containers remained “up” and all the others remained “exited”. Here’s the output of the log:
docker_debug.txt
This is the error reported by docker-compose down:
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information. If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).
One thing I noticed is that it seems to be just one container blocking the others. In particular, in this case, the 3 containers that weren’t stopped were postgres, etcd, and a helper to configure the etcd. However, it looks like it’s postgres blocking the others. For instance, I can run `docker inspect etcd` and it works, but `docker inspect postgres` fails with a timeout.
Note this is just an example of this specific case. I’m not saying postgres is always the one to blame. Maybe next time it happens, it will be redis or rabbitmq.
Also, it happens using swarm as well.
The original issue was with 17.12
Regarding the original issue, I reproduced it once again, and I cannot `docker inspect` the container; it just hangs, as do all other commands.