moby: ISSUE: Can't stop containers (sometimes but often enough to test)
I run a cluster where docker containers sometimes become “unstoppable”. It happens a few times per day across the cluster, and when it happens the only solution is to stop the docker daemon, restart the host machine and then start the docker daemon again.
A regular stop command does not do the job (even when waiting for up to an hour):
# time docker stop --time=1 950677e2317f
^C
real 0m13.508s
user 0m0.036s
sys 0m0.008s
The daemon seems to properly escalate the stop of the container:
# journalctl -fu docker.service
-- Logs begin at Fri 2015-12-11 15:40:55 CET. --
Dec 31 23:30:33 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:33.164731953+01:00" level=info msg="POST /v1.21/containers/950677e2317f/stop?t=1"
Dec 31 23:30:34 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:34.165531990+01:00" level=info msg="Container 950677e2317fcd2403ef5b5ffad37204e880136e91f76b0a8682e04a93e80942 failed to exit within 1 seconds of SIGTERM - using the force"
Dec 31 23:30:44 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:44.165954266+01:00"
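The escalation in the log matches what docker stop --time=1 does: send SIGTERM, wait out the grace period, then send SIGKILL. A minimal sketch of that sequence in plain POSIX sh (stop_escalate is a hypothetical helper, not part of Docker; on an affected host even the final SIGKILL fails to reap the R-state process):

```shell
#!/bin/sh
# Sketch of docker stop's escalation: SIGTERM, wait up to a timeout
# in seconds, then SIGKILL. Hypothetical helper for illustration only.
stop_escalate() {
    pid=$1
    timeout=$2
    kill -TERM "$pid" 2>/dev/null
    waited=0
    while [ "$waited" -lt "$timeout" ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        waited=$((waited + 1))
    done
    # Still alive after the grace period: use the force.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"
    fi
}

# e.g. stop_escalate 11991 1
```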
The process that seems to be blocking the container stop can be seen on the host machine:
# ps aux | grep [1]1991
root 11991 84.3 0.0 5836 132 ? R Dec30 1300:19 bash -c (echo stop > /tmp/minecraft &)
# top -b | grep [1]1991
11991 root 20 0 5836 132 20 R 89.5 0.0 1300:29 bash
Please note that it is not a [Z]ombie process but a [R]unning one. Also note that it’s consuming 84% of the CPU, so it is actively doing something, and just leaving it alone is not an option.
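For any similar stuck PID, the kernel-side view can be read straight from /proc. A small sketch (the PID argument defaults to the current shell only so the snippet runs anywhere; substitute the stuck PID, e.g. 11991, on an affected host):

```shell
#!/bin/sh
# Read a task's scheduler state and blocking point from /proc.
# PID defaults to the current shell just so this runs as-is.
PID=${1:-$$}

# Single-letter state: R running, S/D sleeping, Z zombie, T stopped
state=$(awk '/^State:/ { print $2 }' "/proc/$PID/status")
echo "state=$state"

# Kernel function the task is waiting in; "0" or "-" means runnable
echo "wchan=$(cat "/proc/$PID/wchan")"

# The kernel stack (root only) is the most useful bit for a suspected
# kernel bug:  cat /proc/$PID/stack
```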
Some base information about my setup:
# docker version
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:20:08 UTC 2015
OS/Arch: linux/amd64
Server:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:20:08 UTC 2015
OS/Arch: linux/amd64
# docker info
Containers: 189
Images: 322
Server Version: 1.9.1
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 700
Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.0-19-generic
Operating System: Ubuntu 15.10
CPUs: 24
Total Memory: 125.8 GiB
Name: m3561.contabo.host
ID: ZM2Q:RA6Q:E4NM:5Q2Q:R7E4:BFPQ:EEVK:7MEO:YRH6:SVS6:RIHA:3I2K
# uname -a
Linux m3561.contabo.host 4.2.0-19-generic #23-Ubuntu SMP Wed Nov 11 11:39:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Worth noting is that stopping the docker daemon does not solve the issue: the process keeps running even after the daemon is stopped. This may indicate that the issue is not related to Docker but rather to the kernel.
I’ve seen 20+ of these “stuck processes” and the only common “theme” I have noticed is that they always seem to involve some form of piping of data. Not sure if this is helpful in trying to understand what’s going on here.
Happy to test anything to resolve this. kill -9 on the host machine doesn’t solve it. I can’t run commands inside the container using docker exec, so I can’t try killing the process from within. I can get plenty of information about the process from /proc/; just say what information you want and I’ll get it. Unfortunately, I can’t reproduce the issue (yet), but I regularly (every day) catch it in my live environment. Updating to a different Docker version to “test if it solves it” is somewhat costly, as it would involve updating the whole cluster. If there is a good hypothesis that it will solve the issue I’ll go through the work to do it; otherwise my plan is to wait for the next stable release before updating.
Really appreciate any help on this, as the issue is severely lowering the reliability of the cluster: it requires several restarts of entire nodes per day. Big thanks!
(This Stack Overflow post is a duplicate.)
UPDATE: (adding examples of crashed processes)
# ps aux | grep [1]8342
root 18342 92.0 0.0 5836 132 ? R 2015 2279:07 bash -c (echo stop > /tmp/minecraft &)
# ps aux | grep [3]1572
root 31572 95.5 0.0 4448 104 ? R 2015 2946:35 /bin/sh -c (redis-server &>redis.log &) && ./setup-wait.sh && sleep 3 && ./nodebb start && ./nodebb log && sleep infinity
/beetree
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 22 (10 by maintainers)
Install the latest linux-generic-lts-vivid and/or linux-generic-lts-wily packages and the issue is fixed. The packages contain the fixes from the AUFS maintainer:
Check: apt-get changelog linux-image-4.2.0-30-generic / apt-get changelog linux-image-3.19.0-51-generic
[ J. R. Okajima ]
The commit is here: http://kernel.ubuntu.com/git/ubuntu/ubuntu-wily.git/commit/?id=268afce0cdf5f0549131c59721fadce065dac2f0
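To check whether a given host already runs a fixed build, comparing the kernel ABI number against 4.2.0-30 (the build named above) is enough for the wily series. A hedged sketch (check_aufs_fix is a hypothetical helper; the 3.19.0-51 threshold for the vivid series would need the same treatment):

```shell
#!/bin/sh
# Hypothetical helper: report whether a 4.2.0-series Ubuntu kernel is
# at least ABI 30, the build whose changelog lists the AUFS fixes.
check_aufs_fix() {
    krel=$1                           # e.g. "4.2.0-30-generic"
    case "$krel" in
        4.2.0-*)
            abi=${krel#4.2.0-}        # strip "4.2.0-" prefix
            abi=${abi%%-*}            # keep the ABI number only
            if [ "$abi" -ge 30 ]; then
                echo fixed
            else
                echo vulnerable
            fi
            ;;
        *)  echo unknown ;;           # other series: check the changelog
    esac
}

check_aufs_fix "$(uname -r)"
```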