moby: ISSUE: Can't stop containers (sometimes but often enough to test)

I run a cluster where Docker containers sometimes become “unstoppable”. This happens a few times per day across the cluster, and when it does, the only remedy is to stop the Docker daemon, restart the host machine, and start the daemon again.

A regular stop command does not do the job (even when waiting for up to an hour):

# time docker stop --time=1 950677e2317f
^C
real    0m13.508s
user    0m0.036s
sys     0m0.008s

The daemon does appear to escalate the stop correctly (SIGTERM first, then SIGKILL after the timeout):

# journalctl -fu docker.service
-- Logs begin at Fri 2015-12-11 15:40:55 CET. --
Dec 31 23:30:33 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:33.164731953+01:00" level=info msg="POST /v1.21/containers/950677e2317f/stop?t=1"
Dec 31 23:30:34 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:34.165531990+01:00" level=info msg="Container 950677e2317fcd2403ef5b5ffad37204e880136e91f76b0a8682e04a93e80942 failed to exit within 1 seconds of SIGTERM - using the force"
Dec 31 23:30:44 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:44.165954266+01:00"

The process that seems to be blocking the container stop can be seen on the host machine:

# ps aux | grep [1]1991
root     11991 84.3  0.0   5836   132 ?        R    Dec30 1300:19 bash -c (echo stop > /tmp/minecraft &)
# top -b | grep [1]1991
11991 root      20   0    5836    132     20 R  89.5  0.0   1300:29 bash

Please note that it is a running ([R]) process, not a [Z]ombie. Also note that it is consuming ~84% of a CPU, so it is actively spinning; simply leaving it alone is not an option.
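For anyone who wants to dig further, the scheduling state and kernel wait channel of such a process can be read straight out of /proc. A small sketch (using the current shell's own PID, $$, as a stand-in for the stuck PID 11991, so it can be run anywhere):

```shell
# Stand-in PID: the current shell. Substitute the stuck PID (e.g. 11991).
PID=$$

# Name and state (R = running, Z = zombie, D = uninterruptible sleep)
grep -E '^(Name|State)' /proc/$PID/status

# Kernel function the process is waiting in, if any ("0" when runnable)
cat /proc/$PID/wchan; echo
```

For the stuck processes above, State should show "R (running)" and wchan should be empty/0, consistent with a busy loop rather than a blocked syscall.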

Some base information about my setup:

# docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:20:08 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:20:08 UTC 2015
 OS/Arch:      linux/amd64

# docker info
Containers: 189
Images: 322
Server Version: 1.9.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 700
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.0-19-generic
Operating System: Ubuntu 15.10
CPUs: 24
Total Memory: 125.8 GiB
Name: m3561.contabo.host
ID: ZM2Q:RA6Q:E4NM:5Q2Q:R7E4:BFPQ:EEVK:7MEO:YRH6:SVS6:RIHA:3I2K

# uname -a
Linux m3561.contabo.host 4.2.0-19-generic #23-Ubuntu SMP Wed Nov 11 11:39:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Worth noting: stopping the Docker daemon does not resolve the issue, as the process keeps running even after the daemon is stopped. This may indicate that the problem lies in the kernel rather than in Docker itself.

I’ve seen 20+ of these “stuck processes”, and the only common theme I have noticed is that they always seem to involve some form of piping or redirection of data. Not sure if this is helpful in understanding what’s going on here.

Happy to test anything to resolve this. kill -9 on the host machine does not work. I can’t run commands inside the container using docker exec, so I can’t try killing the process from within. I can get plenty of information about the process from /proc/; just say what information you want and I’ll get it. Unfortunately, I can’t reproduce the issue on demand (yet), but I regularly (every day) catch it in my live environment. Updating to a different Docker version just to “test if it solves it” is somewhat costly, as it would mean upgrading the whole cluster; if there is a good hypothesis that it would solve the issue I’ll go through the work to do it, otherwise my plan is to wait for the next stable release before updating.

Really appreciate any help on this, as the issue is severely lowering the reliability of the cluster: it requires several restarts of entire nodes per day. Big thanks!

(This Stack Overflow post is a duplicate.)

UPDATE: (adding more examples of stuck processes)

# ps aux | grep [1]8342
root     18342 92.0  0.0   5836   132 ?        R     2015 2279:07 bash -c (echo stop > /tmp/minecraft &)
# ps aux | grep [3]1572
root     31572 95.5  0.0   4448   104 ?        R     2015 2946:35 /bin/sh -c (redis-server &>redis.log &) && ./setup-wait.sh && sleep 3 && ./nodebb start && ./nodebb log && sleep infinity
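Both stuck command lines share the same shape: a shell -c wrapper whose body immediately backgrounds a subshell that redirects output into a file. A minimal stand-alone version of that shape, stripped of Docker and AUFS, is below; on an unaffected kernel it finishes instantly, and whether it can trigger the hang outside a container is an open question, so this is only an illustration of the pattern, not a reproducer:

```shell
# Same shape as the stuck command lines above, minus Docker/AUFS.
# On a healthy kernel this completes immediately and the file holds
# "stop"; in the failure mode the bash wrapper instead spins at ~90% CPU.
tmpfile=$(mktemp)
bash -c "(echo stop > $tmpfile &)"
sleep 0.2          # give the disowned background subshell time to finish
cat "$tmpfile"     # expected: stop
rm -f "$tmpfile"
```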

/beetree

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 22 (10 by maintainers)

Most upvoted comments

Install the latest linux-generic-lts-vivid and/or linux-generic-lts-wily packages and the issue is fixed. The packages contain the fixes from the AUFS maintainer:

Check: apt-get changelog linux-image-4.2.0-30-generic / apt-get changelog linux-image-3.19.0-51-generic

[ J. R. Okajima ]

  • SAUCE: ubuntu: aufs: tiny, extract a new func xino_fwrite_wkq()
  • SAUCE: ubuntu: aufs: for 4.3, XINO handles EINTR from the dying process

The commit is here: http://kernel.ubuntu.com/git/ubuntu/ubuntu-wily.git/commit/?id=268afce0cdf5f0549131c59721fadce065dac2f0
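For reference, on an affected Ubuntu host the upgrade sketched above amounts to something like the following (which package applies depends on your release; linux-generic-lts-wily is shown here as an example, and a reboot is required to actually run the fixed kernel):

```shell
# Pull in the kernel package carrying the AUFS XINO/EINTR fixes
# listed in the changelog above, then reboot into it.
sudo apt-get update
sudo apt-get install linux-generic-lts-wily
sudo reboot
```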