moby: docker container won't stop

I cannot force a container to stop:

$ sudo docker ps -a
CONTAINER ID        IMAGE                                       COMMAND             CREATED             STATUS              PORTS               NAMES
e732663decf0        sif-gpu                                     "./train.sh"        About an hour ago   Up About an hour                        gifted_brown
3c2d52a9dbae        aa51bb346558                                "./train.sh"        About an hour ago   Up About an hour                        romantic_haibt
9263207753ef        sockeye-gpu                                 "./train.sh"        2 hours ago         Up 2 hours                              jovial_northcutt
994875c39514        sockeye-gpu                                 "train.sh"          16 hours ago        Created                                 wizardly_borg
fbbad3f7140c        3ae77fec5f41                                "train.sh"          16 hours ago        Created                                 relaxed_darwin

$ sudo docker stop 3c2d52a9dbae
3c2d52a9dbae

$ sudo docker ps -a
CONTAINER ID        IMAGE                                       COMMAND             CREATED             STATUS              PORTS               NAMES
e732663decf0        sif-gpu                                     "./train.sh"        About an hour ago   Up About an hour                        gifted_brown
3c2d52a9dbae        aa51bb346558                                "./train.sh"        About an hour ago   Up About an hour                        romantic_haibt
9263207753ef        sockeye-gpu                                 "./train.sh"        2 hours ago         Up 2 hours                              jovial_northcutt
994875c39514        sockeye-gpu                                 "train.sh"          16 hours ago        Created                                 wizardly_borg
fbbad3f7140c        3ae77fec5f41                                "train.sh"          16 hours ago        Created                                 relaxed_darwin

and

$ sudo docker rm 3c2d52a9dbae
Error response from daemon: You cannot remove a running container 3c2d52a9dbaefe37233d0ac411955f7f37ccc5ee16843e34dd42074c98417441. Stop the container before attempting removal or use -f

where the image is being used by the hanging container

$ sudo docker rmi aa51bb346558
Error response from daemon: conflict: unable to delete aa51bb346558 (cannot be forced) - image is being used by running container 3c2d52a9dbae

After several attempts I get a device or resource busy error.
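
For reference, the usual escalation when docker stop has no effect, assuming the daemon itself still responds, is a direct kill or a forced removal (neither is guaranteed to work when the underlying process is stuck, as in this report):

sudo docker kill 3c2d52a9dbae
sudo docker rm -f 3c2d52a9dbae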

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 43 (14 by maintainers)

Most upvoted comments

The more recent reports of this are new issues related to #35933. I'm going to close this as it's stale; the newer issue is tracked in the mentioned issue.

Thanks! 👼 🙇

Experiencing the same issue on Arch Linux. sudo systemctl stop docker does not kill the docker-containerd-shim process.
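
One way to check for leftover shim processes after stopping the service (a sketch; the process name varies across Docker versions and distro packaging, and the [d] in the pattern keeps grep from matching itself):

ps -eo pid,ppid,cmd | grep '[d]ocker-containerd-shim'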

Hard to tell; I don't think there has been a way to reproduce the issue, so without that it's not possible to say whether things changed (or not). Docker 17.11 and 17.12 now use the containerd 1.0 runtime, with lots of enhancements, so if you have a reproducible case (and an environment for testing), it could be worth checking whether it still reproduces for you on 17.12.

Same issue as @KramKroc. When a process inside my container (not even the entry point) is killed by the kernel (a memory-cgroup out-of-memory kill, which is what I was trying to test), the daemon responds with this error when trying to stop the container:

Error response from daemon: Could not kill running container <containerID>, cannot remove - Cannot kill container <containerID>: process <containerID> not found: not found

EDIT: Hmm, my entry point was also killed, so the "not found" actually makes sense. I don't know why it was killed, though, or why it did not try to restart.
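
If it helps anyone reproduce: a minimal sketch of the scenario, using a hypothetical container name, where the memory hog is an exec'd process rather than the entry point (the cgroup OOM killer may still pick the entry point, depending on scores, which would match the EDIT above):

sudo docker run -dit -m 64m --name oomtest busybox sh
sudo docker exec oomtest sh -c 'a=x; while true; do a="$a$a"; done'
dmesg | tail   # the memory-cgroup OOM kill should show up here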

We’re seeing the same issue as well with 17.05-ce:

$ docker version
Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:06:25 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:06:25 2017
 OS/Arch:      linux/amd64
 Experimental: false

Chiming in here to say I'm currently running into this exact same issue. When I strace the stranded container process, it's stuck in a loop that looks like the following:

core@mgmt-core3 ~ $ sudo strace -fp 19651
Process 19651 attached with 7 threads
[pid 19658] epoll_wait(4,  <unfinished ...>
[pid 19657] futex(0xc8202e8a10, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 19653] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 19656] futex(0x2217ec0, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 19655] futex(0xc820032e90, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 19654] futex(0xc820032a10, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 19651] futex(0x21edc70, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 19653] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 19653] clock_gettime(CLOCK_MONOTONIC, {10101645, 700502510}) = 0
[pid 19653] clock_gettime(CLOCK_MONOTONIC, {10101645, 700546823}) = 0
[pid 19653] clock_gettime(CLOCK_REALTIME, {1504043461, 385400682}) = 0
[pid 19653] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 19653] clock_gettime(CLOCK_MONOTONIC, {10101645, 700882818}) = 0
[pid 19653] futex(0x21ed120, FUTEX_WAIT, 0, {60, 0}) = -1 ETIMEDOUT (Connection timed out)

@loretoparisi no, it can be enabled/disabled without restarting the daemon, e.g. (assuming you don't have a daemon.json yet):

mkdir -p /etc/docker/

echo '{"debug":true}' > /etc/docker/daemon.json

systemctl reload docker.service
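
To verify the daemon picked up the change (the exact wording of the output differs between Docker versions):

sudo docker info | grep -i debug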

@loretoparisi if you want to use tini for a container: it's included with Docker; if you start a container with --init, then tini is automatically inserted in the container.

Start two containers; one with, and one without the --init option:

docker run -dit --name notini busybox
docker run -dit --init --name withtini busybox 

Check the output of docker container top for each container;

docker container top notini

PID                 USER                TIME                COMMAND
22192               root                0:00                sh


docker container top withtini 

PID                 USER                TIME                COMMAND
22261               root                0:00                /dev/init -- sh
22295               root                0:00                sh

You can also set the "init": true option in the daemon configuration file, in which case it is added by default for every container that's started; see the daemon configuration file documentation.
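
For reference, a minimal daemon.json with that option might look like the following (merge it with any settings you already have; unlike the debug flag above, this one likely needs a daemon restart rather than a reload to take effect):

{
  "init": true
}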

I ran the container again, and dockerd rose to 200% CPU again.

I called docker stats at 16:44:31 and immediately started seeing errors in the Docker log; it didn't manage to display any stats.

INFO[0001] loading plugin "io.containerd.monitor.v1.cgroups"...  module=containerd type=io.containerd.monitor.v1
INFO[0001] loading plugin "io.containerd.runtime.v1.linux"...  module=containerd type=io.containerd.runtime.v1
DEBU[0001] loading tasks in namespace                    module="containerd/io.containerd.runtime.v1.linux" namespace=moby
INFO[0001] loading plugin "io.containerd.grpc.v1.tasks"...  module=containerd type=io.containerd.grpc.v1
INFO[0001] loading plugin "io.containerd.grpc.v1.version"...  module=containerd type=io.containerd.grpc.v1
INFO[0001] loading plugin "io.containerd.grpc.v1.introspection"...  module=containerd type=io.containerd.grpc.v1
INFO[0001] serving...                                    address="/var/run/docker/containerd/docker-containerd-debug.sock" module="containerd/debug"
INFO[0001] serving...                                    address="/var/run/docker/containerd/docker-containerd.sock" module="containerd/grpc"
INFO[0001] containerd successfully booted in 0.219148s   module=containerd
DEBU[0001] garbage collected                             d=124.484681ms module="containerd/io.containerd.gc.v1.scheduler"
time="2018-01-23T16:44:31.479850804Z" level=debug msg="Calling GET /_ping"
time="2018-01-23T16:44:31.578334698Z" level=debug msg="Calling GET /v1.35/containers/json?all=1"
time="2018-01-23T16:45:04.894009833Z" level=debug msg="Calling GET /_ping"
time="2018-01-23T16:45:04.979324774Z" level=debug msg="Calling GET /v1.35/version"
time="2018-01-23T16:45:05.357875284Z" level=debug msg="Calling GET /v1.35/events?filters=%7B%22type%22%3A%7B%22container%22%3Atrue%7D%7D"
time="2018-01-23T16:45:05.405477625Z" level=debug msg="Calling GET /v1.35/containers/json?limit=0"
time="2018-01-23T16:45:05.610069174Z" level=debug msg="Calling GET /v1.35/containers/970eea2935b7/stats?stream=1"
time="2018-01-23T16:45:06.943016876Z" level=error msg="collecting stats for 970eea2935b78223145aa4e50bd6ab9d5afc65119b43fd2170b21076e7475248: connection error: desc = \"transport: dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout\": unknown"
time="2018-01-23T16:45:07.686678398Z" level=error msg="collecting stats for 970eea2935b78223145aa4e50bd6ab9d5afc65119b43fd2170b21076e7475248: connection error: desc = \"transport: dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout\": unknown"
time="2018-01-23T16:45:08.670797700Z" level=error msg="collecting stats for 970eea2935b78223145aa4e50bd6ab9d5afc65119b43fd2170b21076e7475248: connection error: desc = \"transport: dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout\": unknown"
time="2018-01-23T16:45:09.670771516Z" level=error msg="collecting stats for 970eea2935b78223145aa4e50bd6ab9d5afc65119b43fd2170b21076e7475248: connection error: desc = \"transport: dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout\": unknown"
time="2018-01-23T16:45:10.670785645Z" level=error msg="collecting stats for 970eea2935b78223145aa4e50bd6ab9d5afc65119b43fd2170b21076e7475248: connection error: desc = \"transport: dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout\": unknown"
time="2018-01-23T16:45:11.675497823Z" level=error msg="collecting stats for 970eea2935b78223145aa4e50bd6ab9d5afc65119b43fd2170b21076e7475248: connection error: desc = \"transport: dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout\": unknown"
time="2018-01-23T16:45:12.670733851Z" level=error msg="collecting stats for 970eea2935b78223145aa4e50bd6ab9d5afc65119b43fd2170b21076e7475248: connection error: desc = \"transport: dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout\": unknown"
time="2018-01-23T16:45:13.671462870Z" level=error msg="collecting stats for 970eea2935b78223145aa4e50bd6ab9d5afc65119b43fd2170b21076e7475248: connection error: desc = \"transport: dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout\": unknown"
time="2018-01-23T16:45:14.670743469Z" level=error msg="collecting stats for 970eea2935b78223145aa4e50bd6ab9d5afc65119b43fd2170b21076e7475248: connection error: desc = \"transport: dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout\": unknown"
time="2018-01-23T16:45:15.673333678Z" level=error msg="collecting stats for 970eea2935b78223145aa4e50bd6ab9d5afc65119b43fd2170b21076e7475248: connection error: desc = \"transport: dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout\": unknown"
time="2018-01-23T16:45:16.601085426Z" level=debug msg="Client context cancelled, stop sending events"
time="2018-01-23T16:45:42.205608962Z" level=debug msg="daemon is not responding" binary=docker-containerd error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=libcontainerd
time="2018-01-23T16:46:06.253183657Z" level=debug msg="daemon is not responding" binary=docker-containerd error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=libcontainerd
time="2018-01-23T16:46:10.052665503Z" level=debug msg="daemon is not responding" binary=docker-containerd error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=libcontainerd
time="2018-01-23T16:48:28.785303803Z" level=debug msg="daemon is not responding" binary=docker-containerd error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=libcontainerd

Unfortunately I had to stop and start the Docker service because I couldn't afford to have that computer locked, so I didn't check the status of containerd. I'll keep an eye on it, and if it happens again I'll update this.
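
If it happens again, a quick way to capture what dockerd, containerd, and the shims look like before restarting (a sketch; assumes pstree is installed and the daemon binary is named dockerd):

sudo pstree -p $(pidof dockerd)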

Some additional info on the issue we spotted, and a possible pointer to the root cause (well, for our system anyway). We're running a number of containers, the majority of which run a Java process. I was checking the message log and occasionally spotted an entry like this:

Sep 19 07:09:20 ip-xxx-xx-xx-xxx kernel: Out of memory: Kill process 7319 (java) score 161 or sacrifice child
Sep 19 07:09:20 ip-xxx-xx-xx-xxx kernel: Killed process 7319 (java) total-vm:7733876kB, anon-rss:2572608kB, file-rss:0kB, shmem-rss:0kB

I checked with ps to see if that process was running anymore:

> ps -e | grep java
 6850 ?        00:04:41 java
 7157 ?        00:20:40 java
 7345 ?        00:05:14 java
 7370 ?        00:06:44 java
 7384 ?        00:06:16 java
12723 ?        00:08:17 java
13636 ?        00:21:11 java
14367 ?        00:04:48 java
15654 ?        00:10:44 java

I then did a check on the PIDs from a docker perspective:

> docker ps -q | xargs docker inspect --format '{{.State.Pid}}, {{.Name}}'
7319, /appnode1_deviceconfigurationuiserver_1
13636, /appnode1_apigatewayserver_1
12723, /appnode1_searchserver_1
12247, /appnode1_ingestserver_1
6571, /appnode1_springbootadminserver_1
7370, /appnode1_deviceconfigurationserver_1
6850, /appnode1_turbine_1
7384, /appnode1_carouselserver_1
7345, /appnode1_hystrixdashboard_1
15654, /appnode1_adminserver_1
14367, /appnode1_configserver_1
7157, /appnode1_discoveryserver_1
6839, /appnode1_portainer_1

If you do a docker ps, all containers are listed as running, with uptimes of hours. When I try to restart the docker container associated with the killed process, it appears to work, until you view docker ps, where it again shows as running with an uptime of hours.

More worrying is that when you try to restart a container whose process is actually running, it too doesn't show any downtime and shows itself as running, but in fact the process has been killed:

> docker ps | grep hystrix
4572928e7190        xxxx/xxxx-services-hystrix-dashboard:1.1.0          "java -Xloggc:/var..."   6 days ago          Up 24 hours                                                          appnode1_hystrixdashboard_1
> docker restart 4572928e7190
> docker ps | grep hystrix
4572928e7190        xxxx/xxxx-services-hystrix-dashboard:1.1.0          "java -Xloggc:/var..."   6 days ago          Up 24 hours                                                          appnode1_hystrixdashboard_1
> docker ps -q | xargs docker inspect --format '{{.State.Pid}}, {{.Name}}' | grep appnode1_hystrixdashboard_1
7345, /appnode1_hystrixdashboard_1
> ps -e | grep java
 6850 ?        00:04:43 java
 7157 ?        00:20:51 java
 7370 ?        00:06:48 java
 7384 ?        00:06:20 java
12723 ?        00:08:19 java
13636 ?        00:21:19 java
14367 ?        00:04:50 java
15654 ?        00:10:48 java

And the message log shows something like the following:

Sep 19 15:13:12 ip-172-31-27-157 dockerd: time="2017-09-19T15:13:12.663764809Z" level=info msg="Container 4572928e7190464afda0df564442f158dfd632fbd0855e419463e8a754f9b30f failed to exit within 10 seconds of signal 15 - using the force"
Sep 19 15:13:12 ip-172-31-27-157 dockerd: time="2017-09-19T15:13:12.664249545Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 4572928e7190464afda0df564442f158dfd632fbd0855e419463e8a754f9b30f: rpc error: code = 2 desc = containerd: container not found"
Sep 19 15:13:22 ip-172-31-27-157 dockerd: time="2017-09-19T15:13:22.664524966Z" level=info msg="Container 4572928e7190 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Sep 19 15:13:22 ip-172-31-27-157 kernel: XFS (dm-12): Unmounting Filesystem

We had to restart the docker daemon, which killed and restarted the docker container processes too.
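
A small sketch to automate the PID cross-check from above, flagging containers whose recorded PID no longer exists on the host (assumes the daemon still answers API calls):

for c in $(docker ps -q); do
  pid=$(docker inspect --format '{{.State.Pid}}' "$c")
  # no /proc entry means the process is gone even though docker reports the container as running
  if [ ! -d "/proc/$pid" ]; then
    echo "stale container: $c (pid $pid no longer exists)"
  fi
done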

We are experiencing probably the same problem. After stopping a bunch of containers at the same time, we noticed a week later that on some hosts there were containers still “running”.

ubuntu@preprod:~$ sudo docker version
Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:14:09 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:14:09 2017
 OS/Arch:      linux/amd64
 Experimental: false

Other symptoms are similar: docker stop does not return an error (however, it writes "container not found" in the log), and docker top says "container not found". docker inspect states that the container is still running, however the process ID does not exist.

After the container is stopped:

$ sudo docker stop 3c2d52a9dbae
3c2d52a9dbae

You could also do a sudo docker inspect 3c2d52a9dbae to see the current state of the container. That may shed more light on whether the process is actually running.
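
For example, to dump just the state block (the ID is taken from the original report):

sudo docker inspect --format '{{json .State}}' 3c2d52a9dbae

The Running, Pid, and OOMKilled fields there usually tell you whether the daemon's view still matches a real process.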