moby: Killing docker-containerd breaks interaction with containers

After docker-containerd is killed, interacting with containers (docker exec, docker stop, docker kill) fails:

docker kill testing
Error response from daemon: Cannot kill container: testing: Cannot kill container 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359: connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused": unknown

docker rm -f lucid_yalow
Error response from daemon: Could not kill running container 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359, cannot remove - Cannot kill container 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359: connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused": unknown

But killing dockerd (killall -9 dockerd) or sending it a SIGHUP (killall -HUP dockerd) restores functionality.

This problem could explain some reports of “unkillable” containers, where everything appears to be running but interaction is not possible (possibly after containerd was OOM-killed, although there could be other causes).
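To tell this failure apart from other "cannot kill" errors, it may help to match on the socket error text quoted above. A minimal sketch, assuming the error wording stays the same as in this report; containerd_unreachable is a hypothetical helper name, not part of docker:

```shell
#!/bin/sh
# Sketch: classify a docker CLI error message read from stdin.
# containerd_unreachable is a hypothetical helper, not a docker command.
containerd_unreachable() {
    # Succeeds only when the message shows a refused connection
    # on the docker-containerd socket, as in the errors above.
    grep -q 'docker-containerd\.sock: connect: connection refused'
}
```

Possible usage: docker kill testing 2>&1 | containerd_unreachable && echo "dockerd cannot reach containerd". Any other error (for example "No such container") makes the helper return non-zero.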

Steps to reproduce / information

Have docker running, start a container, and check the output of ps auxf: docker-containerd and docker-containerd-shim are child processes of dockerd:

root     11468  1.1  3.4 468232 71036 ?        Ssl  11:56   0:01 /usr/bin/dockerd -H fd://
root     11473  0.4  1.3 236512 27856 ?        Ssl  11:56   0:00  \_ docker-containerd --config /var/run/docker/containerd/containerd.toml
root     11918  0.0  0.1   7516  3788 ?        Sl   11:57   0:00      \_ docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
root     11933  0.1  0.0   1236     4 pts/0    Ss+  11:57   0:00          \_ sh

Now, kill docker-containerd (killall -9 docker-containerd).

docker-containerd is restarted (by dockerd); observe that docker-containerd-shim and the container process(es) are reparented (I haven’t checked what the new parent process is, or whether this is relevant). The docker-containerd-shim processes are no longer child processes of docker-containerd:

root     11468  160  3.6 470984 74664 ?        Ssl  11:56  19:55 /usr/bin/dockerd -H fd://
root     11979  0.1  1.2 300992 25980 ?        Ssl  11:58   0:01  \_ docker-containerd --config /var/run/docker/containerd/containerd.toml
root     11918  0.0  0.2   7516  4688 ?        Sl   11:57   0:00 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
root     11933  0.0  0.0   1236     4 pts/0    Ss+  11:57   0:00  \_ sh
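The new parent of the shim was not checked in this report. An orphaned process is normally adopted by PID 1 or by the nearest subreaper, so one way to find out is to read the parent PID from /proc. A minimal sketch (Linux only); parent_pid and describe_parent are hypothetical helper names:

```shell
#!/bin/sh
# Sketch (Linux only): report the current parent of a process via /proc.
# parent_pid and describe_parent are hypothetical helper names.
parent_pid() {
    # The PPid field of /proc/<pid>/status holds the current parent PID.
    awk '/^PPid:/ { print $2 }' "/proc/$1/status"
}

describe_parent() {
    ppid=$(parent_pid "$1") || return 1
    # /proc/<pid>/comm holds the command name (truncated to 15 characters).
    printf '%s %s\n' "$ppid" "$(cat "/proc/$ppid/comm")"
}
```

With the PIDs from the listing above, describe_parent 11918 would print the PID and command name of whatever process now owns the shim.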

At this point, interacting with containers is broken.

Containers still show up as running:

docker ps

CONTAINER ID        IMAGE               COMMAND             CREATED              STATUS              PORTS               NAMES
9bfdba3fc8ee        busybox             "sh"                About a minute ago   Up About a minute                       testing

Inspecting the container still works, and shows the pid of the container;

docker inspect --format '{{json .State}}' testing | jq .

{
  "Status": "running",
  "Running": true,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": false,
  "Pid": 11933,
  "ExitCode": 0,
  "Error": "",
  "StartedAt": "2018-01-12T11:57:47.687627373Z",
  "FinishedAt": "0001-01-01T00:00:00Z"
}
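Since dockerd can no longer reach containerd here, it may be worth confirming independently that the reported Pid still exists. A minimal sketch (Linux only); pid_alive is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch (Linux only): check whether a PID reported by docker inspect
# still refers to a live process. pid_alive is a hypothetical helper.
pid_alive() {
    # Reject an empty argument, then look for the process directory in /proc.
    [ -n "$1" ] && [ -d "/proc/$1" ]
}
```

Possible usage: pid_alive "$(docker inspect --format '{{.State.Pid}}' testing)" && echo "container process is alive". This only shows that some process has that PID; it does not prove it is still the container's process.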

But any interaction with the containers is broken;

docker kill testing
Error response from daemon: Cannot kill container: testing: Cannot kill container 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359: connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused": unknown

docker rm -f lucid_yalow
Error response from daemon: Could not kill running container 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359, cannot remove - Cannot kill container 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359: connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused": unknown

When directly connecting to containerd, containers still show:

docker-containerd-ctr --namespace=moby --address /var/run/docker/containerd/docker-containerd.sock containers ls

CONTAINER                                                           IMAGE    RUNTIME                           
9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359    -        io.containerd.runtime.v1.linux    

And can be inspected;

docker-containerd-ctr --namespace=moby --address /var/run/docker/containerd/docker-containerd.sock containers info 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359

......

Shims are still up:

netstat -x | grep shim
unix  2      [ ]         STREAM     CONNECTED     64641    @/containerd-shim/moby/9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359/shim.sock
unix  3      [ ]         STREAM     CONNECTED     64019    @/containerd-shim/moby/9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359/shim.sock

docker-runc --root /var/run/docker/runtime-runc/moby/ state 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359
{
  "ociVersion": "1.0.0",
  "id": "9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359",
  "pid": 11933,
  "status": "running",
  "bundle": "/run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359",
  "rootfs": "/var/lib/docker/overlay2/9c0e355304db9fb85f7c1281b11008eea23bd4dbb142f11f551066c9fdb2e70e/merged",
  "created": "2018-01-12T11:57:47.631870877Z",
  "owner": ""
}
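For scripting the runc check above, the status field can be pulled out of the state output. A rough sketch that relies on the pretty-printed, one-key-per-line layout shown above (jq -r .status would be the more robust choice where jq is installed); runc_status is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: print the "status" value from runc state JSON read on stdin.
# Assumes one key per line, as in the output above. runc_status is a
# hypothetical helper, not part of runc.
runc_status() {
    sed -n 's/^[[:space:]]*"status":[[:space:]]*"\([^"]*\)".*$/\1/p'
}
```

Possible usage: docker-runc --root /var/run/docker/runtime-runc/moby/ state 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359 | runc_status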

And the container is still functional, when using docker-runc;

docker-runc --root /var/run/docker/runtime-runc/moby/ exec 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359 ls -la

total 44
drwxr-xr-x    1 root     root          4096 Jan 12 11:57 .
drwxr-xr-x    1 root     root          4096 Jan 12 11:57 ..
-rwxr-xr-x    1 root     root             0 Jan 12 11:57 .dockerenv
drwxr-xr-x    2 root     root         12288 Jan  8 21:14 bin
drwxr-xr-x    5 root     root           360 Jan 12 11:57 dev
drwxr-xr-x    1 root     root          4096 Jan 12 11:57 etc
drwxr-xr-x    2 nobody   nogroup       4096 Jan  8 21:14 home
dr-xr-xr-x  125 root     root             0 Jan 12 11:57 proc
drwxr-xr-x    2 root     root          4096 Jan  8 21:14 root
dr-xr-xr-x   13 root     root             0 Jan 12 11:57 sys
drwxrwxrwt    2 root     root          4096 Jan  8 21:14 tmp
drwxr-xr-x    3 root     root          4096 Jan  8 21:14 usr
drwxr-xr-x    4 root     root          4096 Jan  8 21:14 var

Restore functionality

Kill dockerd (killall -9 dockerd) or send it a SIGHUP (killall -HUP dockerd).

Observe that shims are not re-parented (which is probably expected);

root     11918  0.0  0.2   7516  4688 ?        Sl   11:57   0:00 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-contai
root     11933  0.0  0.0   1236     4 pts/0    Ss+  11:57   0:00  \_ sh
root     12287  1.1  2.8 446232 57824 ?        Ssl  12:55   0:00 /usr/bin/dockerd -H fd://
root     12293  0.7  1.1 300928 22616 ?        Ssl  12:55   0:00  \_ docker-containerd --config /var/run/docker/containerd/containerd.toml

But now it’s possible again to interact with them:

docker exec testing ls -la

total 44
drwxr-xr-x    1 root     root          4096 Jan 12 11:57 .
drwxr-xr-x    1 root     root          4096 Jan 12 11:57 ..
-rwxr-xr-x    1 root     root             0 Jan 12 11:57 .dockerenv
drwxr-xr-x    2 root     root         12288 Jan  8 21:14 bin
drwxr-xr-x    5 root     root           360 Jan 12 11:57 dev
drwxr-xr-x    1 root     root          4096 Jan 12 11:57 etc
drwxr-xr-x    2 nobody   nogroup       4096 Jan  8 21:14 home
dr-xr-xr-x  126 root     root             0 Jan 12 11:57 proc
drwxr-xr-x    1 root     root          4096 Jan 12 12:58 root
dr-xr-xr-x   13 root     root             0 Jan 12 11:57 sys
drwxrwxrwt    2 root     root          4096 Jan  8 21:14 tmp
drwxr-xr-x    3 root     root          4096 Jan  8 21:14 usr
drwxr-xr-x    4 root     root          4096 Jan  8 21:14 var

Version of docker and containerd

Tested on Ubuntu 16.04 on DigitalOcean;

docker-containerd --version
containerd github.com/containerd/containerd v1.0.0 89623f28b87a6004d4b785663257362d1658a729

docker version
Client:
 Version:	18.01.0-ce
 API version:	1.35
 Go version:	go1.9.2
 Git commit:	03596f5
 Built:	Wed Jan 10 20:11:05 2018
 OS/Arch:	linux/amd64
 Experimental:	false
 Orchestrator:	swarm

Server:
 Engine:
  Version:	18.01.0-ce
  API version:	1.35 (minimum version 1.12)
  Go version:	go1.9.2
  Git commit:	03596f5
  Built:	Wed Jan 10 20:09:37 2018
  OS/Arch:	linux/amd64
  Experimental:	false

docker info
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 18.01.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 89623f28b87a6004d4b785663257362d1658a729
runc version: b2567b37d7b75eb4cf325b77297b140ea686ce8f
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-108-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.953GiB
Name: ubuntu-2gb-ams3-01
ID: KIY5:X5P2:5FI5:GEPC:Q2OO:XF4P:KFB2:S22T:A76T:DVFV:UIFB:ZATY
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 26
  • Comments: 31 (15 by maintainers)

Most upvoted comments

We are facing a similar issue; the difference is in the reproduction steps. When we run out of memory on the builders, containerd is killed by the oom-killer and then restarted. The result is the same.
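To check whether the OOM killer was involved, the kernel log can be filtered for its "Killed process" lines. A minimal sketch, assuming the kernel's usual message wording; oom_kills and containerd_oom_kills are hypothetical helper names. Note that the kernel truncates command names to 15 characters, so docker-containerd and docker-containerd-shim both appear as docker-containe and cannot be told apart by name alone:

```shell
#!/bin/sh
# Sketch: filter kernel log text (read from stdin) for OOM kills.
# oom_kills and containerd_oom_kills are hypothetical helper names.
oom_kills() {
    # Keep only the kernel's "Killed process <pid> (<comm>)" lines.
    grep -E 'Killed process [0-9]+ \('
}

containerd_oom_kills() {
    # comm is truncated to 15 characters, so this matches both
    # docker-containerd and docker-containerd-shim.
    oom_kills | grep -F '(docker-containe'
}
```

Possible usage: dmesg | containerd_oom_kills, or journalctl -k | containerd_oom_kills on systems with journald.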

  1. ps aux | grep docker
  2. sudo kill pid_no

I killed every process one at a time. The two steps above worked for me.

@cberner Hopefully. Working on it anyway.

@cberner IIRC, containerd 1.0.2 adds some additional improvements, but https://github.com/moby/moby/pull/36173 was included in 17.12.1 (through https://github.com/docker/docker-ce/pull/417)

Fixed my issue with a renegade container by restarting docker on the Preferences Reset page.

@zmlpjuran thanks for adding that; yes I anticipated that if containerd was OOM-killed, the same would happen (see my top description)