moby: Killing docker-containerd breaks interaction with containers
When killing docker-containerd, interacting with containers (docker exec, docker stop, docker kill) fails:
docker kill testing
Error response from daemon: Cannot kill container: testing: Cannot kill container 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359: connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused": unknown
docker rm -f lucid_yalow
Error response from daemon: Could not kill running container 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359, cannot remove - Cannot kill container 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359: connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused": unknown
But killing dockerd (either with killall -9 dockerd, or by sending it a SIGHUP with killall -HUP dockerd) restores functionality.
This problem could explain some reports about “unkillable” containers, where everything appears to be running but interaction is not possible (possibly after containerd was OOM-killed, though there could be other causes).
Steps to reproduce / information
Have docker running, start a container, and check the output of ps auxf: docker-containerd and docker-containerd-shim are child processes of dockerd:
root 11468 1.1 3.4 468232 71036 ? Ssl 11:56 0:01 /usr/bin/dockerd -H fd://
root 11473 0.4 1.3 236512 27856 ? Ssl 11:56 0:00 \_ docker-containerd --config /var/run/docker/containerd/containerd.toml
root 11918 0.0 0.1 7516 3788 ? Sl 11:57 0:00 \_ docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
root 11933 0.1 0.0 1236 4 pts/0 Ss+ 11:57 0:00 \_ sh
Now, kill docker-containerd (killall -9 docker-containerd).
docker-containerd is restarted (by dockerd); observe that docker-containerd-shim and the container process(es) are reparented (I haven’t checked what the new parent process is, or whether this is relevant). The docker-containerd-shim processes are no longer child processes of docker-containerd:
root 11468 160 3.6 470984 74664 ? Ssl 11:56 19:55 /usr/bin/dockerd -H fd://
root 11979 0.1 1.2 300992 25980 ? Ssl 11:58 0:01 \_ docker-containerd --config /var/run/docker/containerd/containerd.toml
root 11918 0.0 0.2 7516 4688 ? Sl 11:57 0:00 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
root 11933 0.0 0.0 1236 4 pts/0 Ss+ 11:57 0:00 \_ sh
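The new parent of the shim was not checked above. A minimal sketch for looking it up, assuming a Linux host with /proc mounted; `ppid_of` is a hypothetical helper name, not part of Docker or containerd:

```shell
# ppid_of PID: print the parent PID of a process (hypothetical helper).
# Reads the "PPid:" field from /proc/PID/status; assumes Linux with /proc.
ppid_of() {
    awk '/^PPid:/ { print $2 }' "/proc/$1/status"
}

# Example (PID taken from the listing above):
#   ppid_of 11918
```

If the shim was orphaned and picked up by init, this would print 1; a different value would point at a subreaper process instead.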
At this point, interacting with containers is now broken…
Containers still show up as running:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9bfdba3fc8ee busybox "sh" About a minute ago Up About a minute testing
Inspecting the container still works, and shows the pid of the container:
docker inspect --format '{{json .State}}' testing | jq .
{
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 11933,
"ExitCode": 0,
"Error": "",
"StartedAt": "2018-01-12T11:57:47.687627373Z",
"FinishedAt": "0001-01-01T00:00:00Z"
}
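To confirm that the reported Pid still belongs to a live process (and is not just stale state held by the daemon), something like the following could be used. This is only a sketch; `pid_alive` is a hypothetical helper name, and it should be run as root so that the signal check is permitted:

```shell
# pid_alive PID: succeed if a process with this PID exists and can be signalled.
# Signal 0 only performs the existence/permission check; nothing is delivered.
pid_alive() {
    kill -0 "$1" 2>/dev/null
}

# Example, using the container name from this report:
#   pid_alive "$(docker inspect --format '{{.State.Pid}}' testing)" \
#       && echo "container process is still running"
```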
But any interaction with the containers is broken;
docker kill testing
Error response from daemon: Cannot kill container: testing: Cannot kill container 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359: connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused": unknown
docker rm -f lucid_yalow
Error response from daemon: Could not kill running container 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359, cannot remove - Cannot kill container 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359: connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused": unknown
When directly connecting to containerd, containers still show:
docker-containerd-ctr --namespace=moby --address /var/run/docker/containerd/docker-containerd.sock containers ls
CONTAINER IMAGE RUNTIME
9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359 - io.containerd.runtime.v1.linux
And can be inspected;
docker-containerd-ctr --namespace=moby --address /var/run/docker/containerd/docker-containerd.sock containers info 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359
......
Shims are still up:
netstat -x | grep shim
unix 2 [ ] STREAM CONNECTED 64641 @/containerd-shim/moby/9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359/shim.sock
unix 3 [ ] STREAM CONNECTED 64019 @/containerd-shim/moby/9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359/shim.sock
docker-runc --root /var/run/docker/runtime-runc/moby/ state 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359
{
"ociVersion": "1.0.0",
"id": "9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359",
"pid": 11933,
"status": "running",
"bundle": "/run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359",
"rootfs": "/var/lib/docker/overlay2/9c0e355304db9fb85f7c1281b11008eea23bd4dbb142f11f551066c9fdb2e70e/merged",
"created": "2018-01-12T11:57:47.631870877Z",
"owner": ""
}
And the container is still functional when using docker-runc:
docker-runc --root /var/run/docker/runtime-runc/moby/ exec 9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359 ls -la
total 44
drwxr-xr-x 1 root root 4096 Jan 12 11:57 .
drwxr-xr-x 1 root root 4096 Jan 12 11:57 ..
-rwxr-xr-x 1 root root 0 Jan 12 11:57 .dockerenv
drwxr-xr-x 2 root root 12288 Jan 8 21:14 bin
drwxr-xr-x 5 root root 360 Jan 12 11:57 dev
drwxr-xr-x 1 root root 4096 Jan 12 11:57 etc
drwxr-xr-x 2 nobody nogroup 4096 Jan 8 21:14 home
dr-xr-xr-x 125 root root 0 Jan 12 11:57 proc
drwxr-xr-x 2 root root 4096 Jan 8 21:14 root
dr-xr-xr-x 13 root root 0 Jan 12 11:57 sys
drwxrwxrwt 2 root root 4096 Jan 8 21:14 tmp
drwxr-xr-x 3 root root 4096 Jan 8 21:14 usr
drwxr-xr-x 4 root root 4096 Jan 8 21:14 var
Restore functionality
Kill dockerd (killall -9 dockerd) or send it a SIGHUP (killall -HUP dockerd).
Observe that shims are not re-parented (which is probably expected);
root 11918 0.0 0.2 7516 4688 ? Sl 11:57 0:00 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/9bfdba3fc8eee79d6ca5773f7caff5dc5a8379037e98b6ded5c8b68df5750359 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-contai
root 11933 0.0 0.0 1236 4 pts/0 Ss+ 11:57 0:00 \_ sh
root 12287 1.1 2.8 446232 57824 ? Ssl 12:55 0:00 /usr/bin/dockerd -H fd://
root 12293 0.7 1.1 300928 22616 ? Ssl 12:55 0:00 \_ docker-containerd --config /var/run/docker/containerd/containerd.toml
But now it’s possible again to interact with them:
docker exec testing ls -la
total 44
drwxr-xr-x 1 root root 4096 Jan 12 11:57 .
drwxr-xr-x 1 root root 4096 Jan 12 11:57 ..
-rwxr-xr-x 1 root root 0 Jan 12 11:57 .dockerenv
drwxr-xr-x 2 root root 12288 Jan 8 21:14 bin
drwxr-xr-x 5 root root 360 Jan 12 11:57 dev
drwxr-xr-x 1 root root 4096 Jan 12 11:57 etc
drwxr-xr-x 2 nobody nogroup 4096 Jan 8 21:14 home
dr-xr-xr-x 126 root root 0 Jan 12 11:57 proc
drwxr-xr-x 1 root root 4096 Jan 12 12:58 root
dr-xr-xr-x 13 root root 0 Jan 12 11:57 sys
drwxrwxrwt 2 root root 4096 Jan 8 21:14 tmp
drwxr-xr-x 3 root root 4096 Jan 8 21:14 usr
drwxr-xr-x 4 root root 4096 Jan 8 21:14 var
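The SIGHUP workaround could be scripted roughly as follows. This is a sketch under assumptions, not a fix: both function names are hypothetical, the match is based only on the error text quoted in this report, and it assumes docker exec reports the same connection error as docker kill (the report only shows the message for docker kill and docker rm -f).

```shell
# is_containerd_conn_error MSG: succeed if MSG looks like the
# "connection refused" error on the containerd socket quoted in this report.
is_containerd_conn_error() {
    case "$1" in
        *"docker-containerd.sock: connect: connection refused"*) return 0 ;;
        *) return 1 ;;
    esac
}

# recover_if_broken CONTAINER: hypothetical wrapper. Probes the container with
# a harmless `docker exec ... true` (assumes `true` exists in the image) and,
# only if that fails with the error above, sends SIGHUP to dockerd.
recover_if_broken() {
    err=$(docker exec "$1" true 2>&1 >/dev/null) && return 0
    if is_containerd_conn_error "$err"; then
        killall -HUP dockerd
    else
        printf '%s\n' "$err" >&2
        return 1
    fi
}
```

Calling `recover_if_broken testing` leaves dockerd alone when the probe succeeds or fails for an unrelated reason, so the SIGHUP is only sent in the situation described here.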
Version of docker and containerd
Tested on Ubuntu 16.04 on DigitalOcean;
docker-containerd --version
containerd github.com/containerd/containerd v1.0.0 89623f28b87a6004d4b785663257362d1658a729
Client:
Version: 18.01.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: 03596f5
Built: Wed Jan 10 20:11:05 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm
Server:
Engine:
Version: 18.01.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: 03596f5
Built: Wed Jan 10 20:09:37 2018
OS/Arch: linux/amd64
Experimental: false
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 2
Server Version: 18.01.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 89623f28b87a6004d4b785663257362d1658a729
runc version: b2567b37d7b75eb4cf325b77297b140ea686ce8f
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-108-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.953GiB
Name: ubuntu-2gb-ams3-01
ID: KIY5:X5P2:5FI5:GEPC:Q2OO:XF4P:KFB2:S22T:A76T:DVFV:UIFB:ZATY
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
About this issue
- State: closed
- Created 6 years ago
- Reactions: 26
- Comments: 31 (15 by maintainers)
We are facing a similar issue; the difference is in the steps to reproduce. When we run out of memory on our builders, containerd is killed by the oom-killer and then restarted. The result is the same.
I killed every process one at a time. The above two steps worked for me.
@cberner Hopefully. Working on it anyway.
@cberner IIRC, containerd 1.0.2 adds some additional improvements, but https://github.com/moby/moby/pull/36173 was included in 17.12.1 (through https://github.com/docker/docker-ce/pull/417)
Fixed my issue with a renegade container by restarting docker on the Preferences Reset page.
@zmlpjuran thanks for adding that; yes I anticipated that if containerd was OOM-killed, the same would happen (see my top description)