moby: Docker daemon can't kill an unhealthy container and start a new one
Description
Hello there,
I have several 3-node Swarm clusters running in different regions. Most of them run on machines with Ubuntu Bionic and Docker 20.10.12. A few weeks ago I created one more cluster with Ubuntu Jammy and Docker 24.0.4, because I wanted to test the upgrade from the old Ubuntu release to the new one. It runs exactly the same Docker images as the rest of the clusters.
Most of the services work without any problems; only one of them breaks down after a few hours of operation, for example:
ubuntu@my-swarm-node-1:~$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
8wbwbatp5ojc0wb9ldjgt8rrf * my-swarm-node-1 Ready Active Leader 24.0.4
t28vtvaby72edvv0vka4cqmtx my-swarm-node-2 Ready Active Reachable 24.0.4
7hy719eg484smgx4ww3zy1cp7 my-swarm-node-3 Ready Active Reachable 24.0.4
ubuntu@my-swarm-node-1:~$
ubuntu@my-swarm-node-2:~$ docker ps |grep unhealthy
b5715b8730e9 my-api-gateway:5.0.1 "docker-entrypoint.sh" 10 hours ago Up 10 hours (unhealthy) my-platform_my-api-gateway.2.ja6uydyxp88eqpjlicy1a5brx
ubuntu@my-swarm-node-2:~$
ubuntu@my-swarm-node-2:~$ sudo grep b5715b8730e9 /var/log/syslog
Jul 24 09:49:26 my-swarm-node-2 dockerd[1056697]: time="2023-07-24T09:49:26.674053385Z" level=info msg="Configured log driver does not support reads, enabling local file cache for container logs" container=b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86 driver=gelf
Jul 24 09:49:26 my-swarm-node-2 containerd[832]: time="2023-07-24T09:49:26.681052173Z" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86 pid=1061772 runtime=io.containerd.runc.v2
Jul 24 09:49:26 my-swarm-node-2 systemd[1]: Started libcontainer container b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86.
Jul 24 15:17:33 my-swarm-node-2 dockerd[1056697]: time="2023-07-24T15:17:33.016085570Z" level=info msg="Container failed to exit within 10s of signal 15 - using the force" container=b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86
Jul 24 15:17:33 my-swarm-node-2 systemd[1]: docker-b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86.scope: Deactivated successfully.
Jul 24 15:17:33 my-swarm-node-2 systemd[1]: docker-b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86.scope: Consumed 1min 47.315s CPU time.
Jul 24 15:17:43 my-swarm-node-2 dockerd[1056697]: time="2023-07-24T15:17:43.032267014Z" level=error msg="Container failed to exit within 10s of kill - trying direct SIGKILL" container=b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86 error="context deadline exceeded"
ubuntu@my-swarm-node-2:~$
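Reading those dockerd lines: the engine first sends signal 15 (SIGTERM), waits 10 seconds, escalates to a forced kill through containerd, and after another 10 seconds falls back to a direct SIGKILL once the containerd request hits its deadline. As a rough, simplified sketch of that escalation (not the real moby code, which goes through containerd; the PID, timeouts and helper names below are only illustrative):

package main

import (
	"errors"
	"fmt"
	"syscall"
	"time"
)

// stopWithEscalation is a rough model of the escalation in the syslog above:
// SIGTERM first, then SIGKILL once the grace period expires.
func stopWithEscalation(pid int, grace time.Duration) error {
	if err := syscall.Kill(pid, syscall.SIGTERM); err != nil {
		return err
	}
	if waitForExit(pid, grace) {
		return nil
	}
	fmt.Printf("failed to exit within %s of signal 15 - using the force\n", grace)
	if err := syscall.Kill(pid, syscall.SIGKILL); err != nil {
		return err
	}
	if waitForExit(pid, grace) {
		return nil
	}
	return fmt.Errorf("process %d is still around after SIGKILL", pid)
}

// waitForExit polls until the PID disappears or the timeout elapses.
// Signal 0 only checks whether the process still exists.
func waitForExit(pid int, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if err := syscall.Kill(pid, 0); errors.Is(err, syscall.ESRCH) {
			return true
		}
		time.Sleep(200 * time.Millisecond)
	}
	return false
}

func main() {
	// 12345 is a placeholder PID; the 10s grace period matches the log lines.
	if err := stopWithEscalation(12345, 10*time.Second); err != nil {
		fmt.Println(err)
	}
}

In my case the second log line shows that even the final escalation step fails with "context deadline exceeded".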
As you can see below, I'm not able to check the failed container, because the Docker daemon becomes unresponsive:
ubuntu@my-swarm-node-2:~$ docker logs -n 100 b5715b8730e9
^C
ubuntu@my-swarm-node-2:~$
ubuntu@my-swarm-node-2:~$ docker inspect b5715b8730e9
^C
ubuntu@my-swarm-node-2:~$
I can’t kill the container either:
ubuntu@my-swarm-node-2:~$ docker kill b5715b8730e9
^C
ubuntu@my-swarm-node-2:~$
It seems that the container process no longer exists, but the Docker daemon still keeps its internal data:
root@my-swarm-node-2:/home/ubuntu# find /var/lib/docker/containers/b5715b8730e9*
/var/lib/docker/containers/b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86
/var/lib/docker/containers/b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86/hosts
/var/lib/docker/containers/b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86/resolv.conf.hash
/var/lib/docker/containers/b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86/container-cached.log
/var/lib/docker/containers/b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86/hostconfig.json
/var/lib/docker/containers/b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86/config.v2.json
/var/lib/docker/containers/b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86/resolv.conf
/var/lib/docker/containers/b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86/mounts
/var/lib/docker/containers/b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86/checkpoints
/var/lib/docker/containers/b5715b8730e97a8741c982f3bcf864046b31d46df6d64c5f4fb48a5d6aa95a86/hostname
root@my-swarm-node-2:/home/ubuntu#
I noticed that the Docker daemon still tries to collect the network stats but can't, so the number of connections to the /run/docker.sock socket keeps increasing:
ubuntu@my-swarm-node-2:/home/ubuntu# netstat |grep docker.sock |wc -l
1823
ubuntu@my-swarm-node-2:/home/ubuntu#
The only way to fix the issue is to stop and start the Docker daemon, but that only helps for a few hours.
The problem sometimes occurs on the my-swarm-node-2 instance, but sometimes also on my-swarm-node-1 and my-swarm-node-3.
I tried the old Docker 20.10.13 as well, but that didn't help either; the error appears there too. So now I'm not sure whether the issue is related to Docker or to Ubuntu Jammy.
How can I investigate the issue? As you can see above, the Docker daemon is not very verbose, and I can't see any related system events during this time either.
I’ll be thankful for any hints 😃
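In the meantime the hang can at least be detected from the outside so the daemon can be restarted sooner. A rough sketch of such a probe, assuming the Docker Go SDK (github.com/docker/docker/client); the container ID is passed on the command line, and the 5s/30s values are arbitrary:

package main

import (
	"context"
	"log"
	"os"
	"time"

	"github.com/docker/docker/client"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: probe <container-id>")
	}
	id := os.Args[1]

	// Talks to the local daemon over /var/run/docker.sock (or DOCKER_HOST).
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for {
		// "docker ps" still answers in this state; it is the per-container
		// calls (inspect/logs/kill/stats) that hang, so probe one of those
		// with a short timeout.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		_, err := cli.ContainerInspect(ctx, id)
		cancel()
		if err != nil {
			log.Printf("inspect of %s failed (possibly hung past the 5s timeout): %v", id, err)
		} else {
			log.Printf("%s still responds", id)
		}
		time.Sleep(30 * time.Second)
	}
}

For the goroutine dump mentioned under "Additional Info": sending SIGUSR1 to the dockerd process makes it write its full goroutine stacks to a file under the daemon's run directory, which is how the dump linked below was produced.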
Reproduce
To reproduce it, I need to deploy my Swarm stack from scratch and wait for about 5-6 hours.
Expected behavior
Docker should remove the unhealthy container, clean up all associated data, and launch a new container.
docker version
Client: Docker Engine - Community
Version: 24.0.4
API version: 1.43
Go version: go1.20.5
Git commit: 3713ee1
Built: Fri Jul 7 14:50:55 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 24.0.4
API version: 1.43 (minimum version 1.12)
Go version: go1.20.5
Git commit: 4ffc614
Built: Fri Jul 7 14:50:55 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.21
GitCommit: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc:
Version: 1.1.7
GitCommit: v1.1.7-0-g860f061
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
Client: Docker Engine - Community
Version: 24.0.4
Context: default
Debug Mode: false
Server:
Containers: 11
Running: 8
Paused: 0
Stopped: 3
Images: 16
Server Version: 24.0.4
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: gelf
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: t28vtvaby72edvv0vka4cqmtx
Is Manager: true
ClusterID: 2jujnedl6u4hkfz1xjmkkltb6
Managers: 3
Nodes: 3
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Data Path Port: 4789
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 172.31.42.234
Manager Addresses:
172.31.0.43:2377
172.31.23.159:2377
172.31.42.234:2377
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc version: v1.1.7-0-g860f061
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.15.0-76-generic
Operating System: Ubuntu 22.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 62.79GiB
Name: my-swarm-node-2
ID: 7b2258dd-3d3d-4ead-96b2-be5bf8080ce3
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional Info
A piece of the goroutine dump can be found at https://gist.github.com/ptecza/f47688cd357083ae350586bdf2e2e280
Hiya! Sorry for taking a while to reply, have been looking into other things and missed the pings 😅.
From looking at the logs you shared (thanks! ❤️), the breakdown of the situation seems to be:
- the container's healthcheck keeps failing (after n retries, the container gets moved to an unhealthy state)
- the daemon then tries to kill the unhealthy container so it can be replaced, and that kill never completes

This is supported by the following logs in your goroutine dump:
- 3049 goroutines stuck locking on https://github.com/moby/moby/blob/b32c7594a53ff2cc2751c2ae2340736a46d72340/container/state.go#L244-L249, which is expected, and why you can't check/kill/do much with the container.
- 8 goroutines stuck locking on https://github.com/moby/moby/blob/b32c7594a53ff2cc2751c2ae2340736a46d72340/container/state.go#L215-L227
- 1 goroutine hanging on https://github.com/moby/moby/blob/b32c7594a53ff2cc2751c2ae2340736a46d72340/container/state.go#L252-L257, which doesn't tell us much, except confirm that the issue is indeed that we're stuck trying to kill the container, and that's why everything else is hanging.
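To make the shape of that contention concrete, here's a stripped-down model (not the actual container/State code; all names are made up) of what those stacks amount to: one goroutine grabs the per-container lock and then waits forever for the kill to finish, and every other per-container API call queues up behind it:

package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeContainer is a stand-in for the per-container state: one mutex
// guards all per-container operations, just like in the stacks above.
type fakeContainer struct {
	mu sync.Mutex
}

// kill holds the lock while waiting for the runtime to confirm the exit.
// If that confirmation never comes, the lock is never released.
func (c *fakeContainer) kill() {
	c.mu.Lock()
	defer c.mu.Unlock()
	select {} // models "waiting forever for the exit to be processed"
}

// inspect stands in for any other per-container call (inspect/logs/stats/kill):
// it cannot even start until the mutex is free.
func (c *fakeContainer) inspect(id int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	fmt.Println("inspect", id, "done")
}

func main() {
	c := &fakeContainer{}
	go c.kill() // the stuck kill goroutine
	time.Sleep(100 * time.Millisecond)

	// These model the ~3049 goroutines piling up on state.go#L244-L249:
	// they all block in mu.Lock() and none of them ever returns.
	for i := 0; i < 5; i++ {
		go c.inspect(i)
	}
	time.Sleep(time.Second)
	fmt.Println("the client calls are still hanging, just like docker inspect/logs/kill above")
}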
Looking further through the logs, we eventually find the root cause of why we're hanging while trying to kill the container:
While trying to process the container's exit event, which locks the container's state (and hence everything else), we're stuck closing its IO: https://github.com/moby/moby/blob/89b542b421f439b3c703098f7f1c29f661e430bb/pkg/broadcaster/unbuffered.go#L9-L12
That is stuck on Write(): https://github.com/moby/moby/blob/89b542b421f439b3c703098f7f1c29f661e430bb/pkg/broadcaster/unbuffered.go#L23-L31
which in turn locks Clean() out of working: https://github.com/moby/moby/blob/89b542b421f439b3c703098f7f1c29f661e430bb/pkg/broadcaster/unbuffered.go#L41-L42
presumably because a bad attached writer is hanging.
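To illustrate the failure mode, here's a simplified model of that broadcaster (not the real pkg/broadcaster code): Write holds the mutex while it writes to each attached destination in turn, so a single destination that never returns pins the mutex, and Clean, which needs the same mutex to detach writers, can never run:

package main

import (
	"fmt"
	"io"
	"sync"
	"time"
)

// unbuffered is a simplified stand-in for the broadcaster: one mutex
// guards both the fan-out Write and Clean.
type unbuffered struct {
	mu      sync.Mutex
	writers []io.Writer
}

func (u *unbuffered) Add(w io.Writer) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.writers = append(u.writers, w)
}

// Write fans out sequentially while holding the mutex; if any single
// writer blocks (e.g. a log target that stops draining), Write never
// returns and the mutex is never released.
func (u *unbuffered) Write(p []byte) (int, error) {
	u.mu.Lock()
	defer u.mu.Unlock()
	for _, w := range u.writers {
		w.Write(p)
	}
	return len(p), nil
}

// Clean needs the same mutex, so it blocks behind the stuck Write; in the
// report above, this is what the exit-event handler is waiting on while
// holding the container state lock, which is why everything else hangs too.
func (u *unbuffered) Clean() {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.writers = nil
}

// stuckWriter models the misbehaving attached writer: its Write never returns.
type stuckWriter struct{}

func (stuckWriter) Write(p []byte) (int, error) { select {} }

func main() {
	u := &unbuffered{}
	u.Add(stuckWriter{})

	go u.Write([]byte("container stdout line")) // hangs forever

	done := make(chan struct{})
	go func() { u.Clean(); close(done) }()

	select {
	case <-done:
		fmt.Println("Clean finished")
	case <-time.After(2 * time.Second):
		fmt.Println("Clean is stuck behind the blocked Write - this is the hang")
	}
}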
I also found https://github.com/moby/moby/issues/42705, which is a report of the same issue: one of the writers for that Unbuffered gets broken, locking forever on Write, which then locks everything else out (in that case it's the Loki logging driver getting restarted that causes it). Speculating here, but from your logs @ptecza, the cause might be the gelf driver misbehaving.
Update: I just noticed we already also had this other report – https://github.com/moby/moby/issues/41827, AND a PR to fix it: https://github.com/moby/moby/pull/41828. I'll try to take a look at it/ping the author and see if we can get it reviewed/merged.
@laurazard I have one unhealthy Docker container again 😃
I've uploaded the debugging details at https://gist.github.com/ptecza/5c29d0faf64b8bf0d4b0e415136bae8d
It includes the following files:
- gistfile1.txt
- gistfile2.txt
- gistfile3.txt
- gistfile4.txt
Please let me know if I can deliver anything else.