moby: Container cannot be stopped, removed or exec'ed
Description
Some containers cannot be stopped. When running docker stop, it doesn’t output anything and hangs until Ctrl + C is pressed. The same behavior happens with docker kill, docker rm -f and docker attach (which exits only with kill -9 from another terminal). docker inspect and docker logs work normally.
When running docker exec <container_id> ls I get:
rpc error: code = 13 desc = invalid header field value "oci runtime error: exec failed: container_linux.go:247: starting container process caused \"process_linux.go:83: executing setns process caused \\\"exit status 16\\\"\"\n"
After this, docker exec hangs, i.e. the number of ExecIDs for the container increases by one. This can be checked with docker inspect --format "{{ len .ExecIDs }} {{ .Name }}" <container_id>
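The ExecIDs check described above can be scripted to watch whether the count grows after each failed exec. The following is a minimal sketch, not part of the original report: it assumes the docker CLI is on PATH, and the helper names parse_exec_count and exec_count are hypothetical. It relies only on the docker inspect --format invocation quoted above, which still responds on an affected container.

```python
import subprocess

# Same Go template as the docker inspect command quoted in the report.
FORMAT = "{{ len .ExecIDs }} {{ .Name }}"


def parse_exec_count(line):
    """Split one '<count> <name>' output line into (int, str)."""
    count, name = line.strip().split(maxsplit=1)
    return int(count), name


def exec_count(container_id):
    """Run docker inspect for one container and return (exec count, name).

    Hypothetical helper: the timeout guards against the CLI itself hanging.
    """
    result = subprocess.run(
        ["docker", "inspect", "--format", FORMAT, container_id],
        check=True,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return parse_exec_count(result.stdout)
```

Calling exec_count before and after a failing docker exec and comparing the two counts would show whether the exec was left registered on the container.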
Steps to reproduce the issue: This issue happens randomly (~ once per week) and I couldn’t figure out a way to reproduce it yet.
Describe the results you received: Container can’t be stopped, killed or removed. docker exec doesn’t work.
Describe the results you expected:
Docker container can be stopped and removed (rm, kill). Commands can be executed inside of the container (attach, exec).
Additional information you deem important (e.g. issue happens only occasionally):
- This issue seems to be related to #29794 and #31007, although the former seems to report many different issues in a single ticket.
- Container process seems to be in uninterruptible sleep state:
PID TTY STAT TIME COMMAND
20542 ? Ds 0:00 [dumb-init]
- It seems to be stuck on the zap_pid_ns_processes kernel call:
# cat /proc/20542/stack
[<ffffffff81120f1f>] zap_pid_ns_processes+0x13f/0x1a0
[<ffffffff81084731>] do_exit+0xa81/0xb00
[<ffffffff81084833>] do_group_exit+0x43/0xb0
[<ffffffff810848b4>] SyS_exit_group+0x14/0x20
[<ffffffff8183c5f2>] entry_SYSCALL_64_fastpath+0x16/0x71
[<ffffffffffffffff>] 0xffffffffffffffff
- Parent process is in sleep state:
# ps 20525
PID TTY STAT TIME COMMAND
20525 ? Sl 0:00 docker-containerd-shim 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a /var/run/docker/libcontainerd/2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86
- Output of sudo docker-runc exec --cwd / -e PATH=/bin 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a ls:
nsenter: failed to open /proc/20542/ns/ipc: No such file or directory
exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 16\""
- Output of docker logs (truncated, since the latest samba-smbd related error repeats once per minute):
Added user spark.
Collect hostkey for master
Collect hostkey for localhost
Collect hostkey for 0.0.0.0
Starting namenodes on [master]
master: Warning: Permanently added the ECDSA host key for IP address '10.1.15.3' to the list of known hosts.
master: starting namenode, logging to /opt/hadoop/hadoop-2.7.3/logs/hadoop-hadoop-namenode-master.out
localhost: starting datanode, logging to /opt/hadoop/hadoop-2.7.3/logs/hadoop-hadoop-datanode-master.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/hadoop-2.7.3/logs/hadoop-hadoop-secondarynamenode-master.out
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/spark-2.0.1-bin-hadoop2.7/logs/spark-spark-org.apache.spark.deploy.master.Master-1-master.out
2017-03-07 02:13:49,357 INFO supervisord started with pid 14
2017-03-07 02:13:50,360 INFO spawned: 'openssh-server' with pid 20
2017-03-07 02:13:50,362 INFO spawned: 'samba-smbd' with pid 21
2017-03-07 02:13:50,363 INFO spawned: 'samba-nmbd' with pid 22
2017-03-07 02:13:51,537 INFO success: openssh-server entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-03-07 02:13:51,537 INFO success: samba-smbd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-03-07 02:13:51,537 INFO success: samba-nmbd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-03-07 02:14:01,628 INFO exited: samba-smbd (terminated by SIGTERM; not expected)
2017-03-07 02:14:02,633 INFO spawned: 'samba-smbd' with pid 425
2017-03-07 02:14:03,833 INFO success: samba-smbd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-03-07 02:15:01,518 INFO exited: samba-smbd (terminated by SIGTERM; not expected)
- Excerpt of journalctl -u docker that may be relevant to this issue (the gzipped log is 240MB, but I can send it upon request):
Mar 07 12:26:10 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T12:26:10.238254768Z" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"start-process\", Id:\"2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a\", Status:0x0, Pid:\"879fb34a699c2128485596549cc9d16f7d5f917f7a2b1a4ac23e7c77b10ef313\", Timestamp:(*timestamp.Timestamp)(0xc82446fa70)}"
Mar 07 12:26:10 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T12:26:10.238444580Z" level=debug msg="libcontainerd: event unhandled: type:\"start-process\" id:\"2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a\" pid:\"879fb34a699c2128485596549cc9d16f7d5f917f7a2b1a4ac23e7c77b10ef313\" timestamp:<seconds:1488889570 nanos:237642549 > "
Mar 07 12:26:10 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T12:26:10.774784263Z" level=debug msg="containerd: process exited" id=2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a pid=879fb34a699c2128485596549cc9d16f7d5f917f7a2b1a4ac23e7c77b10ef313 status=0 systemPid=21461
Mar 07 12:26:10 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T12:26:10.775580140Z" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"exit\", Id:\"2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a\", Status:0x0, Pid:\"879fb34a699c2128485596549cc9d16f7d5f917f7a2b1a4ac23e7c77b10ef313\", Timestamp:(*timestamp.Timestamp)(0xc829510c00)}"
Mar 07 12:26:40 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T12:26:40.776904366Z" level=debug msg="starting exec command 5b934d22287cc939f54f95769b1eb42494db450610e7b01ce6d4d9b607b7a86e in container 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a"
Mar 07 12:32:04 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T12:32:04.724028384Z" level=debug msg="Sending 15 to 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a"
Mar 07 12:34:37 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T12:34:37.899622463Z" level=info msg="Container 2b585a819f31 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Mar 07 13:06:09 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T13:06:09.211150018Z" level=info msg="Container 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a failed to exit within 10 seconds of signal 15 - using the force"
Mar 07 13:06:09 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T13:06:09.211198698Z" level=debug msg="Sending 9 to 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a"
Mar 07 13:27:12 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T13:27:12.464715820Z" level=info msg="Container 2b585a819f31 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Mar 07 14:02:24 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T14:02:24.363250206Z" level=info msg="Container 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a failed to exit within 10 seconds of signal 15 - using the force"
Mar 07 14:02:24 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T14:02:24.363341827Z" level=debug msg="Sending 9 to 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a"
Mar 07 15:08:06 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T15:08:06.571521919Z" level=info msg="Container 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a failed to exit within 10 seconds of signal 15 - using the force"
Mar 07 15:08:06 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T15:08:06.571621959Z" level=debug msg="Sending 9 to 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a"
Mar 07 15:34:02 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T15:34:02.550021915Z" level=debug msg="starting exec command 110737c8b19da62e188434b363150427994185d79a95180f931acae1b71c6b51 in container 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a"
Mar 07 15:42:07 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T15:42:07.226136748Z" level=info msg="Container 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a failed to exit within 10 seconds of signal 15 - using the force"
Mar 07 15:42:07 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T15:42:07.226229951Z" level=debug msg="Sending 9 to 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a"
Mar 07 16:08:17 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T16:08:17.523414592Z" level=debug msg="Sending 15 to 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a"
Mar 07 16:11:11 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T16:11:11.920430683Z" level=debug msg="Sending 9 to 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a"
Mar 07 16:12:49 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T16:12:49.010837942Z" level=debug msg="starting exec command e4764e23e9740e6212cc6fa2f1429d774b7ea807cfe43602277b505922c2a128 in container 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a"
Mar 07 16:19:17 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T16:19:17.805066095Z" level=debug msg="Sending 15 to 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a"
Mar 07 16:19:19 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T16:19:19.402231732Z" level=debug msg="Sending 15 to 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a"
Mar 07 16:19:20 ip-10-69-11-89 dockerd[89174]: time="2017-03-07T16:19:20.855255324Z" level=debug msg="Sending 15 to 2b585a819f31b11a642bb66efd921f5b39fe5b7550b4e06702c0f10e86d4ca1a"
- Output of sudo ls -l /var/run/runc, ps axjf, sudo docker-runc list: https://gist.github.com/thiagoalves/fe8da3ecabe23990ed9b8b2198c4e69d
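The two process checks in the list above (the Ds state reported by ps and the zap_pid_ns_processes frame in /proc/<pid>/stack) can be automated when many containers have to be inspected. This is a rough sketch under stated assumptions: the input is text in the same five-column ps layout and the same stack frame layout as shown in this report, and the function names uninterruptible and stack_functions are hypothetical.

```python
import re

# Matches frames such as "[<ffffffff81084731>] do_exit+0xa81/0xb00".
# The trailing "[<ffffffffffffffff>] 0xffffffffffffffff" marker has no
# "symbol+0x" part, so it is skipped.
STACK_FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x")


def uninterruptible(ps_output):
    """Return (pid, command) for rows whose STAT column starts with 'D'.

    Assumes the column order PID TTY STAT TIME COMMAND.
    """
    rows = []
    for line in ps_output.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header line or malformed row
        pid, _tty, stat, _time, command = fields
        if stat.startswith("D"):
            rows.append((int(pid), command))
    return rows


def stack_functions(stack_text):
    """Return kernel function names from /proc/<pid>/stack text, top first."""
    return STACK_FRAME.findall(stack_text)
```

Reading /proc/<pid>/stack normally requires root, so in practice stack_functions would be fed text collected with sudo cat; a first element of zap_pid_ns_processes corresponds to the state described in this report.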
Output of docker version:
Client:
Version: 1.12.3-cs4
API version: 1.24
Go version: go1.6.3
Git commit: 65c6c4c
Built: Fri Nov 11 16:23:03 2016
OS/Arch: linux/amd64
Server:
Version: 1.12.3-cs4
API version: 1.24
Go version: go1.6.3
Git commit: 65c6c4c
Built: Fri Nov 11 16:23:03 2016
OS/Arch: linux/amd64
Output of docker info:
Containers: 240
Running: 221
Paused: 0
Stopped: 19
Images: 749
Server Version: 1.12.3-cs4
Storage Driver: aufs
Root Dir: /opt/io1/docker/aufs
Backing Filesystem: extfs
Dirs: 3740
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-64-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 128
Total Memory: 1.876 TiB
Name: ip-10-69-11-89
ID: UGZS:UFD3:GB4C:W5MX:JU2L:K7PH:6ZWS:4GPM:27Q5:UNNN:X3DC:YDT7
Docker Root Dir: /opt/io1/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 2362
Goroutines: 2607
System Time: 2017-03-07T16:26:05.191047529Z
EventsListeners: 2
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8
Additional environment details (AWS, VirtualBox, physical, etc.): Ubuntu 16.04 on EC2 dedicated host (x1.32xlarge)
About this issue
- State: closed
- Created 7 years ago
- Reactions: 2
- Comments: 21 (20 by maintainers)
@thaJeztah, I don’t have a reliable reproducer, so it is not worth creating a ticket, but IMHO it is valuable to keep accumulating reports until somebody spots a pattern or gets an idea of what is going on 😃
We are using tini on some celery containers, but this one was left without it. Another bit of information is that there were numerous OOM kills in that container, with the last process killed being 10540 (I assume that was PID 1, because it is the only one which is not <defunct>).
Happens on 18.03.1:
docker exec -it fe16ae840d6f bash
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped
The container fe16ae840d6f is up!