moby: Containers cannot be stopped / removed due to rpc error code 2 (setns process caused "exit status 15")

Description

Steps to reproduce the issue:

  1. Unfortunately we haven’t found a way to reproduce the issue

Describe the results you received:

  • docker exec results in an error message
root@docker-linux-1-dh:~# docker exec -it prod_m1af_appserver_paf-as1 ping 8.8.8.8 -c 2                                                                        
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 15\""
  • docker stop and docker restart run forever (until Ctrl+C is pressed)
  • docker attach locks the terminal (Ctrl + C, Ctrl + Z and Ctrl + D don’t work)

This behavior was present only for this container

Describe the results you expected: Docker exec, restart, stop, and attach working fine

Additional information you deem important (e.g. issue happens only occasionally):

root@docker-linux-1-dh:~# docker top prod_m1af_appserver_paf-as1 
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
  • The container had 3 shims related to it
root@docker-linux-1-dh:~# ps -ef | grep 0ce69c426d21
root       1030      1  0 Aug03 ?        00:00:16 docker-containerd-shim 0ce69c426d213b4ad3e07ba6da934555a6ec36a7edcb3050b2951b1b4a4ca445 /var/run/docker/libcontainerd/0ce69c426d213b4ad3e07ba6da934555a6ec36a7edcb3050b2951b1b4a4ca445 docker-runc
root       4250      1  0 Aug01 ?        00:00:00 docker-containerd-shim 0ce69c426d213b4ad3e07ba6da934555a6ec36a7edcb3050b2951b1b4a4ca445 /var/run/docker/libcontainerd/0ce69c426d213b4ad3e07ba6da934555a6ec36a7edcb3050b2951b1b4a4ca445 docker-runc
root      63381      1  0 Aug03 ?        00:00:00 docker-containerd-shim 0ce69c426d213b4ad3e07ba6da934555a6ec36a7edcb3050b2951b1b4a4ca445 /var/run/docker/libcontainerd/0ce69c426d213b4ad3e07ba6da934555a6ec36a7edcb3050b2951b1b4a4ca445 docker-runc
root      93186  25181  0 23:39 pts/229  00:00:00 grep --color=auto 0ce69c426d21
  • We decided to kill -9 the shim processes and restart the daemon. This unlocked the container and it could be restarted
  • Some other (potentially) useful info:

– Output of strace

root@docker-linux-1-dh:~# strace -p 1030
strace: Process 1030 attached
futex(0x7abf70, FUTEX_WAIT, 0, NULL^Cstrace: Process 1030 detached
 <detached ...>
root@docker-linux-1-dh:~# strace -p 4250
strace: Process 4250 attached
futex(0x7abf70, FUTEX_WAIT, 0, NULL^Cstrace: Process 4250 detached
 <detached ...>
root@docker-linux-1-dh:~# strace -p 30069
strace: Process 30069 attached
futex(0x11604d0, FUTEX_WAIT, 0, NULL^Cstrace: Process 30069 detached
 <detached ...>
root@docker-linux-1-dh:~# strace -p 63381
strace: Process 63381 attached
futex(0x7abf70, FUTEX_WAIT, 0, NULL^Cstrace: Process 63381 detached
 <detached ...>

– Output of top

root@docker-linux-1-dh:~# top

top - 00:04:54 up 20 days,  8:51, 13 users,  load average: 98.12, 97.76, 84.41
Tasks: 6674 total,  16 running, 5690 sleeping,   4 stopped, 964 zombie
%Cpu(s): 41.7 us, 12.0 sy,  0.0 ni, 40.2 id,  3.5 wa,  0.0 hi,  2.7 si,  0.0 st
KiB Mem : 10073190+total, 11610988 free, 79294969+used, 20275830+buff/cache
KiB Swap: 18749896+total, 18439233+free, 31066464 used. 20504608+avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                 
 60084 root      20   0 12.384g 8.853g  70512 S 608.6  0.9   1814:56 java                                                                                    
 17747 root      20   0 75.994g 0.012t 0.996g S 356.8  1.3  36915:24 java                                                                                    
 54170 ubuntu    20   0 24.525g 1.339g   7972 S 198.8  0.1 139055:24 java                                                                                    
 47273 1100      20   0 59.725g 3.919g   9984 S 149.4  0.4   3236:15 java                                                                                    
  3754 root      20   0 38.225g 0.022t 306736 S 128.4  2.3 267:14.08 java                                                                                    
 44864 root      20   0 9869424 1.039g   4908 S 103.7  0.1  28554:27 java                                                                                    
  7981 root      20   0 4404252 421344   3924 S  98.8  0.0   8878:45 prometheus-node                                                                         
 57728 root      20   0   49508  10664   3156 R  98.8  0.0   0:06.39 top                                                                                     
 88137 991       20   0 92.637g 4.614g 116568 S  97.5  0.5  54493:43 java                                                                                    
 75603 root      20   0 59.463g 0.011t  22556 S  87.7  1.2 108:26.28 java                                                                                    
 20143 root      20   0 23.169g 841940  10904 S  77.8  0.1  54:06.26 java                                                                                    
 75431 www-data  20   0  438664  63096  47672 S  72.8  0.0   0:04.31 apache2                                                                                 
 86880 root      20   0 59.582g 0.013t  14844 S  54.3  1.4  55:37.37 java                                                                                    
 77324 root      20   0 50.901g 0.014t  11572 S  51.9  1.5  46:58.80 java                                                                                    
 65663 root      20   0 7039944  28180  15064 S  49.4  0.0   0:00.40 java                                                                                    
 16548 root      20   0 24.247g 1.575g   6528 S  39.5  0.2  12267:38 java                                                                                    
 63027 telegraf  20   0 37.089g 192008   3560 S  38.3  0.0   4381:27 beam.smp                                                                                
  6183 root      20   0 23.548g 1.310g   5384 S  24.7  0.1   7467:03 java                                                                                    
 98831 root      20   0 51.066g 8.217g  23260 S  22.2  0.9  18:16.70 java                                                                                    
 56114 _apt      20   0 80.593g 0.021t 626812 S  21.0  2.3   6070:12 java                                                                                    
 83182 root      20   0 59.552g 5.003g  12796 S  21.0  0.5  22:17.87 java                                                                                    
 93232 root      20   0 51.076g 0.013t  14684 S  21.0  1.4  33:26.77 java                                                                                    
 57953 root      20   0 55.767g 279596  27384 S  19.8  0.0   3:38.82 dockerd                                                                                 
 92413 root      20   0 51.083g 0.014t  14504 S  17.3  1.5  33:00.10 java                                                                                    
 26343 root      20   0 14.911g 706740   5080 S  16.0  0.1 494:17.90 java                                                                                    
 76737 root      20   0 7045688 177884  24416 S  16.0  0.0   0:20.76 java                                                                                    
 89283 root      20   0 8727428 171168   3576 S  16.0  0.0   3751:20 beam.smp                                                                                
128453 root      20   0 59.476g 4.092g  18612 S  14.8  0.4   8:42.28 java                                                                                    
 50447 www-data  20   0  216480  38368  27216 S  11.1  0.0   0:02.17 php-fpm

– Output of docker-runc exec

root@docker-linux-1-dh:~/tmp/CENTRAL-7165# docker-runc exec --cwd / -e PATH=/bin 0ce69c426d213b4ad3e07ba6da934555a6ec36a7edcb3050b2951b1b4a4ca445 ls
nsenter: failed to open ipc: No such file or directory
exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 15\""

– (Truncated) output of `docker run hello-world`

root@docker-linux-1-dh:/opt/scripts# docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.
(...)

– Other containers could be restarted normally

root@docker-linux-1-dh:/opt/scripts# docker run --name test1 -d -p 80 nginx:alpine
8b1d528a8cd6a12ccfe5dab532f91f14dc3adb1c3d0ad21c1a9416e900369d52
root@docker-linux-1-dh:/opt/scripts# docker restart 8b1d528a8cd6a12ccf
8b1d528a8cd6a12ccf

– Host had almost 1k zombie processes (seems to be unrelated to this issue and aligned with my comment in https://github.com/moby/moby/issues/31007)

root@docker-linux-1-dh:/opt/scripts# ps -eo uid,pid,ppid,state,wchan:32,cmd | awk '$4 ~ "Z" {print $5}' | sort | uniq -c
    964 exit
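
The count above groups zombies by wait channel. To see which parent processes are not reaping their children, the same data can be grouped by parent PID instead. This is a minimal sketch, assuming a procps-style `ps`; the helper name `zombies_by_parent` is made up for illustration and is not part of Docker or the original report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "pid ppid state comm" rows on stdin and prints
# "<zombie count> <parent pid>" lines, with the largest count first.
zombies_by_parent() {
  # $3 is the process state column; zombie states start with "Z".
  awk '$3 ~ /^Z/ { n[$2]++ } END { for (p in n) print n[p], p }' | sort -rn
}
```

Usage on a live host would be `ps -eo pid,ppid,state,comm | zombies_by_parent`. The header row is skipped naturally because its third column is not `Z`.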

More debugging info can be found at https://gist.github.com/thiagoalves/09c25222e2115fcc6a2d219c5f773a41 and in 2017-08-11-000850.tar.gz

Output of docker version:

root@docker-linux-1-dh:~# docker version
Client:
 Version:      17.03.2-ee-4
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   1e6d71e
 Built:        Fri May 19 20:27:23 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ee-4
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   1e6d71e
 Built:        Fri May 19 20:27:23 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

root@docker-linux-1-dh:~# docker info
Containers: 391
 Running: 376
 Paused: 0
 Stopped: 15
Images: 279
Server Version: 17.03.2-ee-4
Storage Driver: aufs
 Root Dir: /opt/io1/docker/aufs
 Backing Filesystem: extfs
 Dirs: 5401
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-83-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 64
Total Memory: 960.7 GiB
Name: docker-linux-1-dh
ID: UGZS:UFD3:GB4C:W5MX:JU2L:K7PH:6ZWS:4GPM:27Q5:UNNN:X3DC:YDT7
Docker Root Dir: /opt/io1/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 3615
 Goroutines: 2259
 System Time: 2017-08-11T21:33:20.219422409Z
 EventsListeners: 8
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true

Additional environment details (AWS, VirtualBox, physical, etc.): AWS EC2 - x1.16xlarge

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 7
  • Comments: 34 (16 by maintainers)

Most upvoted comments

This one is reproducible on many docker versions (17.03-ee, 17.06-ee, 17.07-ce) with different Linux distros (Ubuntu / CentOS) and different environments (AWS, VirtualBox).

It is actually pretty easy to reproduce it. Just create a Linux box with vagrant (4GB RAM + 4GB swap), install any docker version and execute the following commands:

for i in {1..300}; do docker run -d -it --restart=always --name poc_$i talves/health_poc; done
docker kill -s TERM $(docker ps -q)
docker ps

I was able to consistently reproduce the behavior:

  1. Run a lot of containers with health checks:
     for i in {1..300}; do docker run -d -it --restart=always --name poc_$i talves/health_poc; done
  2. Stop the containers:
     docker kill -s TERM $(docker ps -q)
  3. List the containers. I expected the list to be empty, but a few containers always remain:
     docker ps
  4. Try to run docker exec on the remaining containers:
     for c in $(docker ps -q); do docker exec $c ls; done

(if step 3 results in 0 containers, start and stop all containers again a few times)
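
The four steps above can be combined into a single script. This is only a sketch under stated assumptions: it assumes bash and a working `docker` CLI, it uses the `talves/health_poc` image named in the steps, and the function names `reproduce` and `count_setns_failures` are hypothetical. Nothing runs unless the script is called with `run`.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps; function names are illustrative only.

# Count the lines on stdin that carry the setns failure reported in this issue.
count_setns_failures() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c 'executing setns process caused' || true
}

reproduce() {
  local n="${1:-300}" i c
  # Step 1: start many containers with health checks.
  for i in $(seq 1 "$n"); do
    docker run -d -it --restart=always --name "poc_$i" talves/health_poc
  done
  # Step 2: send SIGTERM to every running container.
  docker kill -s TERM $(docker ps -q)
  # Step 3: list whatever is still running.
  docker ps
  # Step 4: try exec on the survivors and count the setns failures.
  for c in $(docker ps -q); do
    docker exec "$c" ls 2>&1
  done | count_setns_failures
}

# Only touch Docker when explicitly asked to: ./repro.sh run 300
if [ "${1:-}" = "run" ]; then
  reproduce "${2:-300}"
fi
```

If the final count is 0, repeating the start/stop cycle a few times (as noted above) may be needed before stuck containers appear.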

You will get an output like this one:

rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:262: starting container process caused "process_linux.go:81: executing setns process caused \"exit status 15\""

rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:262: starting container process caused "process_linux.go:81: executing setns process caused \"exit status 15\""

rpc error: code = 2 desc = containerd: container not found
rpc error: code = 2 desc = containerd: container not found
rpc error: code = 2 desc = containerd: container not found
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:262: starting container process caused "process_linux.go:81: executing setns process caused \"exit status 15\""

rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:262: starting container process caused "process_linux.go:81: executing setns process caused \"exit status 15\""

We don’t see this issue anymore after we moved from aufs to overlay2. It is also not reproducible in our test environment on overlay2.

@hernandanielg no need, the snippet you gave seems to indicate that you have at least 2 execs that haven’t returned (I’m assuming that the 3rd docker-containerd-shim also has a child).

Could you give me the output of docker version so I know which revision to look at? It looks like the exec shims are not calling waitpid and are not exiting either (which would have had init reap the defunct processes)

Thanks a lot. This confirms that it’s not a memory exhaustion issue! I’m going to try to find the best person to debug this on our side. Thanks once again! 👍🏻

I am facing the same issue when I try to exec a command in an unstoppable container

root@docker-linux-build-1-dh:~# docker exec -it prod_eng-product-jenkins_agent_6 ls
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 15\""

The process is in Zombie state

root@docker-linux-build-1-dh:~# docker top prod_eng-product-jenkins_agent_6                                                                                    
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD                
ubuntu              49204               19648               0                   Aug19               ?                   00:00:01            [node] <defunct>   

root@docker-linux-build-1-dh:~# ps -o ppid,state,cmd -p 49204                                                                                                  
  PPID S CMD                                                                                                                                                   
 19648 Z [node] <defunct>

This is the process stack

root@docker-linux-build-1-dh:~# cat /proc/49204/stack                                                                                                          
[<ffffffff810833f5>] do_exit+0x775/0xb00                                                                                                                       
[<ffffffff81083803>] do_group_exit+0x43/0xb0                                                                                                                   
[<ffffffff81083884>] SyS_exit_group+0x14/0x20                                                                                                                  
[<ffffffff818156b2>] entry_SYSCALL_64_fastpath+0x16/0x71                                                                                                       
[<ffffffffffffffff>] 0xffffffffffffffff

These are the parent and grandparent processes; the shim process is in the sleeping state

root@docker-linux-build-1-dh:~# ps -o ppid,state,cmd -p 19648
  PPID S CMD
 19588 S [my_init]

root@docker-linux-build-1-dh:~# ps -o ppid,state,cmd -p 19588
  PPID S CMD
     1 S docker-containerd-shim 29703a1f03e0372a480e4017051f155a7de1e422c1b90935655d39c246945037 /var/run/docker/libcontainerd/29703a1f03e0372a480e4017051f155a
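
The manual walk above (zombie process, then its parent, then the shim) can be done in one pass over a process snapshot. This is a minimal sketch, assuming a procps-style `ps` that accepts `-eo pid=,ppid=,state=,comm=`; the helper name `ancestry` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: $1 is the starting PID; stdin carries rows of
# "pid ppid state comm". Prints "pid state comm" for the starting
# process and each ancestor found in the snapshot.
ancestry() {
  awk -v pid="$1" '
    { ppid[$1] = $2; state[$1] = $3; name[$1] = $4 }
    END {
      while (pid in ppid) {
        print pid, state[pid], name[pid]
        if (pid == ppid[pid]) break   # guard against a self-parented row
        pid = ppid[pid]
      }
    }'
}
```

On a live host this would be called as `ps -eo pid=,ppid=,state=,comm= | ancestry 49204`. Working from a single snapshot avoids races between successive `ps` calls.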

@domano I advise that you move to overlay2 if possible. It fixes the problem

Guys, there is an easy way to get a stuck container:

  1. Deploy a container using docker-compose with a custom network / bridge

  2. When the container starts, use docker rm -f to remove the container (you should not remove it from the network)

  3. Check the container status: it will be stuck, and you cannot stop, remove, or kill it

  4. docker top will display 0 processes running inside the container