moby: Sometimes the device mapper is stuck in an infinite “mount/remount” loop

When this happens, the system is very unresponsive: a simple docker ps can take about a minute, and containers need even more time to start, if they start at all. Application logs and docker logs do not show any errors; the only curious low-level activity is found in the system logs:

Feb  6 23:58:04 ip-172-16-4-30 kernel: [723943.329608] XFS (dm-26): Unmounting Filesystem
Feb  6 23:58:24 ip-172-16-4-30 kernel: [723963.978634] XFS (dm-26): Mounting V4 Filesystem
Feb  6 23:59:22 ip-172-16-4-30 kernel: [724021.561906] XFS (dm-26): Ending clean mount
Feb  6 23:59:22 ip-172-16-4-30 kernel: [724021.564932] XFS (dm-267): Unmounting Filesystem
Feb  6 23:59:42 ip-172-16-4-30 kernel: [724041.461206] XFS (dm-111): Mounting V4 Filesystem
Feb  7 00:00:42 ip-172-16-4-30 kernel: [724101.702141] XFS (dm-111): Ending clean mount

This is not a rare condition: it affects about 3% of the systems we deploy. The loop iterates roughly once per second, it can last over a day, and there is no obvious way to resolve it, aside from ditching the affected systems and ordering brand new ones. (A service restart, a reboot, or removal of the docker library directory will not salvage the system.)
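A hedged way to confirm the loop from outside docker is to count the XFS mount/unmount kernel messages per dm device; the log path here is an assumption matching this Amazon Linux host (a systemd host would read the kernel ring buffer via journalctl -k instead):

# Count XFS mount/unmount kernel messages per dm device. A healthy host shows
# a handful per container start/stop; an affected host accumulates roughly one
# per second on the looping devices.
grep -E 'XFS \(dm-[0-9]+\): (Mounting|Unmounting)' /var/log/messages \
  | grep -oE 'dm-[0-9]+' | sort | uniq -c | sort -rn | head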

$ docker info
Containers: 3
Images: 17
Server Version: 1.9.1
Storage Driver: devicemapper
 Pool Name: docker-202:1-264750-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 107.4 GB
 Backing Filesystem: 
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 795.2 MB
 Data Space Total: 107.4 GB
 Data Space Available: 27.62 GB
 Metadata Space Used: 2.032 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.145 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.1.7-15.23.amzn1.x86_64
Operating System: Amazon Linux AMI 2015.09
CPUs: 2
Total Memory: 3.862 GiB
Name: ip-172-16-1-241
ID: MKMR:UVOD:VZ5M:AZWU:5CH7:YXWZ:UOI3:FY4R:YBZA:3MU4:AFG7:CH7U
Username: quintlyworker
Registry: https://index.docker.io/v1/
$ docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5/1.9.1
 Built:        
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5/1.9.1
 Built:        
 OS/Arch:      linux/amd64
$ uname -a
Linux ip-172-16-1-241 4.1.7-15.23.amzn1.x86_64 #1 SMP Mon Sep 14 23:20:33 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 8
  • Comments: 50 (14 by maintainers)

Most upvoted comments

Hi All, I can still see this issue with Docker 1.12.5 (running inside Kubernetes). Is there any way to fix it?

Comparing loop-based devicemapper with lvm-thin-pool-based devicemapper, the “stuck” state seems to happen only on nodes using the lvm-thin-pool-based devicemapper.
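For anyone trying to make the same comparison, docker info already exposes which mode a node is running; a minimal check (the grep patterns match the fields shown in the docker info outputs in this thread) could be:

# loop-lvm shows "Data file: /dev/loopN" and "Data loop file: ...", while a
# thin-pool (direct-lvm) setup shows an LVM pool name and no loop files.
docker info 2>/dev/null | grep -E 'Storage Driver|Pool Name|Data file|Data loop file'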

I have the exact same issue with Docker 1.12.6 on CentOS 7.3.

docker info

Containers: 1536
 Running: 61
 Paused: 0
 Stopped: 1475
Images: 29
Server Version: 1.12.6
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: 
 Metadata file: 
 Data Space Used: 42.91 GB
 Data Space Total: 522.3 GB
 Data Space Available: 479.3 GB
 Metadata Space Used: 21.15 MB
 Metadata Space Total: 553.6 MB
 Metadata Space Available: 532.5 MB
 Thin Pool Minimum Free Space: 52.23 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: null bridge host overlay
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp
Kernel Version: 3.10.0-514.6.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 8
Total Memory: 62.69 GiB
Name: ennew-gpu-148
ID: PHDR:KCER:EEDL:UUCH:WRDF:U7IB:B5WX:6VSG:LRYT:3NK5:VGRZ:N7MZ
Docker Root Dir: /data/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 10.19.140.200:30100
 127.0.0.0/8
Registries: docker.io (secure)

I was stress testing Kubernetes pod creation, which starts about 100*2 Docker containers on each node. On some nodes it would take a very long time for those containers to start running; most of them were stuck in the “Created” status. Eventually, after about 1 hour, the containers would start to run (I am not sure whether Docker recovered by itself or Kubernetes helped). While the containers were stuck in “Created”, the following messages were observed in the system log:

/var/log/messages

May 4 10:56:22 hostname dockerd-current: time="2017-05-04T10:56:22.401729002+08:00" level=info msg="{Action=create, LoginUID=4294967295, PID=21057}"
May 4 10:56:22 hostname kernel: XFS (dm-56): Mounting V5 Filesystem
May 4 10:56:23 hostname kernel: XFS (dm-56): Ending clean mount
May 4 10:56:23 hostname kernel: XFS (dm-56): Unmounting Filesystem
May 4 10:56:23 hostname kernel: XFS (dm-56): Mounting V5 Filesystem
May 4 10:56:23 hostname kernel: XFS (dm-56): Ending clean mount
May 4 10:56:23 hostname kernel: XFS (dm-56): Unmounting Filesystem
May 4 10:56:24 hostname kernel: XFS (dm-56): Mounting V5 Filesystem
May 4 10:56:24 hostname kernel: XFS (dm-56): Ending clean mount
May 4 10:56:24 hostname kernel: XFS (dm-56): Unmounting Filesystem
May 4 10:56:25 hostname kernel: XFS (dm-56): Mounting V5 Filesystem
May 4 10:56:25 hostname kernel: XFS (dm-56): Ending clean mount
May 4 10:56:25 hostname kernel: XFS (dm-56): Unmounting Filesystem
May 4 10:56:26 hostname dockerd-current: time="2017-05-04T10:56:26.159383572+08:00" level=info msg="{Action=create, LoginUID=4294967295, PID=21057}"
May 4 10:56:26 hostname kernel: XFS (dm-56): Mounting V5 Filesystem
May 4 10:56:26 hostname kernel: XFS (dm-56): Ending clean mount
May 4 10:56:26 hostname kernel: XFS (dm-56): Unmounting Filesystem
May 4 10:56:27 hostname kubelet: E0504 10:56:27.242944 21057 docker_manager.go:2324] container start failed: RunContainerError: runContainer: operation timeout: context deadline exceeded
May 4 10:56:27 hostname kubelet: E0504 10:56:27.243021 21057 pod_workers.go:184] Error syncing pod c009e7fe-306d-11e7-b303-9c7da38259a3, skipping: failed to "StartContainer" for "podstresstest-592" with RunContainerError: "runContainer: operation timeout: context deadline exceeded"
May 4 10:56:27 hostname kernel: XFS (dm-56): Mounting V5 Filesystem
May 4 10:56:27 hostname kernel: XFS (dm-56): Ending clean mount
May 4 10:56:27 hostname kernel: XFS (dm-56): Unmounting Filesystem
May 4 10:56:27 hostname dockerd-current: time="2017-05-04T10:56:27.773147145+08:00" level=info msg="{Action=create, LoginUID=4294967295, PID=21057}"
May 4 10:56:28 hostname kernel: XFS (dm-56): Mounting V5 Filesystem
May 4 10:56:28 hostname kernel: XFS (dm-56): Ending clean mount
May 4 10:56:28 hostname kernel: XFS (dm-56): Unmounting Filesystem
May 4 10:56:28 hostname kernel: XFS (dm-56): Mounting V5 Filesystem
May 4 10:56:28 hostname kernel: XFS (dm-56): Ending clean mount
May 4 10:56:28 hostname kernel: XFS (dm-56): Unmounting Filesystem

So, I would have liked to provide the output of docker inspect for all containers on the machine, but docker ps did not respond after more than one hour (!). First, let's take a look at top:

top - 09:41:34 up 1 day, 23:55,  2 users,  load average: 2.48, 2.27, 2.13
Tasks: 1331 total,   1 running, 1330 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  0.7%sy,  0.0%ni,  0.0%id, 98.7%wa,  0.0%hi,  0.0%si,  0.3%st
Mem:   2051636k total,  1951940k used,    99696k free,   134560k buffers
Swap:        0k total,        0k used,        0k free,   691432k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
11303 root      20   0     0    0    0 S  0.3  0.0   0:00.88 kworker/u30:2      
12316 root      28   8  441m  37m 7556 S  0.3  1.9   7:20.56 aws                
30367 root      20   0 37784  32m 7612 S  0.3  1.6  13:00.51 agent              
    1 root      20   0 19512 2480 2184 S  0.0  0.1   0:00.70 init               
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.06 kthreadd           
    3 root      20   0     0    0    0 S  0.0  0.0   0:23.00 ksoftirqd/0 

As you can see, the load average (2.41) is rather high (I saw peaks at 3.15) while no process seems to account for this activity, which is a sign that the activity takes place in the kernel. While investigating the system, my SSH connection dropped (SIGPIPE), which of course can have a variety of causes but could be a further symptom that some system resource is running short.
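One hedged way to corroborate that the load comes from tasks blocked in the kernel rather than from userland CPU usage is to look for processes in uninterruptible sleep (state D), which count towards the load average without consuming CPU; a minimal sketch:

# List tasks in uninterruptible sleep (state D) together with the kernel
# function they are waiting in; on a host stuck in mount/unmount work these
# typically show filesystem or device-mapper wait channels.
ps -eo state,pid,wchan,comm | awk '$1 == "D"'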

This is the last segment of this very strange activity:

[  +0.467019] XFS (dm-4): Unmounting Filesystem
[ +20.139963] XFS (dm-4): Mounting V4 Filesystem
[Apr 7 17:17] XFS (dm-4): Ending clean mount
[ +18.886583] XFS (dm-6): Mounting V4 Filesystem
[Apr 7 17:18] XFS (dm-6): Ending clean mount
[  +0.002568] XFS (dm-4): Unmounting Filesystem
[Apr 7 17:19] XFS (dm-4): Mounting V4 Filesystem
[Apr 7 17:20] XFS (dm-4): Ending clean mount
[  +0.379894] XFS (dm-4): Unmounting Filesystem
[ +19.722320] XFS (dm-4): Mounting V4 Filesystem
[Apr 7 17:21] XFS (dm-4): Ending clean mount
[ +19.441943] XFS (dm-8): Mounting V4 Filesystem
[Apr 7 17:22] XFS (dm-8): Ending clean mount
[  +0.003967] XFS (dm-6): Unmounting Filesystem
[ +19.935995] XFS (dm-6): Mounting V4 Filesystem
[Apr 7 17:23] XFS (dm-6): Ending clean mount
[  +0.157200] XFS (dm-6): Unmounting Filesystem
[  +1.949623] XFS (dm-4): Unmounting Filesystem
[Apr 7 17:24] XFS (dm-4): Mounting V4 Filesystem
[Apr 7 17:25] XFS (dm-4): Ending clean mount
[  +0.548252] XFS (dm-4): Unmounting Filesystem
[  +1.303811] XFS (dm-8): Unmounting Filesystem
[  +0.004251] veth64ca5be: renamed from eth0
[  +0.002877] docker0: port 3(vethd4c3f56) entered disabled state
[  +0.027881] docker0: port 3(vethd4c3f56) entered disabled state
[  +0.007145] device vethd4c3f56 left promiscuous mode
[  +0.005787] docker0: port 3(vethd4c3f56) entered disabled state
[ +17.959230] XFS (dm-4): Mounting V4 Filesystem
[Apr 7 17:26] XFS (dm-4): Ending clean mount
[  +0.497421] XFS (dm-4): Unmounting Filesystem
[ +19.412757] XFS (dm-4): Mounting V4 Filesystem
[Apr 7 17:27] XFS (dm-4): Ending clean mount
[ +17.735705] XFS (dm-6): Mounting V4 Filesystem
[Apr 7 17:28] XFS (dm-6): Ending clean mount
[Apr 7 17:29] XFS (dm-8): Mounting V4 Filesystem
[Apr 7 17:30] XFS (dm-8): Ending clean mount
[  +0.008627] XFS (dm-5): Unmounting Filesystem
[ +19.657035] XFS (dm-5): Mounting V4 Filesystem
[Apr 7 17:31] XFS (dm-5): Ending clean mount
[ +17.320923] XFS (dm-9): Mounting V4 Filesystem
[Apr 7 17:32] XFS (dm-9): Ending clean mount
[  +0.002930] XFS (dm-6): Unmounting Filesystem
[ +19.607763] XFS (dm-6): Mounting V4 Filesystem
[Apr 7 17:33] XFS (dm-6): Ending clean mount
[  +0.219859] XFS (dm-6): Unmounting Filesystem
[Apr 7 17:34] XFS (dm-6): Mounting V4 Filesystem
[Apr 7 17:35] XFS (dm-6): Ending clean mount
[ +16.683462] XFS (dm-10): Mounting V4 Filesystem
[Apr 7 17:36] XFS (dm-10): Ending clean mount
[  +0.003253] XFS (dm-5): Unmounting Filesystem
[ +16.864628] XFS (dm-5): Mounting V4 Filesystem
[Apr 7 17:37] XFS (dm-5): Ending clean mount
[  +0.200108] XFS (dm-5): Unmounting Filesystem
[  +1.788298] XFS (dm-9): Unmounting Filesystem
[ +19.211344] XFS (dm-5): Mounting V4 Filesystem
[Apr 7 17:38] XFS (dm-5): Ending clean mount
[  +0.449888] XFS (dm-5): Unmounting Filesystem
[  +2.113912] XFS (dm-4): Unmounting Filesystem
[Apr 7 17:39] XFS (dm-4): Mounting V4 Filesystem
[Apr 7 17:40] XFS (dm-4): Ending clean mount
[  +0.283606] XFS (dm-4): Unmounting Filesystem
[  +1.851654] XFS (dm-6): Unmounting Filesystem
[ +18.579646] XFS (dm-4): Mounting V4 Filesystem
[Apr 7 17:41] XFS (dm-4): Ending clean mount
[  +0.279984] XFS (dm-4): Unmounting Filesystem
[ +19.185764] XFS (dm-4): Mounting V4 Filesystem
[Apr 7 17:42] XFS (dm-4): Ending clean mount
[Apr 7 17:43] XFS (dm-5): Mounting V4 Filesystem
[Apr 7 17:44] XFS (dm-5): Ending clean mount
[  +0.005070] XFS (dm-8): Unmounting Filesystem
[ +18.203188] XFS (dm-6): Mounting V4 Filesystem
[Apr 7 17:45] XFS (dm-6): Ending clean mount
[  +0.833799] XFS (dm-6): Unmounting Filesystem
[ +21.465551] XFS (dm-6): Mounting V4 Filesystem
[Apr 7 17:46] XFS (dm-6): Ending clean mount
[  +0.007852] XFS (dm-10): Unmounting Filesystem
[ +19.628602] XFS (dm-8): Mounting V4 Filesystem
[Apr 7 17:47] XFS (dm-8): Ending clean mount
[  +0.091196] XFS (dm-8): Unmounting Filesystem
[Apr 7 17:48] XFS (dm-8): Mounting V4 Filesystem
[Apr 7 17:49] XFS (dm-8): Ending clean mount
[ +16.641921] XFS (dm-9): Mounting V4 Filesystem
[Apr 7 17:50] XFS (dm-9): Ending clean mount
[ +19.338014] XFS (dm-10): Mounting V4 Filesystem
[Apr 7 17:51] XFS (dm-10): Ending clean mount
[  +0.003331] XFS (dm-5): Unmounting Filesystem
[ +19.264335] XFS (dm-5): Mounting V4 Filesystem
[Apr 7 17:52] XFS (dm-5): Ending clean mount
[  +0.329826] XFS (dm-5): Unmounting Filesystem
[Apr 7 17:53] XFS (dm-5): Mounting V4 Filesystem
[Apr 7 17:54] XFS (dm-5): Ending clean mount
[ +20.266164] XFS (dm-11): Mounting V4 Filesystem
[Apr 7 17:55] XFS (dm-11): Ending clean mount
[ +17.799925] XFS (dm-12): Mounting V4 Filesystem
[Apr 7 17:56] XFS (dm-12): Ending clean mount
[  +0.004854] XFS (dm-4): Unmounting Filesystem
[ +17.559186] XFS (dm-4): Mounting V4 Filesystem
[Apr 7 17:57] XFS (dm-4): Ending clean mount
[  +0.419773] XFS (dm-4): Unmounting Filesystem
[Apr 7 17:58] XFS (dm-4): Mounting V4 Filesystem
[Apr 7 17:59] XFS (dm-4): Ending clean mount
[ +17.538426] XFS (dm-13): Mounting V4 Filesystem
[Apr 7 18:00] XFS (dm-13): Ending clean mount
[ +18.712564] XFS (dm-14): Mounting V4 Filesystem
[Apr 7 18:01] XFS (dm-14): Ending clean mount
[  +0.002531] XFS (dm-10): Unmounting Filesystem
[Apr 7 18:02] XFS (dm-10): Mounting V4 Filesystem
[ +59.047333] XFS (dm-10): Ending clean mount
[Apr 7 18:03] XFS (dm-10): Unmounting Filesystem
[ +19.334478] XFS (dm-10): Mounting V4 Filesystem

After about 1200 similar messages comes this:

[Apr 8 02:19] XFS (dm-35): Mounting V4 Filesystem
[ +52.041289] XFS (dm-35): Ending clean mount
[Apr 8 02:20] device-mapper: thin: Data device (dm-1) discard unsupported: Disabling discard passdown.
[  +0.006215] device-mapper: thin: 253:2: growing the data device from 18000 to 21600 blocks
[  +4.714625] XFS (dm-92): Mounting V4 Filesystem
[Apr 8 02:21] XFS (dm-92): Ending clean mount

Then come about 1000 similar lines, repeating the Mounting V4 Filesystem / Ending clean mount / Unmounting Filesystem trilogy, occasionally peppered with device-mapper messages like the ones above. As you can see, a mount operation can take as much as 55 seconds!

Skimming through the docker logs for that period is tedious because of the sheer volume of information:

% awk '/2016-04-07T17/,/2016-04-07T17/ {c[$2]+=1} END {for(i in c){print(i, c[i]);}}' docker-obliterated.log 
level=info 148
level=debug 12703
level=error 28

All the errors look like this:

% awk '/2016-04-07T17/,/2016-04-07T17/ { if($2 == "level=error") { print } }' docker-obliterated.log      
time="2016-04-07T17:02:50.621660570Z" level=error msg="Handler for GET /v1.17/containers/a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619/json returned error: no such id: a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619" 
time="2016-04-07T17:02:50.621694542Z" level=error msg="HTTP Error" err="no such id: a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619" statusCode=404 
…

and the complete docker logs around the first error mentioned above are

% grep -F 'time="2016-04-07T17:02:50.621660570Z"' -C 10 docker-obliterated.log
time="2016-04-07T17:02:49.108050685Z" level=debug msg="[devmapper] deactivateDevice END(a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619)" 
time="2016-04-07T17:02:49.791019449Z" level=debug msg="unregisterDevice(87, a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619)" 
time="2016-04-07T17:02:49.991149057Z" level=debug msg="devmapper: DeleteDevice(hash=a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619 syncDelete=false) END" 
time="2016-04-07T17:02:49.991260464Z" level=debug msg="devmapper: DeleteDevice(hash=a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619-init syncDelete=false) START" 
time="2016-04-07T17:02:49.991278921Z" level=debug msg="[devmapper] deactivateDevice(a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619-init)" 
time="2016-04-07T17:02:49.991328694Z" level=debug msg="[devmapper] deactivateDevice END(a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619-init)" 
time="2016-04-07T17:02:50.481024707Z" level=debug msg="unregisterDevice(86, a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619-init)" 
time="2016-04-07T17:02:50.617337446Z" level=debug msg="devmapper: DeleteDevice(hash=a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619-init syncDelete=false) END" 
time="2016-04-07T17:02:50.621477347Z" level=debug msg="Calling GET /v1.17/containers/a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619/json" 
time="2016-04-07T17:02:50.621503613Z" level=info msg="GET /v1.17/containers/a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619/json" 
time="2016-04-07T17:02:50.621660570Z" level=error msg="Handler for GET /v1.17/containers/a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619/json returned error: no such id: a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619" 
time="2016-04-07T17:02:50.621694542Z" level=error msg="HTTP Error" err="no such id: a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619" statusCode=404 
time="2016-04-07T17:02:50.623246522Z" level=debug msg="Calling GET /v1.17/containers/a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619/json" 
time="2016-04-07T17:02:50.623268575Z" level=info msg="GET /v1.17/containers/a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619/json" 
time="2016-04-07T17:02:50.623353372Z" level=error msg="Handler for GET /v1.17/containers/a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619/json returned error: no such id: a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619" 
time="2016-04-07T17:02:50.623374429Z" level=error msg="HTTP Error" err="no such id: a0614fd0be0d8e203dde77ab4a26e95d4cb6785e4b1ca067983052da17299619" statusCode=404 
time="2016-04-07T17:02:51.382312402Z" level=debug msg="Calling DELETE /v1.17/containers/72408de76334937793373d65746f5839130bf6bd58704b78f1b683816b9c6c20" 
time="2016-04-07T17:02:51.382362215Z" level=info msg="DELETE /v1.17/containers/72408de76334937793373d65746f5839130bf6bd58704b78f1b683816b9c6c20?v=1" 
time="2016-04-07T17:02:51.397667596Z" level=debug msg="devmapper: DeleteDevice(hash=72408de76334937793373d65746f5839130bf6bd58704b78f1b683816b9c6c20 syncDelete=false) START" 
time="2016-04-07T17:02:51.397690393Z" level=debug msg="[devmapper] deactivateDevice(72408de76334937793373d65746f5839130bf6bd58704b78f1b683816b9c6c20)" 
time="2016-04-07T17:02:51.397739315Z" level=debug msg="[devmapper] deactivateDevice END(72408de76334937793373d65746f5839130bf6bd58704b78f1b683816b9c6c20)" 

On the system, two services interact with docker: an ecs-agent (AWS) and a custom monitoring program polling docker ps and docker inspect for the given containers at a regular interval (5 minutes IIRC).

Attachments: messages.log.gz, docker-obliterated.log.gz

For obvious reasons, I obliterated the Cmd and Env fields from the docker logs with the following sed script:

s/\\"Env\\":\[[^]]*\]/\\"Env\\":["XXX=XXX"]/g
s/\\"Cmd\\":\[[^]]*\]/\\"Cmd\\":["XXX"]/g

If some of this information could be relevant, please let me know.

@joaocc Thank you for your input; unfortunately this is not yet an option for us.

We now have several systems gathering extra debugging information; given the rate at which the problem shows up on these, we should be able to provide this information within a week or so.

On some systems (not gathering extra debugging information) we see that some docker containers never leave the Created state. My guess is that the Created state is entered once the file system for the container has been prepared, and that the state should move to Starting right after that. But somehow this does not happen.

An additional interesting phenomenon on the affected systems is that top reports a really heavy workload, yet none of the processes running on the system seems directly responsible for it. This usually hints at heavy kernel activity, which would likely correspond to the mount/unmount loop reported in my original description of the problem.

Is it possible that some system resource used by devicemapper is completely consumed and not reclaimed early enough, so that the system ends up in this apparent deadlock? The question is prompted by the options we are using (a quick check of the thin pool's actual usage is sketched after the list):

--storage-driver devicemapper
--storage-opt dm.fs=xfs
--storage-opt dm.thinpooldev=/dev/mapper/docker-docker--pool
--storage-opt dm.use_deferred_removal=true
--storage-opt dm.use_deferred_deletion=true
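One hedged way to check whether the thin pool itself is the exhausted resource is to ask device-mapper for its usage counters directly; dmsetup status on a thin-pool target reports used/total metadata and data blocks. The pool name below simply mirrors the dm.thinpooldev option above:

# Thin-pool status line format:
#   <start> <length> thin-pool <transaction id>
#   <used metadata blocks>/<total metadata blocks>
#   <used data blocks>/<total data blocks> ...
dmsetup status docker-docker--pool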

These are the logs (no debug information available) around 2016-02-19T13:39:49.762851757Z, the creation time of the first container that was Created but never Started.

time="2016-02-19T13:39:40.487292428Z" level=info msg="GET /v1.21/containers/070218d69143/json" 
time="2016-02-19T13:39:48.642697524Z" level=info msg="POST /v1.17/images/create?fromImage=quintly%2FXXXXXXXXXX" 
time="2016-02-19T13:39:49.746898919Z" level=info msg="POST /v1.17/containers/create?name=ecs-XXXXXXXXXr-989c98fdd7bcea9a7600" 
time="2016-02-19T13:40:40.346575909Z" level=info msg="GET /v1.21/containers/json?all=1" 
time="2016-02-19T13:40:40.520502438Z" level=info msg="GET /v1.21/containers/0da4cb2cb36f/json" 
time="2016-02-19T13:41:16.264203431Z" level=error msg="Handler for GET /v1.21/containers/0da4cb2cb36f/json returned error: Unknown device 0da4cb2cb36fc35e74fe75dd82525b78c9c42801d582426ad66a8f8fc7117172" 
time="2016-02-19T13:41:16.264265246Z" level=error msg="HTTP Error" err="Unknown device 0da4cb2cb36fc35e74fe75dd82525b78c9c42801d582426ad66a8f8fc7117172" statusCode=500 
time="2016-02-19T13:41:16.264576864Z" level=info msg="GET /v1.21/images/0da4cb2cb36f/json" 
time="2016-02-19T13:41:16.264746648Z" level=error msg="Handler for GET /v1.21/images/0da4cb2cb36f/json returned error: No such image: 0da4cb2cb36f" 
time="2016-02-19T13:41:16.264768118Z" level=error msg="HTTP Error" err="No such image: 0da4cb2cb36f" statusCode=404 
time="2016-02-19T13:41:16.265048168Z" level=info msg="GET /v1.21/containers/b1d78171ab85/json" 

Inspecting the oldest of these containers gives:

# docker inspect 0da4cb2cb36f
[
{
    "Id": "0da4cb2cb36fc35e74fe75dd82525b78c9c42801d582426ad66a8f8fc7117172",
    "Created": "2016-02-19T13:39:49.762851757Z",
    "Path": "/bin/sh",
    "Args": [
        "-c",
        "PLOP-PLOP FIZZ-FIZZ"
    ],
    "State": {
        "Status": "created",
        "Running": false,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 0,
        "ExitCode": 0,
        "Error": "",
        "StartedAt": "0001-01-01T00:00:00Z",
        "FinishedAt": "0001-01-01T00:00:00Z"
    },
    "Image": "fbcbd36743c13400a88bdf515650a290115d801c2bf720c7a734d9c7217ba714",
    "ResolvConfPath": "",
    "HostnamePath": "",
    "HostsPath": "",
    "LogPath": "",
    "Name": "/ecs-PLOP-PLOP-FIZZ-FIZZ-989c98fdd7bcea9a7600",
    "RestartCount": 0,
    "Driver": "devicemapper",
    "ExecDriver": "native-0.2",
    "MountLabel": "",
    "ProcessLabel": "",
    "AppArmorProfile": "",
    "ExecIDs": null,
    "HostConfig": {
        "Binds": [
            "/worker-logs:/quintly/logs"
        ],
        "ContainerIDFile": "",
        "LxcConf": null,
        "Memory": 524288000,
        "MemoryReservation": 0,
        "MemorySwap": 1048576000,
        "KernelMemory": 0,
        "CpuShares": 256,
        "CpuPeriod": 0,
        "CpusetCpus": "",
        "CpusetMems": "",
        "CpuQuota": 0,
        "BlkioWeight": 0,
        "OomKillDisable": false,
        "MemorySwappiness": null,
        "Privileged": false,
        "PortBindings": null,
        "Links": null,
        "PublishAllPorts": false,
        "Dns": [],
        "DnsOptions": [],
        "DnsSearch": [],
        "ExtraHosts": null,
        "VolumesFrom": null,
        "Devices": null,
        "NetworkMode": "default",
        "IpcMode": "",
        "PidMode": "",
        "UTSMode": "",
        "CapAdd": null,
        "CapDrop": null,
        "GroupAdd": null,
        "RestartPolicy": {
            "Name": "",
            "MaximumRetryCount": 0
        },
        "SecurityOpt": null,
        "ReadonlyRootfs": false,
        "Ulimits": null,
        "LogConfig": {
            "Type": "json-file",
            "Config": {}
        },
        "CgroupParent": "",
        "ConsoleSize": [
            0,
            0
        ],
        "VolumeDriver": ""
    },
    "GraphDriver": {
        "Name": "devicemapper",
        "Data": {
            "DeviceId": "194",
            "DeviceName": "docker-202:1-263198-0da4cb2cb36fc35e74fe75dd82525b78c9c42801d582426ad66a8f8fc7117172",
            "DeviceSize": "107374182400"
        }
    },
    "Mounts": [
        {
            "Source": "/worker-logs",
            "Destination": "/quintly/logs",
            "Mode": "",
            "RW": true
        }
    ],
    "Config": {
        "Hostname": "0da4cb2cb36f",
        "Domainname": "",
        "User": "",
        "AttachStdin": false,
        "AttachStdout": false,
        "AttachStderr": false,
        "ExposedPorts": {
            "443/tcp": {},
            "80/tcp": {}
        },
        "Tty": false,
        "OpenStdin": false,
        "StdinOnce": false,
        "Env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
        ],
        "Cmd": [
            "/bin/sh",
            "-c",
            "PLOP-PLOP FIZZ-FIZZ"
        ],
        "Image": "PLOP-PLOP-FIZZ-FIZZ",
        "Volumes": null,
        "WorkingDir": "",
        "Entrypoint": null,
        "OnBuild": null,
    },
    "NetworkSettings": {
        "Bridge": "",
        "SandboxID": "",
        "HairpinMode": false,
        "LinkLocalIPv6Address": "",
        "LinkLocalIPv6PrefixLen": 0,
        "Ports": null,
        "SandboxKey": "",
        "SecondaryIPAddresses": null,
        "SecondaryIPv6Addresses": null,
        "EndpointID": "",
        "Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "IPAddress": "",
        "IPPrefixLen": 0,
        "IPv6Gateway": "",
        "MacAddress": "",
        "Networks": null
    }
}
]

I stripped out the actual command and the labels used internally by the ecs-agent. The ids given can be tracked in the logs.

The logs were captured on an already affected system, after restarting the docker daemon with the --debug=true flag. Part of the observable activity might be ascribed to the broken state the daemon is in even after restarting, but no detailed log is available for the precise time when the trouble started.

The problems seem to have started two days ago, as our ecs-agent (the piece of software responsible for creating containers) repeatedly tries to instantiate new containers, at a rate of about 10 per hour. All these containers are listed in the output of docker ps -a with the status Created, not Exited(?). A quick way to enumerate them is sketched below.
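A minimal sketch for enumerating the containers left behind in this state, assuming a client whose status filter accepts the value created (newer CLIs do):

# List only containers stuck in the Created state.
docker ps -a --filter status=created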

Here are the logs of the latest restart, using --debug=true. My comments follow in a separate entry.

docker.txt (849kB)