moby: Sometimes the device mapper is stuck in an infinite “mount/remount” loop
When this happens, the system is very unresponsive: a simple `docker ps` can take about a minute, and containers need even more time to start, if they start at all. Application logs and docker logs show no errors; the only curious low-level activity appears in the system logs:
Feb 6 23:58:04 ip-172-16-4-30 kernel: [723943.329608] XFS (dm-26): Unmounting Filesystem
Feb 6 23:58:24 ip-172-16-4-30 kernel: [723963.978634] XFS (dm-26): Mounting V4 Filesystem
Feb 6 23:59:22 ip-172-16-4-30 kernel: [724021.561906] XFS (dm-26): Ending clean mount
Feb 6 23:59:22 ip-172-16-4-30 kernel: [724021.564932] XFS (dm-267): Unmounting Filesystem
Feb 6 23:59:42 ip-172-16-4-30 kernel: [724041.461206] XFS (dm-111): Mounting V4 Filesystem
Feb 7 00:00:42 ip-172-16-4-30 kernel: [724101.702141] XFS (dm-111): Ending clean mount
This is not a rare condition: it affects about 3% of the systems we deploy. There is roughly one iteration per second. It can last over a day, and there is no obvious way to resolve it aside from ditching the affected systems and ordering brand new ones. (A service restart, a reboot, or removing the docker library directory will not salvage the system.)
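For anyone checking whether a system is in this state, a minimal sketch (assuming a syslog-style /var/log/messages, as on Amazon Linux) is to bucket the XFS mount/unmount kernel messages by minute; a steady stream of roughly sixty matches per minute corresponds to the loop described above:

```
# Count XFS mount/unmount kernel messages per minute; in the broken state
# there is roughly one iteration per second, i.e. ~60 matches per minute.
grep -E 'XFS \(dm-[0-9]+\): (Mounting|Unmounting)' /var/log/messages \
  | awk '{print $1, $2, substr($3, 1, 5)}' \
  | uniq -c | tail -20
```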
$ docker info
Containers: 3
Images: 17
Server Version: 1.9.1
Storage Driver: devicemapper
Pool Name: docker-202:1-264750-pool
Pool Blocksize: 65.54 kB
Base Device Size: 107.4 GB
Backing Filesystem:
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 795.2 MB
Data Space Total: 107.4 GB
Data Space Available: 27.62 GB
Metadata Space Used: 2.032 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.145 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.1.7-15.23.amzn1.x86_64
Operating System: Amazon Linux AMI 2015.09
CPUs: 2
Total Memory: 3.862 GiB
Name: ip-172-16-1-241
ID: MKMR:UVOD:VZ5M:AZWU:5CH7:YXWZ:UOI3:FY4R:YBZA:3MU4:AFG7:CH7U
Username: quintlyworker
Registry: https://index.docker.io/v1/
$ docker version
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5/1.9.1
Built:
OS/Arch: linux/amd64
Server:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5/1.9.1
Built:
OS/Arch: linux/amd64
$ uname -a
Linux ip-172-16-1-241 4.1.7-15.23.amzn1.x86_64 #1 SMP Mon Sep 14 23:20:33 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 8
- Comments: 50 (14 by maintainers)
Commits related to this issue
- Add devicemapper issue (docker/docker#20401): Sometimes the device mapper is stuck in an infinite “mount/remount” loop — committed to AkihiroSuda/issues-docker by AkihiroSuda 8 years ago
- Still effing around with https://github.com/docker/docker/issues/20401 — committed to indigobio/empire_ami by gswallow 7 years ago
Hi All, I can still see this issue with Docker 1.12.5 (running inside Kubernetes). Is there any way to fix it?
Comparing loop-based devicemapper with lvm-thin-pool-based devicemapper, it seems that the hang happened only on nodes using the lvm-thin-pool backend.
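For reference, the two configurations can be told apart from fields already visible in the `docker info` output above (a quick sketch):

```
# loop-lvm exposes loopback files; an LVM thin pool does not.
docker info 2>/dev/null | grep -E 'Pool Name|Data file|Data loop file'
# "Data file: /dev/loop0"              => loop-lvm (as in the original report)
# a volume-group pool name, no loops   => lvm-thin-pool based devicemapper
```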
I have the exact same issue with Docker 1.12.6 on CentOS 7.3.
I was stress-testing Kubernetes pod creation, which would start about 100*2 Docker containers on each node. On some nodes it took a very long time for those containers to start running; most of them were stuck in the “Created” status. Eventually, after about one hour, the containers would start to run. (I am not sure whether Docker recovered by itself or Kubernetes helped.) While the containers were stuck in “Created”, the following logs were observed in the system log:
So, I would have liked to provide the output of `docker inspect` for all containers on the machine, but `docker ps` did not respond after more than one hour (!). First, let's take a look at top: as you can see, the load average (2.41) is rather high (I saw peaks at 3.15) while no process seems to cause this activity, which is a sign that the activity takes place in the kernel. While investigating the system, my SSH connection dropped (SIGPIPE), which can of course have a variety of causes but could be a further symptom that some system resource is running short.
This is the last segment of the very strange activity:

After about 1200 similar messages comes the message:
Then come about 1000 similar lines with the “Ending clean mount” / “Unmounting Filesystem” / “Mounting V4 Filesystem” trilogy, occasionally peppered by device-mapper messages as above. As you can see, a single mount operation can take as much as 55 seconds!
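To put a number on those durations, one can pair each “Mounting” message with the following “Ending clean mount” for the same dm device and subtract the kernel timestamps (a sketch; dm-111 is the device from the excerpt in the original report):

```
# Kernel timestamps ([724041.461206]) are seconds since boot, so the duration
# of a mount is the difference between the paired Mounting/Ending lines.
grep 'XFS (dm-111)' /var/log/messages | grep -E 'Mounting|Ending' \
  | sed -E 's/.*\[([0-9]+\.[0-9]+)\].*(Mounting|Ending).*/\1 \2/' \
  | awk '/Mounting/ {start=$1} /Ending/ && start {printf "%.0f s\n", $1 - start; start=0}'
```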
Skimming through the logs from that period is tedious because of the sheer volume of information:
All the errors look like this:

The complete docker logs around the first error mentioned above are:
On the system, two services interact with docker: the AWS ecs-agent and a custom monitoring program polling `docker ps` and `docker inspect` for the given containers at a regular interval (5 minutes IIRC).

Attachments: messages.log.gz, docker-obliterated.log.gz
For obvious reasons, I obliterated the `Cmd` and `Env` fields from the docker logs with the following sed script (a reconstruction is sketched below). If some of this information could be relevant, please let me know.
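(The script itself is not preserved in this extract. As an illustration only, a minimal reconstruction that scrubs those two fields from JSON-formatted log lines could look like this; the file names are placeholders:)

```
# Hypothetical reconstruction, not the reporter's actual script: blank out the
# values of the "Cmd" and "Env" arrays before sharing the logs.
sed -E 's/"(Cmd|Env)":\[[^]]*\]/"\1":["<obliterated>"]/g' docker.log > docker-obliterated.log
```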
@joaocc Thank you for your input; unfortunately, this is not yet an option for us.
We now have several systems gathering extra debugging information; given the rate at which the problem shows up on these, we should be able to provide that information within a week or so.
On some systems (not gathering extra debugging information) we see that some docker containers never leave the state Created. I guess the state Created is entered once the file system for the container has been prepared, and the state should move to Starting right after that. But somehow this does not happen.

An additional interesting phenomenon on the affected systems is that `top` reports a really heavy workload while none of the processes running on the system seems directly responsible for it. This usually hints at heavy kernel activity, which would likely correspond to the mounting/unmounting loop reported in my original description of the problem.

Is it possible that some system resource used by the devicemapper subsystem is completely consumed and not reclaimed early enough, so that the system ends up in this apparent deadlock? The question is prompted by these options we are using:
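(The actual option values are not preserved in this extract. For illustration only, devicemapper daemon options of this family look like the following on Docker 1.9, with placeholder values rather than the reporter's actual ones:)

```
# Illustrative only: devicemapper storage options controlling how thin
# devices are sized and torn down (values here are placeholders).
docker daemon --storage-driver=devicemapper \
  --storage-opt dm.basesize=100G \
  --storage-opt dm.use_deferred_removal=true \
  --storage-opt dm.use_deferred_deletion=true
```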
These are the logs (without debugging information available) around 2016-02-19T13:39:49.762851757Z, the creation time of the first container that was Created but never Started.

The inspection of the oldest of these containers gives:
I stripped away the actual command used and the labels used internally by the ecs-agent. The IDs given can be tracked in the logs.
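(To pull the same information on another machine, assuming the daemon still answers, the state block can be extracted directly; the container ID is a placeholder:)

```
# Dump only the State section of a stuck container as JSON.
docker inspect --format '{{json .State}}' <container-id>
```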
The logs were prepared on an already affected system, after restarting the docker daemon with the `--debug=true` flag. Part of the observable activity might be ascribed to the broken state the daemon is in, even after restarting, but no detailed log is available for the precise time when the trouble started.

The problems seem to have started two days ago, as our ecs-agent (the piece of software responsible for creating containers) repeatedly tried to instantiate new containers, at a rate of 10 per hour. All these containers are listed in the output of `docker ps -a` with the status Created, not Exited. A way to list just these containers is shown below.
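(A sketch, using the status filter that `docker ps` supports:)

```
# List containers stuck in the Created state, with their creation time.
docker ps -a --filter status=created --format '{{.ID}} {{.CreatedAt}} {{.Names}}'
```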
Here are the logs of the latest restart, using `--debug=true`. My comments follow in a separate entry.

Attachment: docker.txt (849 kB)