moby: devicemapper: After the docker service runs for a long time, the docker service cannot be restarted.

Description: We use devicemapper as the graph driver and run a stability test as follows:

  1. create/delete containers continuously.
  2. restart docker service randomly.

After one night, docker could not start up.

The error is:

messages:2017-06-07T10:41:43.932549+00:00 V2R1C00B052-GUESTOS-FS-KVM-X64 docker: time="2017-06-07T10:41:43.747631527Z" level=debug msg="devmapper: Error device setupBaseImage: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set cookie dm_task_set_cookie failed"
messages:2017-06-07T10:41:43.932765+00:00 V2R1C00B052-GUESTOS-FS-KVM-X64 docker: time="2017-06-07T10:41:43.747827726Z" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set cookie dm_task_set_cookie failed"

Furthermore, I tried to use dmsetup remove to remove one of the devices:

V2R1C00B052-GUESTOS-FS-KVM-X64:~ # dmsetup remove docker-8:2-402190-9623ed2972fa4f700eec99e1404959a5b7e64eac65d3ff541b22f6271c2ee38a
Limit for the maximum number of semaphores reached. You can check and set the limits in /proc/sys/kernel/sem.
Command failed
V2R1C00B052-GUESTOS-FS-KVM-X64:~ # 

So I used ipcs to check the semaphores:

V2R1C00B052-GUESTOS-FS-KVM-X64:~ # ipcs

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x0d4d3358 238977024  root       600        1
0x0d4d0ec9 270172161  root       600        1
0x0d4dc02e 281640962  root       600        1
0x0d4db8d2 291045379  root       600        1
0x0d4d4e76 291864580  root       600        1
0x0d4d825a 292388869  root       600        1
0x0d4d93ee 294256646  root       600        1
0x0d4da4a1 294879239  root       600        1
0x0d4d4125 295305224  root       600        1
.......
-->  128 leaked cookies in total; not all listed here.
.......

dmsetup udevcookies shows the same list as ipcs. Then I checked /proc/sys/kernel/sem:

V2R1C00B052-GUESTOS-FS-KVM-X64:~ # cat /proc/sys/kernel/sem
250     32000   32      128
V2R1C00B052-GUESTOS-FS-KVM-X64:~ # 
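For reference, here is how those four numbers relate to the leak; a small sketch (standard procfs and ipcs, nothing docker-specific) that compares the current number of semaphore arrays against the relevant limit:

```shell
# /proc/sys/kernel/sem holds four limits: SEMMSL (max semaphores per
# array), SEMMNS (system-wide max semaphores), SEMOPM (max operations
# per semop call), and SEMMNI (max number of semaphore arrays).
# SEMMNI is the 128 above, and it is what the leaked cookies exhaust.
read semmsl semmns semopm semmni < /proc/sys/kernel/sem

# Each allocated semaphore array appears in `ipcs -s` as a line
# starting with its 0x... key, so counting those lines gives usage.
used=$(ipcs -s | grep -c '^0x' || true)
echo "semaphore arrays: ${used} used of ${semmni} allowed"
```

When `used` approaches `semmni`, any further `dm_task_set_cookie` call fails exactly as in the logs above.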

The fourth field (SEMMNI, the maximum number of semaphore arrays) is 128, so I echoed a larger value into sem, and it worked:

echo 250 32000  32  1024 > /proc/sys/kernel/sem

After that, Docker could start up.

So I suppose there is a semaphore leak in DM, but I am not sure how it happens…

As a workaround, I can use dmsetup udevcomplete_all to clean up all the leaked cookies and recover the environment, but I think we should work out a proper solution for this situation.
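One way to approximate "cleanup at docker startup" today (my own workaround sketch, not something docker ships) is a systemd drop-in, e.g. /etc/systemd/system/docker.service.d/udev-cleanup.conf, that runs the same cleanup before every daemon start; the leading "-" tells systemd to ignore failures:

```ini
# Hypothetical drop-in: clear stale udev cookies before dockerd starts.
# Mirrors the `echo y | dmsetup udevcomplete_all` command used below.
[Service]
ExecStartPre=-/bin/sh -c 'echo y | dmsetup udevcomplete_all'
```

After adding the file, run `systemctl daemon-reload` so the drop-in takes effect.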

Steps to reproduce the issue:

  1. create/delete containers continuously.
  2. kill docker randomly.

After some time, use dmsetup udevcookies to check whether a semaphore leak exists.
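The check above can also be done without dmsetup. The leaked cookie semaphores all carry the 0x0d4d magic prefix visible in the ipcs listing earlier, so a rough count (a sketch, assuming that prefix) is:

```shell
# Count semaphore arrays whose key starts with 0x0d4d, the udev-cookie
# prefix libdevmapper uses (all leaked entries in the ipcs output
# above share it). A steadily growing count indicates a leak.
leaks=$(ipcs -s | awk '$1 ~ /^0x0d4d/' | wc -l)
echo "leaked udev cookies: ${leaks}"
```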

We found leaks in other environments too, but very few (fewer than 10).

Describe the results you received: A semaphore leak occurred and reached the limit, and docker could not start up.

Describe the results you expected: No semaphore leak, or cleanup of leaked semaphores at docker startup, so that docker keeps working fine.

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

root@localhost:~/workspace/huawei/docker# docker version
Client:
 Version:        1.11.2
 API version:    1.23
 Go version:     go1.7.1
 Git commit:     ff25c8a
 Built:          Thu Jun  8 15:37:09 2017
 OS/Arch:        linux/amd64

Server:
 Version:        1.11.2
 API version:    1.23
 Go version:     go1.7.1
 Git commit:     3515a27-unsupported
 Built:          Thu Jun  8 16:28:37 2017
 OS/Arch:        linux/amd64

Output of docker info:

root@localhost:~/workspace/huawei/docker# docker info
Containers: 210
 Running: 0
 Paused: 0
 Stopped: 210
Images: 11
Server Version: 1.11.2
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Hugetlb Pagesize: 2MB
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.6.0
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.67 GiB
Name: localhost
ID: LB2C:RVJO:DK5F:GVNI:QFYC:DTII:C3UB:6QHS:754W:LE3G:BKPP:EUIY
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): true
 File Descriptors: 12
 Goroutines: 25
 System Time: 2017-06-09T01:43:24.994823746+08:00
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 12
  • Comments: 40 (20 by maintainers)


Most upvoted comments

We see it in 17.06.0-ce as well.

Containers: 1
 Running: 0
 Paused: 0
 Stopped: 1
Images: 1
Server Version: 17.06.0-ce
Storage Driver: devicemapper
 Pool Name: docker-202:1-50535400-pool
 Pool Blocksize: 65.54kB
 Base Device Size: 10.74GB
 Backing Filesystem: xfs
 Data file: /dev/loop2
 Metadata file: /dev/loop3
 Data Space Used: 386.5MB
 Data Space Total: 107.4GB
 Data Space Available: 27.86GB
 Metadata Space Used: 847.9kB
 Metadata Space Total: 2.147GB
 Metadata Space Available: 2.147GB
 Thin Pool Minimum Free Space: 10.74GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfb82a876ecc11b5ca0977d1733adbe58599088a
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.26.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.26GiB
Name: ip-10-0-6-172
ID: FVV3:HC2U:WJYM:3MUJ:QUAO:7QB2:IB37:A46F:ZDGM:7LUX:6KF5:J2RK
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: devicemapper: usage of loopback devices is strongly discouraged for production use.
         Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
docker run alpine:3.6
Unable to find image 'alpine:3.6' locally
3.6: Pulling from library/alpine
88286f41530e: Extracting [==================================================>]   1.99MB/1.99MB
docker: failed to register layer: devmapper: Error activating devmapper device for '44c920b0f7b9007935f246ccf446991f877613a7478d90da652b9a13c14023b3': devicemapper: Can't set cookie dm_task_set_cookie failed.

It looks like the backported fix for the semaphore leak issue just missed the cutoff for 17.06.1 and was moved to 17.06.2, based on this pull request.

@thaJeztah @vieux @cpuguy83 would it be possible to get a release of 17.06.2, or some other kind of patch containing the pull request, so this issue can finally be put to bed? This is a real pain to deal with on a daily basis, and it's starting to affect more and more people, based on the duplicate issues that keep getting created.

Having the same problem on CentOS 7.3.1611. Really annoying: I can't start new containers anymore and always have to run echo 'y' | sudo dmsetup udevcomplete_all to get them up. So a solution would help us greatly as well.

Anyone who can't upgrade docker right now can increase the semaphore limit to postpone the problem to the near future 😀

e.g. `printf '250\t32000\t32\t200' > /proc/sys/kernel/sem`
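Note that writes to /proc do not survive a reboot; a persistent variant (the 200 above, or any value large enough for your leak rate, is just an example) would be:

```shell
# Persist the raised SEMMNI (fourth field) across reboots.
# The values here are examples; tune them to your environment.
echo "kernel.sem = 250 32000 32 200" >> /etc/sysctl.conf
sysctl -p
```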

Docker:

Server Version: 1.12.6
Storage Driver: devicemapper
 Pool Name: docker-8:1-2752513-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 31 GB
 Data Space Total: 322.1 GB
 Data Space Available: 291.1 GB
 Metadata Space Used: 28.52 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.119 GB
 Thin Pool Minimum Free Space: 32.21 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: bridge null host overlay
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp
Kernel Version: 3.10.0-514.21.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64

Issue:

failed to register layer: devmapper: Error activating devmapper device for '5e86a0c5bb41f1e3b565adecefe6ca9fef4ae4dc21fb7e1fabbf59adc23c9ab4': devicemapper: Can't set cookie dm_task_set_cookie failed.

Same here, RHEL 7.3, docker v17.06.0-ce

Almost identical setup to the above, docker v17.06.0-ce on RHEL 7.2. I'd like to add that this problem has occurred recently when using docker stack and repeatedly starting and stopping stacks. In fact, the problem occurs on all my manager and worker nodes.

Hi,

I encountered the same problem on docker-ce 17.06 / CentOS 7.3 after setting up devicemapper on a separate disk. After some containers, it starts appearing. I managed to restart some containers using echo 'y' | sudo dmsetup udevcomplete_all, but now it fails each time I want to start a new container (sad for a gitlab runner!).

Is there a long-term workaround, or should I switch to another storage driver? The docker docs explain that docker-ce on CentOS should use devicemapper: https://docs.docker.com/engine/userguide/storagedriver/selectadriver/#docker-ce

Thanks for your help