moby: [1.11.0] Possible deadlock on container object

Originally reported by @mblaschke in https://github.com/docker/docker/issues/13885#issuecomment-210639112

Creating a different issue because it may be a 1.11 regression.

The goroutine trace at https://gist.github.com/tonistiigi/9d79de62b2f7919f33a9e987619b9de8 seems to show many goroutines waiting on a container lock. No goroutine in that trace obviously holds the lock, so we may have a code path that returns without releasing it.
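To illustrate the kind of bug suspected here, below is a minimal Go sketch of a code path that returns without releasing a mutex, next to the usual `defer`-based fix. The `Container` type, `pauseLeaky`, `pauseSafe`, and `lockHeld` are hypothetical names made up for this example; they are not the daemon's actual code, and the real cause had not been identified when this was written. `lockHeld` relies on `sync.Mutex.TryLock`, which needs Go 1.18 or newer (the affected daemon was built with go1.5.4).

```go
package main

import (
	"errors"
	"sync"
)

// Container is a hypothetical stand-in for the daemon's container object;
// it is not Docker's actual type.
type Container struct {
	mu      sync.Mutex
	running bool
	paused  bool
}

var errNotRunning = errors.New("container is not running")

// pauseLeaky shows the suspected bug pattern: the early return on the
// error path skips Unlock, so the mutex stays held forever and every
// later Lock call on this container blocks.
func pauseLeaky(c *Container) error {
	c.mu.Lock()
	if !c.running {
		return errNotRunning // BUG: returns with c.mu still locked
	}
	c.paused = true
	c.mu.Unlock()
	return nil
}

// pauseSafe releases the lock on every return path by deferring Unlock.
func pauseSafe(c *Container) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.running {
		return errNotRunning
	}
	c.paused = true
	return nil
}

// lockHeld reports whether c.mu is currently held. It is meant for
// diagnostics only and uses sync.Mutex.TryLock (Go 1.18+).
func lockHeld(c *Container) bool {
	if c.mu.TryLock() {
		c.mu.Unlock()
		return false
	}
	return true
}
```

Calling `pauseLeaky` on a stopped container leaves the mutex locked with no goroutine holding it in any stack trace, which would match a dump where many goroutines wait on a lock and none appears to own it.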

Original report:

Since we updated to 1.11.0, running rspec Docker image tests (~10 parallel containers running these tests on a 4-CPU machine) sometimes freezes and fails with a timeout. Docker freezes completely and doesn't respond (e.g. to docker ps). This happens on a vserver with Debian stretch (btrfs) and on a (Vagrant) Parallels VM with Ubuntu 14.04 (backported kernel 3.19.0-31-generic, ext4).

The filesystem for /var/lib/docker on both servers was cleared (btrfs was recreated) after the first freeze. The freeze happens randomly when running these tests.

Stack trace is attached from both servers: docker-log.zip

strace from docker-containerd and docker daemons:

# strace -p 21979 -p 22536
Process 21979 attached
Process 22536 attached
[pid 22536] futex(0x219bd90, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21979] futex(0xf9b170, FUTEX_WAIT, 0, NULL

Docker info (Ubuntu 14.04 with backported kernel)

Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64
root@DEV-VM:/var/lib/docker# docker info
Containers: 11
 Running: 1
 Paused: 0
 Stopped: 10
Images: 877
Server Version: 1.11.0
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 400
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 3.19.0-31-generic
Operating System: Ubuntu 14.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 3.282 GiB
Name: DEV-VM
ID: KCQP:OGCT:3MLX:TAQD:2XG6:HBG2:DPOM:GJXY:NDMK:BXCK:QEIT:D6KM
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/

Docker version (Ubuntu 14.04 with backported kernel)

Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 26 (10 by maintainers)

Most upvoted comments

This is confirmed fixed in 1.11.2, as far as I’m concerned.

On Mon, Jun 20, 2016, 03:10 Daniel Huhn notifications@github.com wrote:

Any updates yet?

I updated our prod cluster to 1.11.2. Now our monitoring (Datadog) reports the daemons going down sometimes but they become responsive again after a minute or two:

[image: Datadog monitoring graph]

However, this does not apply to all hosts, even though they all run Ubuntu 14.04.4 LTS (with KVM, kernel 3.13.0-88-generic).