moby: `docker inspect` can get stuck for many minutes
We have production issues with docker 1.10.3 (binary from //get.docker.com/builds/Linux/x86_64/docker-1.10.3) where we observed:
- `docker inspect` taking over 90 seconds
- `docker ps -a` taking over 20 minutes
- `docker info` taking over 2 minutes
None of those times are predictable; those are the longest times recorded.
Environment:
- `m4.4xlarge` EC2 instances
- `amazon-ecs-agent` 1.8.2
- Ubuntu 16.04 AMI
- `overlay` as a storage driver
- `journald` as a logging driver
- Plenty of margin for memory, CPU, network, storage I/O.
In all cases, `docker info` displayed fewer than 100 containers total. We don’t have a good way to reproduce, and this has only affected our production environment.
We have identified that this issue is caused by semaphore contention in `datastore.(*datastore).List` in the case of `docker info`. Here is how we found it (TL;DR: we wrote a poor person’s sampling profiler):
1. Start docker with `DEBUG=1` (or create `/etc/docker/daemon.json` with `{"debug":true}` and send `SIGHUP` to `docker daemon`).
2. Expose Docker’s Unix socket with `$ sudo socat -d TCP-LISTEN:8080,fork,bind=127.0.0.1 UNIX:/var/run/docker.sock`
3. Run https://gist.github.com/pcarrier/5cf3845f26f52ee4b8c42253ac09cecd and collect its output; leave it running in the background (a rough sketch of the idea is shown below the result).
4. Run `docker info`. When it finishes, the script in (3) should terminate.
5. Process the output from (3) with `sort | uniq -c | sort -nr`.
The result:

```
    627 sync.runtime_Semacquire sync.(*Mutex).Lock github.com/docker/libnetwork/datastore.(*datastore).List github.com/docker/libnetwork.(*controller).getNetworksFromStore github.com/docker/libnetwork.(*controller).Networks
      1 syscall.Syscall syscall.read syscall.Read os.(*File).read os.(*File).Read
      1 encoding/json.fieldByIndex encoding/json.(*structEncoder).encode encoding/json.(*structEncoder).(encoding/json.encode)-fm encoding/json.(*structEncoder).encode encoding/json.(*structEncoder).(encoding/json.encode)-fm
```
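
For reference, here is a minimal sketch of what the sampling script in step 3 can look like. It assumes that the daemon, when debug is enabled, serves Go’s pprof handlers under `/debug/pprof/` on the API socket, reached here through the socat proxy from step 2 on `127.0.0.1:8080`; the actual gist may differ, and unlike it this sketch does not stop on its own when `docker info` returns.

```bash
#!/usr/bin/env bash
# Hypothetical approximation of the sampling script from step 3: repeatedly
# fetch the daemon's full goroutine dump through the socat proxy and print one
# line per goroutine, listing its stack frames (most recent call first).
while sleep 0.1; do
  curl -s 'http://127.0.0.1:8080/debug/pprof/goroutine?debug=2' |
  awk '
    /^goroutine /   { if (stack != "") print stack; stack = ""; next }  # header starts a new goroutine
    /^created by /  { next }                                            # skip the "created by" trailer
    /^[^[:space:]]/ {                                                   # function frames start in column 0
      frame = $0
      sub(/\([^(]*$/, "", frame)                                        # drop the trailing argument list
      stack = (stack == "" ? frame : stack " " frame)
    }
    END { if (stack != "") print stack }
  '
done
```

Left running in the background while `docker info` hangs, its accumulated output can be aggregated with `sort | uniq -c | sort -nr` exactly as in step 5.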
About this issue

- State: closed
- Created 8 years ago
- Comments: 24 (20 by maintainers)
Yes. I’ve written another poor man’s sampling profiler to look at what holds the lock; I’ll try to collect that from my dev environment. (I’m not sure we can reliably reproduce the issue right now, but I’ve seen it in the last few days; sadly, I rotate instances pretty often in my dev env instead of creating separate ones.)
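
A minimal sketch of that second profiler, under the same assumptions as the first one (pprof goroutine dumps reachable through the socat proxy on `127.0.0.1:8080`): instead of counting blocked stacks, keep only the goroutines that are inside libnetwork but *not* parked in `sync.runtime_Semacquire`; whatever those are doing at sampling time is a good candidate for the code path holding the datastore lock.

```bash
# Hypothetical sketch: fetch the full goroutine dump, treat each
# blank-line-separated block as one record (awk paragraph mode), and keep only
# libnetwork goroutines that are not themselves blocked on the semaphore --
# i.e. likely holders of the datastore lock.
curl -s 'http://127.0.0.1:8080/debug/pprof/goroutine?debug=2' |
  awk -v RS= -v ORS='\n\n' '/libnetwork/ && !/sync\.runtime_Semacquire/'
```

Run in a loop while the hang is in progress, the same way as the first script.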