moby: `docker inspect` can get stuck for many minutes

We are hitting production issues with Docker 1.10.3 (binary from //get.docker.com/builds/Linux/x86_64/docker-1.10.3), where we have observed:

  • docker inspect taking over 90 seconds
  • docker ps -a taking over 20 minutes
  • docker info taking over 2 minutes

None of these times is predictable; the figures above are simply the longest we have recorded.

Environment:

  • m4.4xlarge EC2 instances
  • amazon-ecs-agent 1.8.2
  • Ubuntu 16.04 AMI
  • overlay as a storage driver
  • journald as a logging driver
  • Plenty of headroom for memory, CPU, network, and storage I/O.

In all cases, docker info reported fewer than 100 containers in total. We don’t have a good way to reproduce the issue, and it has only affected our production environment.

We have identified the cause, at least for docker info: semaphore contention in datastore.(*datastore).List. Here is how we found it (TL;DR: we wrote a poor person’s sampling profiler):

  1. Start docker with DEBUG=1 (or create /etc/docker/daemon.json with {"debug":true} and send SIGHUP to the docker daemon).
  2. Expose Docker’s Unix socket with $ sudo socat -d TCP-LISTEN:8080,fork,bind=127.0.0.1 UNIX:/var/run/docker.sock
  3. Run https://gist.github.com/pcarrier/5cf3845f26f52ee4b8c42253ac09cecd and collect its output; leave it running in the background (a rough sketch of that sampler appears after these steps).
  4. Run docker info. When it finishes, the script in (3) should terminate.
  5. Process the output from (3) with sort|uniq -c|sort -nr
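For reference, here is a rough sketch of such a sampler. I don’t know exactly what the linked gist does (it may, for example, track only the goroutine serving the request); the 100 ms interval and the five-frame cap are assumptions chosen to match the shape of the output below. It relies on the daemon serving Go’s pprof handlers under /debug/pprof when debug is enabled, reached through the socat forward from step 2:

    // sampler.go — a rough stand-in for the gist in step 3; the real gist may
    // work differently. Every 100 ms (arbitrary) it fetches the full goroutine
    // dump from the debug API and prints the top five frames of each goroutine
    // on a single line, so the shell pipeline from step 5 can aggregate them.
    package main

    import (
        "bufio"
        "fmt"
        "net/http"
        "strings"
        "time"
    )

    func main() {
        for {
            resp, err := http.Get("http://127.0.0.1:8080/debug/pprof/goroutine?debug=2")
            if err != nil {
                time.Sleep(100 * time.Millisecond)
                continue
            }
            frames := []string{}
            sc := bufio.NewScanner(resp.Body)
            sc.Buffer(make([]byte, 1<<20), 1<<20)
            for sc.Scan() {
                line := sc.Text()
                switch {
                case strings.HasPrefix(line, "goroutine "): // header: new goroutine block
                    frames = frames[:0]
                case line == "": // blank line ends a block: emit the collapsed stack
                    if len(frames) > 0 {
                        fmt.Println(strings.Join(frames, " "))
                    }
                    frames = frames[:0]
                case strings.HasPrefix(line, "\t"), strings.HasPrefix(line, "created by"):
                    // file:line and "created by" lines carry no frame name we need
                default:
                    // function line, e.g. "sync.(*Mutex).Lock(0xc8202f4e70)"
                    if i := strings.LastIndex(line, "("); i > 0 && len(frames) < 5 {
                        frames = append(frames, line[:i])
                    }
                }
            }
            resp.Body.Close()
            time.Sleep(100 * time.Millisecond)
        }
    }

Redirect its output to a file and run the pipeline from step 5 over it to get a histogram like the one below.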

The result:

    627 sync.runtime_Semacquire sync.(*Mutex).Lock github.com/docker/libnetwork/datastore.(*datastore).List github.com/docker/libnetwork.(*controller).getNetworksFromStore github.com/docker/libnetwork.(*controller).Networks
      1 syscall.Syscall syscall.read syscall.Read os.(*File).read os.(*File).Read
      1 encoding/json.fieldByIndex encoding/json.(*structEncoder).encode encoding/json.(*structEncoder).(encoding/json.encode)-fm encoding/json.(*structEncoder).encode encoding/json.(*structEncoder).(encoding/json.encode)-fm
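In other words, 627 of the 629 samples caught goroutines parked in sync.(*Mutex).Lock on their way into datastore.(*datastore).List: the time goes to waiting for the libnetwork datastore mutex, not to doing work. The toy program below (not libnetwork code; the store type, counts, and timings are invented) reproduces that signature: one goroutine holds a store-wide mutex while many List() callers queue behind it, and a goroutine dump shows them in exactly these frames.

    // contention.go — a toy illustration (names and timings invented) of the
    // pattern the samples show: one goroutine holds a store-wide mutex while
    // every caller of List() queues up behind it.
    package main

    import (
        "fmt"
        "os"
        "runtime/pprof"
        "sync"
        "time"
    )

    type store struct {
        mu    sync.Mutex
        items []string
    }

    // List takes the store-wide lock before reading, mirroring the shape of
    // datastore.(*datastore).List.
    func (s *store) List() []string {
        s.mu.Lock()
        defer s.mu.Unlock()
        return append([]string(nil), s.items...)
    }

    func main() {
        s := &store{items: []string{"a", "b"}}

        // Stand-in for whatever slow operation is holding the datastore lock.
        go func() {
            s.mu.Lock()
            time.Sleep(5 * time.Second)
            s.mu.Unlock()
        }()
        time.Sleep(100 * time.Millisecond)

        // Many readers now block inside List().
        for i := 0; i < 50; i++ {
            go func() { fmt.Println(len(s.List())) }()
        }
        time.Sleep(500 * time.Millisecond)

        // The goroutine dump shows the readers blocked in sync.(*Mutex).Lock
        // (reported as sync.runtime_Semacquire or runtime_SemacquireMutex,
        // depending on the Go version) inside (*store).List, matching the
        // samples above.
        pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
    }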

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 24 (20 by maintainers)

Most upvoted comments

Yes. I’ve written another poor man’s sampling profiler to look at what holds the lock; I’ll try to collect that data from my dev environment. (I’m not sure we can reliably reproduce the issue right now, but I’ve seen it in the last few days; sadly, I rotate instances pretty often in my dev env instead of creating separate ones.)
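One way to build such a lock-holder profiler (the author’s actual script may work differently) is to sample the same goroutine dump but keep only the goroutines that are inside libnetwork code and not blocked on the semaphore; whatever shows up repeatedly is a good candidate for the current holder of the datastore mutex:

    // holders.go — a sketch of one way to "look at what holds the lock"; the
    // author's second script may differ. It samples the same goroutine dump,
    // but keeps only goroutines that are inside libnetwork code and are not
    // waiting on the semaphore — candidates for the current mutex holder.
    package main

    import (
        "fmt"
        "io"
        "net/http"
        "strings"
        "time"
    )

    func main() {
        for {
            resp, err := http.Get("http://127.0.0.1:8080/debug/pprof/goroutine?debug=2")
            if err != nil {
                time.Sleep(100 * time.Millisecond)
                continue
            }
            body, _ := io.ReadAll(resp.Body)
            resp.Body.Close()

            // debug=2 output separates goroutine blocks with blank lines.
            for _, block := range strings.Split(string(body), "\n\n") {
                if strings.Contains(block, "github.com/docker/libnetwork") &&
                    !strings.Contains(block, "sync.runtime_Semacquire") {
                    fmt.Println("--- candidate lock holder ---")
                    fmt.Println(block)
                    fmt.Println()
                }
            }
            time.Sleep(100 * time.Millisecond)
        }
    }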