moby: Docker can hang indefinitely waiting for a nonexistent process to pull an image.

Running docker pull will simply hang, waiting for a nonexistent process to download the repository.

root@ip-172-31-18-106:~# docker pull ubuntu:trusty
Repository ubuntu already being pulled by another client. Waiting.

This is the same behavior as #3115; however, there is no other docker process running.

The list of running docker containers:

# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

See here for a full process tree: https://gist.github.com/tfoote/c8a30e569c911f1977e2

When this happens, my process monitor fails the job after 120 minutes; this occurs regularly.
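
To keep the job from sitting there for two hours, the pull can be wrapped in a hard timeout so a hung client fails fast. A minimal sketch (the 600-second limit and the image tag are placeholders, not what the buildfarm actually uses):

# Fail fast instead of letting the monitor wait 120 minutes on a hung pull.
# Timeout value and image are illustrative only.
timeout 600 docker pull ubuntu:trusty || echo "docker pull hung or failed"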

An strace of the docker instance can be found here: https://gist.github.com/tfoote/1dc3905eb9c235cb5c53

It is stuck on an epoll_wait call.
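
An equivalent trace can be captured by attaching to the daemon with something like the following (a sketch; pgrep -o -x docker assumes the daemon is the oldest process named docker):

# Attach to the running daemon, follow its threads, timestamp each syscall.
strace -f -tt -p "$(pgrep -o -x docker)" -o /tmp/docker-pull-hang.strace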

Here’s all the standard info.

root@ip-172-31-18-106:~# docker version
Client version: 1.5.0
Client API version: 1.17
Go version (client): go1.4.1
Git commit (client): a8a31ef
OS/Arch (client): linux/amd64
Server version: 1.5.0
Server API version: 1.17
Go version (server): go1.4.1
Git commit (server): a8a31ef

root@ip-172-31-18-106:~# docker -D info
Containers: 132
Images: 6667
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 6953
Execution Driver: native-0.2
Kernel Version: 3.13.0-44-generic
Operating System: Ubuntu 14.04.1 LTS
CPUs: 4
Total Memory: 14.69 GiB
Name: ip-172-31-18-106
ID: SZWS:VD6O:CLP2:WRAM:KWIL:47HZ:HOEY:SR6R:ZOWR:E3HG:PS7P:TCZP
Debug mode (server): false
Debug mode (client): true
Fds: 27
Goroutines: 32
EventsListeners: 0
Init Path: /usr/bin/docker
Docker Root Dir: /var/lib/docker
WARNING: No swap limit support

root@ip-172-31-18-106:~# uname -a
Linux ip-172-31-18-106 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

It’s running on AWS.

I'm running an instance of the ROS buildfarm, which can reproduce this bad state once every couple of days when fully loaded running Debian package builds at ~100% CPU load. This happens when we are preparing a major release.

I have not been able to isolate the cause in a smaller example; it has happened on multiple different repositories. Sometimes it's the official Ubuntu repository, sometimes it's our own custom repositories. We've tracked a few instances of this error recently here. When one repository is failing to pull, others work fine. All the repositories are hosted on the public Docker Hub.

Here’s an example of one hanging while another passes.

root@ip-172-31-18-106:~# docker pull ubuntu:saucy
Pulling repository ubuntu
^Croot@ip-172-31-18-106:~# docker pull ubuntu:saucy^C
root@ip-172-31-18-106:~# docker pull osrf/ubuntu_32bit
Pulling repository osrf/ubuntu_32bit
FATA[0000] Tag latest not found in repository osrf/ubuntu_32bit 
root@ip-172-31-18-106:~# docker pull osrf/ubuntu_32bit:saucy
Pulling repository osrf/ubuntu_32bit
d6a6e4bd19d5: Download complete 
Status: Image is up to date for osrf/ubuntu_32bit:saucy

As determined in #3115, this can be fixed by restarting Docker. However, per that issue this should not happen anymore, so I think there has either been a regression or we've found another edge case.
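
Concretely, on this Ubuntu 14.04 host the daemon is managed by upstart, so the workaround amounts to:

# Restarting the daemon clears the stuck "already being pulled" state.
sudo service docker restart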

I will keep the machine online for a few days if anyone has suggestions on what I can run to debug the issue. Otherwise I'll have to wait for it to recur before I can test any debugging.
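
A couple of things that can be captured the next time it wedges, in case they help (the pgrep call assumes the daemon is the oldest process named docker):

# Daemon log (upstart on Ubuntu 14.04 writes it here).
tail -n 200 /var/log/upstart/docker.log
# Open files and sockets held by the daemon, to see whether a registry connection is stuck.
lsof -nP -p "$(pgrep -o -x docker)" | grep -i tcp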

About this issue

  • State: closed
  • Created 9 years ago
  • Reactions: 3
  • Comments: 92 (24 by maintainers)

Most upvoted comments

Same here (restarting the docker daemon resolves the issue, however).

I’m seeing this happen on AWS / ECS - we do a docker pull and for some reason the network connection drops. Then our deploy is stuck since the pull hangs indefinitely.
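
A wrapper along these lines keeps a deploy from hanging forever on a wedged pull (a sketch; the image name and limits are placeholders):

# Give each pull attempt 5 minutes and try at most 3 times before failing the deploy.
for i in 1 2 3; do
    timeout 300 docker pull myorg/myservice:latest && break
    echo "pull attempt $i hung or failed; retrying"
done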

Same problem here. This is triggered by a network reconfiguration of the host. "Fixed" via "docker-machine restart".
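
For docker-machine that amounts to something like the following ("default" is a placeholder for the machine name):

# Restart the VM that hosts the daemon, which clears the stuck pull state.
docker-machine restart default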

After this long, not even a response from the team? It can't be only us suffering from this problem.

We have been seeing a problem with identical symptoms. Most failed builds of Helios have tests that fail due to this issue.