moby: Sporadic RPC error "containerd: container did not start before the specified timeout"

Output of docker version:

Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:17:17 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:17:17 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 38
 Running: 23
 Paused: 0
 Stopped: 15
Images: 37
Server Version: 1.11.0
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 4.4.0-0.bpo.1-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.24 GiB
Name: talon.one
ID: 4S3Y:6F4A:EYIB:DWKQ:C5C7:5RLX:YAM6:O426:BLGU:HP47:KN6J:FA5N
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No kernel memory limit support

Additional environment details (AWS, VirtualBox, physical, etc.): Physical box: Debian 8.4 Linux talon.one 4.4.0-0.bpo.1-amd64 #1 SMP Debian 4.4.6-1~bpo8+1 (2016-03-20) x86_64 GNU/Linux

Steps to reproduce the issue:

We build our image using Drone, and at the end of the build the new docker image is automatically deployed on the same server:

$ docker run --restart=always --name talon-master-api -e POSTGRES_PORT_5432_TCP_ADDR=talon-master-postgres --env-file=/home/talonone/secrets/salesforce --net=talon-master-nw --expose=9000 -d docker.talon.one/talon-api/master:latest
27625ea8e6ca8d59c2b501451846d476a0ce50f8c6d87a7e465be255d5a3de7a

Sometimes we get the response:

docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container did not start before the specified timeout".

The issue is similar to https://github.com/docker/docker/issues/22053, but we don't use Docker Compose.

I can also reproduce this issue by manually doing:

docker restart talon-master-api

The command then hangs for a while, with intermittent kernel messages, before finally failing:

Message from syslogd@talon at Apr 21 13:59:30 ...
 kernel:[71227.458071] unregister_netdevice: waiting for lo to become free. Usage count = 1
Error response from daemon: Cannot restart container talon-master-api: rpc error: code = 2 desc = "containerd: container did not start before the specified timeout"

So this issue might be triggered by, or connected to, the open issue https://github.com/docker/docker/issues/5618.
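
For reference, a way to watch for that kernel symptom while restarting the container is sketched below (a generic invocation, not output from my machine; dmesg -w follows new kernel messages):

# Follow kernel messages and filter for the "waiting for lo to become free" symptom
# while restarting the container (name taken from above) in another shell.
dmesg -wT | grep unregister_netdevice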

I restarted the Docker daemon with systemctl after adding the --debug flag to the systemd unit. This made the problem go away temporarily; I suspect that is because restarting the daemon also clears the "waiting for lo to become free" condition.
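
For completeness, a sketch of the kind of systemd drop-in I mean (the exact ExecStart line depends on the packaged unit file, so treat this as an assumption rather than my exact configuration):

# Create a drop-in that re-declares ExecStart with --debug (Docker 1.11 still
# uses the "docker daemon" subcommand), then reload systemd and restart the daemon.
mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/debug.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/docker daemon --debug
EOF
systemctl daemon-reload
systemctl restart docker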

After around 2 minutes, the “waiting for lo to become free” problem reappears.

While trying to reproduce this issue, I ran into another problem (not sure if it is connected): docker start and docker ps see different states for one of the containers:

root@talon /home/talonone # docker start demo-telco-master
Error response from daemon: Container 8b8eac92b62c06297ec87a1f27ef7e3d26aabd8b571bc026f88edbf9538d1e2c is aleady active
Error: failed to start containers: demo-telco-master
root@talon /home/talonone # docker ps | grep demo-telco-master
root@talon /home/talonone # docker ps -a | grep demo-telco-master
8b8eac92b62c        docker.talon.one/demo-telco/master:latest             "/bin/sh -c 'ruby app"   17 hours ago        Exited (128) 6 minutes ago                                             demo-telco-master

I will file a separate issue for this.
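
For completeness, a generic way to see what state the daemon itself records for that container (a standard docker inspect invocation, not output from my machine):

# Show the daemon's view of the container state and its recorded PID.
docker inspect --format '{{.State.Status}} (pid {{.State.Pid}})' demo-telco-master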

With the daemon in debug mode, I wasn't able to quickly reproduce the main issue (container did not start). I will submit daemon debug logs as soon as it reappears.

I will gladly supply any needed debug info to help resolve the issue.

Additional information you deem important (e.g. issue happens only occasionally):

Dockerfile:

FROM alpine:latest

RUN apk --update upgrade && \
    apk add ca-certificates && \
    update-ca-certificates && \
    rm -rf /var/cache/apk/*

COPY talon /talon/talon
WORKDIR /talon
ENV PATH=$PATH:/talon 
CMD ["talon"]

EXPOSE 9000   
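
The Drone pipeline itself is not shown here; a minimal sketch of how the image is built and pushed before the docker run above would look something like this (the tag is taken from the run command, the build context path is an assumption):

# Build the image from the directory containing the Dockerfile, then push it
# to our private registry so the deploy step can run it.
docker build -t docker.talon.one/talon-api/master:latest .
docker push docker.talon.one/talon-api/master:latest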

Reference to discussion on twitter: https://twitter.com/mntmn/status/723108094620786688

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 12
  • Comments: 74 (24 by maintainers)

Most upvoted comments

This issue is quite serious and is a big time-eater. After waiting for so long, we just switched back to version 1.10.3. It looks like the Docker team does not consider it serious.

I see it every time I run more than 20 instances of the same container.

docker run -d -p 5900 --privileged --link selenium-hub:hub -v /dev/shm:/dev/shm -v $PWD/config.json:/opt/selenium/config.json selenium/node-chrome

docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".

I see another one as well: docker: Error response from daemon: rpc error: code = 2 desc = "oci runtime error: read parent: connection reset by peer".
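
A rough sketch of the reproduction loop for the "more than 20 instances" case above (the hub container name, config.json path, and instance count are assumptions based on the docker run line I posted):

# Start 25 node-chrome containers against an already-running selenium-hub;
# with this many instances, some runs fail with the rpc errors quoted above.
for i in $(seq 1 25); do
  docker run -d -p 5900 --privileged --link selenium-hub:hub \
    -v /dev/shm:/dev/shm -v "$PWD/config.json":/opt/selenium/config.json \
    selenium/node-chrome
done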

Docker version:

$ docker -v
Docker version 1.11.1, build 5604cbe

Docker info:

$ docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 5
Server Version: 1.11.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 43
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.4.0-21-generic
Operating System: Ubuntu 16.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 240.2 GiB
Name: ip-xx-xx-xx-xxx
ID: EQAD:4GMT:5DJF:KVOK:QW76:WCEP:QKPC:TGCT:JREP:CK5J:SIKO:TEEM
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/

Ubuntu version:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04 LTS
Release:        16.04
Codename:       xenial

Kernel version:

$ uname -r
4.4.0-21-generic

I only tried https://github.com/docker/docker/issues/22690#issuecomment-218757471 and it seems to work well right now, but I still need to check whether it crashes again.

EDIT: it doesn't appear nearly as often anymore, but it still happens from time to time.