moby: Sporadic RPC error "containerd: container did not start before the specified timeout"
Output of docker version:
Client:
Version: 1.11.0
API version: 1.23
Go version: go1.5.4
Git commit: 4dc5990
Built: Wed Apr 13 18:17:17 2016
OS/Arch: linux/amd64
Server:
Version: 1.11.0
API version: 1.23
Go version: go1.5.4
Git commit: 4dc5990
Built: Wed Apr 13 18:17:17 2016
OS/Arch: linux/amd64
Output of docker info:
Containers: 38
Running: 23
Paused: 0
Stopped: 15
Images: 37
Server Version: 1.11.0
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge null host
Kernel Version: 4.4.0-0.bpo.1-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.24 GiB
Name: talon.one
ID: 4S3Y:6F4A:EYIB:DWKQ:C5C7:5RLX:YAM6:O426:BLGU:HP47:KN6J:FA5N
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No kernel memory limit support
Additional environment details (AWS, VirtualBox, physical, etc.): Physical box: Debian 8.4 Linux talon.one 4.4.0-0.bpo.1-amd64 #1 SMP Debian 4.4.6-1~bpo8+1 (2016-03-20) x86_64 GNU/Linux
Steps to reproduce the issue:
We build our image using Drone, and at the end of the build the new docker image is automatically deployed on the same server:
$ docker run --restart=always --name talon-master-api -e POSTGRES_PORT_5432_TCP_ADDR=talon-master-postgres --env-file=/home/talonone/secrets/salesforce --net=talon-master-nw --expose=9000 -d docker.talon.one/talon-api/master:latest
27625ea8e6ca8d59c2b501451846d476a0ce50f8c6d87a7e465be255d5a3de7a
Sometimes we get the response:
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container did not start before the specified timeout".
The issue is similar to https://github.com/docker/docker/issues/22053, but we don’t use Docker Compose.
I can also reproduce this issue by manually doing:
docker restart talon-master-api
The command then hangs for a while, printing intermittent kernel messages before failing:
Message from syslogd@talon at Apr 21 13:59:30 ...
kernel:[71227.458071] unregister_netdevice: waiting for lo to become free. Usage count = 1
Error response from daemon: Cannot restart container talon-master-api: rpc error: code = 2 desc = "containerd: container did not start before the specified timeout"
So this issue might be triggered by / connected to the open issue https://github.com/docker/docker/issues/5618
I restarted the docker daemon with systemctl after adding the --debug flag in the systemd unit. This made the problem go away temporarily; I suspect this is because it cleans up the “waiting for lo to become free” bug.
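For reference, this is roughly how I enabled it (a sketch; the ExecStart line assumes the stock Debian unit file, so check systemctl cat docker and adjust the path and flags to match your host):

```shell
#!/bin/sh
# Sketch: install a systemd drop-in that starts the Docker daemon with
# --debug. The ExecStart line assumes the stock Debian docker.service
# (verify with: systemctl cat docker). The target directory is a
# parameter so the function can be dry-run outside /etc.
install_debug_dropin() {
    dir="${1:-/etc/systemd/system/docker.service.d}"
    mkdir -p "$dir"
    cat > "$dir/debug.conf" <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/docker daemon -H fd:// --debug
EOF
}
# On the host, after running install_debug_dropin with no argument:
#   systemctl daemon-reload && systemctl restart docker
```

The empty ExecStart= line is required: systemd appends ExecStart entries from drop-ins, so the first line clears the one from the original unit before the replacement is set.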
After around 2 minutes, the “waiting for lo to become free” problem reappears.
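A quick way to check whether the kernel bug is currently active before touching a container (a sketch; the match string is taken from the syslog line above, and the "-" stdin mode is only there so the helper can be exercised without dmesg access):

```shell
#!/bin/sh
# Count "unregister_netdevice: waiting for ... to become free" lines in the
# kernel log; a non-zero count suggests the refcount bug (docker issue 5618)
# is active and a docker restart is likely to hang. Pass "-" to read from
# stdin instead of dmesg.
count_netdev_waits() {
    if [ "${1:-}" = "-" ]; then
        cat
    else
        dmesg
    fi | grep -c 'unregister_netdevice: waiting for'
}
# On the affected host: count_netdev_waits
```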
While trying to reproduce this issue, I ran into another problem (not sure if connected): docker start and docker ps see a different state of one of the containers:
root@talon /home/talonone # docker start demo-telco-master
Error response from daemon: Container 8b8eac92b62c06297ec87a1f27ef7e3d26aabd8b571bc026f88edbf9538d1e2c is aleady active
Error: failed to start containers: demo-telco-master
root@talon /home/talonone # docker ps | grep demo-telco-master
root@talon /home/talonone # docker ps -a | grep demo-telco-master
8b8eac92b62c docker.talon.one/demo-telco/master:latest "/bin/sh -c 'ruby app" 17 hours ago Exited (128) 6 minutes ago demo-telco-master
I will file a separate issue for this.
With the daemon in debug mode, I wasn’t quickly able to reproduce the main issue (container did not start). I will submit daemon debug logs as soon as it reappears.
I will gladly supply any needed debug info to help resolve the issue.
Additional information you deem important (e.g. issue happens only occasionally):
Dockerfile:
FROM alpine:latest
RUN apk --update upgrade && \
apk add ca-certificates && \
update-ca-certificates && \
rm -rf /var/cache/apk/*
COPY talon /talon/talon
WORKDIR /talon
ENV PATH=$PATH:/talon
CMD ["talon"]
EXPOSE 9000
Reference to discussion on twitter: https://twitter.com/mntmn/status/723108094620786688
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 12
- Comments: 74 (24 by maintainers)
This issue is quite serious and is a big time-eater. After waiting so long, we just switched back to version 1.10.3. It looks like the Docker team does not consider it serious.
I see it every time I run more than 20 instances of the same container.
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
I see another one as well
docker: Error response from daemon: rpc error: code = 2 desc = "oci runtime error: read parent: connection reset by peer".
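The "more than 20 instances" case can be reproduced with a loop like the following (a sketch; the runner command is a parameter so the loop itself can be dry-run, and the alpine/sleep payload in the usage comment is just an illustration, not the image from this report):

```shell
#!/bin/sh
# Start N copies of the same command in parallel. With docker as the runner,
# pushing N above ~20 is what triggers the "container did not start before
# the specified timeout" / "container not started" errors described above.
run_many() {
    n="$1"; shift
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" &
        i=$((i + 1))
    done
    wait
}
# On the affected host, for example:
#   run_many 25 docker run -d alpine sleep 60
```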
Docker version:
$ docker -v
Docker version 1.11.1, build 5604cbe
Docker info:
Ubuntu version:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04 LTS
Release: 16.04
Codename: xenial
Kernel version:
$ uname -r
4.4.0-21-generic
I only tried https://github.com/docker/docker/issues/22690#issuecomment-218757471 and it seems to work well right now, but I still need to watch whether it crashes again.
EDIT: it doesn’t appear extremely often anymore, but it still happens from time to time.