moby: Docker daemon is leaking file descriptors (with reproduction)
Description
Not sure this is related, but in production, after a few days or weeks, dockerd starts leaking so many file descriptors of dead connections that it simply stops responding at all (not even to SIGUSR1).
While trying to understand how this can happen (since when it happens I can't do anything but kill it), I tried to figure out how a connection ends up dead, and reached some conclusions on how to get dockerd to leak such file descriptors, although I'm not sure that what I'm about to describe is the same issue that happens in production.
What do those leaked file descriptors look like? On Linux I run:
ss | grep docker.sock
and I see tons of entries, most of which look like this:
u_str ESTAB 0 0 /var/run/docker.sock 272245 * 0
u_str ESTAB 0 0 /var/run/docker.sock 279780 * 0
u_str ESTAB 0 0 /var/run/docker.sock 279118 * 0
u_str ESTAB 0 0 /var/run/docker.sock 272201 * 0
u_str ESTAB 0 0 /var/run/docker.sock 272217 * 0
You can tell these are dead connections because the peer info (the * 0 at the end) points to nothing, which means the other side closed the unix domain socket but the docker daemon did not.
These sockets also take up file descriptors (as expected):
ls -l /proc/$(pidof dockerd)/fd
…
lrwx------ 1 root root 64 May 31 08:37 33 -> socket:[272217]
…
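If you want to watch the leak grow over time, here is a minimal sketch of mine (not part of the original report) that counts the daemon's open fds via /proc; it assumes pidof is available and that it runs as root:

import os
import subprocess

# Count dockerd's open fds and how many of them are sockets.
# Reading /proc/<pid>/fd of the daemon needs root.
pid = subprocess.check_output(['pidof', 'dockerd']).split()[0].decode()
fd_dir = '/proc/%s/fd' % pid
links = [os.readlink(os.path.join(fd_dir, fd)) for fd in os.listdir(fd_dir)]
sockets = [l for l in links if l.startswith('socket:')]
print('%d fds total, %d of them sockets' % (len(links), len(sockets)))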
An alive connection looks like this (the peer info is present):
u_str ESTAB 0 0 /var/run/docker.sock 280489 * 279381
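Based purely on the ss output format shown above (the last field is the peer inode, and 0 means no peer), here is a small sketch to list the dead connections programmatically:

import subprocess

# List docker.sock connections with no peer, relying only on the ss
# output format shown above (last field is the peer inode; 0 means
# the other side is gone).
out = subprocess.check_output(['ss', '-x']).decode()
dead = [line for line in out.splitlines()
        if '/var/run/docker.sock' in line and line.split()[-1] == '0']
print('%d dead docker.sock connections' % len(dead))
for line in dead:
    print(line)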
I'm now going to show you a convoluted way to cause this resource leak, but keep in mind that dockerd still shouldn't leak resources, and that in the wild there might be similar code that really needs to do something like this…
This also seems to happen on a variety of Docker versions, from old to new.
Steps to reproduce the issue:
- docker create -it --rm python:2.7 python -c "while True: print 1;"
  (prints <container id>)
- docker start <container id>
  (echoes <container id> back)
- python2 (run the following in the interactive interpreter):
import socket
# connect to the docker daemon's unix socket directly
s = socket.socket(socket.AF_UNIX)
s.connect('/var/run/docker.sock')
# issue an attach request by hand
s.send('POST /v1.24/containers/<container id>/attach?logs=1&stream=0&stdout=1 HTTP/1.1\r\nHost: localhost\r\n\r\n')
# read part of the response, then close our side of the connection
s.recv(1024)
s.close()
You can also try replacing logs=1&stream=0 with logs=0&stream=1. I don't think the v1.24 is of particular importance. You can also kill the python process instead of calling s.close() (sometimes it behaves differently).
- ss | grep docker.sock
  u_str ESTAB 0 0 /var/run/docker.sock 280489 * 0
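For convenience, here is a rough script that automates the steps above end to end; it assumes the docker CLI, ss, and root access to /var/run/docker.sock, and uses the same attach request as the step above:

import socket
import subprocess

# 1. create and start a container that prints forever (same as above)
cid = subprocess.check_output(
    ['docker', 'create', '-it', '--rm', 'python:2.7',
     'python', '-c', 'while True: print 1']).strip().decode()
subprocess.check_call(['docker', 'start', cid])

# 2. attach over the unix socket by hand, read a bit, then hang up
s = socket.socket(socket.AF_UNIX)
s.connect('/var/run/docker.sock')
req = ('POST /v1.24/containers/%s/attach?logs=1&stream=0&stdout=1 HTTP/1.1\r\n'
       'Host: localhost\r\n\r\n' % cid)
s.send(req.encode())
s.recv(1024)
s.close()

# 3. look for a docker.sock entry with no peer ("* 0")
# (note: grep exits non-zero if nothing matches)
print(subprocess.check_output('ss -x | grep docker.sock', shell=True).decode())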
Describe the results you received:
There is a dead unclosed unix domain socket on the docker daemon side.
Describe the results you expected:
The docker daemon should notice that the socket has been closed on the other side and close its own side as well.
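For comparison, this is the usual way a unix-socket server notices a half-closed peer: once the client closes, recv() returns an empty string on the server side, and that is the cue to close the server's fd too. A minimal standalone sketch (hypothetical /tmp path, not Docker code):

import os
import socket

PATH = '/tmp/leak-demo.sock'   # hypothetical path, just for the demo

if os.path.exists(PATH):
    os.unlink(PATH)

srv = socket.socket(socket.AF_UNIX)
srv.bind(PATH)
srv.listen(1)

conn, _ = srv.accept()
while True:
    data = conn.recv(4096)
    if not data:       # peer closed its end: recv() returns ''
        conn.close()   # ...so close ours too, otherwise the fd leaks
        break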
Versions
Tried on multiple versions; currently both client and server are 17.12.1-ce, on Ubuntu 16.04.4 with kernel 4.13.0-36-generic and default settings.
About this issue
- Original URL
- State: open
- Created 6 years ago
- Comments: 38 (23 by maintainers)
dockerd version: 18.09.1
containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
Found fds leaking on a running server; the stack dump shows all goroutines waiting for containers to stop, yet most of the containers being waited on are already stopped.
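For anyone trying to collect the same evidence: sending SIGUSR1 asks dockerd to dump its goroutine stacks (as mentioned at the top of this issue); check the daemon log for where the dump ends up. A trivial sketch, assuming pidof and root:

import os
import signal
import subprocess

# Ask dockerd to dump its goroutine stacks. A wedged daemon may not
# react at all, as described at the top of this issue.
pid = int(subprocess.check_output(['pidof', 'dockerd']).split()[0])
os.kill(pid, signal.SIGUSR1)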
Server Version: 19.03.6
containerd version: b34a5c8af56e510852c35414db4c1f4fa6172339
runc version: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
init version: fec3683
After reproducing, the daemon hangs when trying to force-remove the container or restart the docker service.
Only fixed by
As the stack shows, a deadlock has occurred.
Two newly created goroutines are trying to lock (0xc000dc9000).
This goroutine already existed before the reproduction; the address (0xc000dc9000) is used in pool.Copy.
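Not the actual moby code, just a minimal Python analogue of the pattern I read from those stacks: one worker blocks forever on a copy while holding the lock, so everything that later needs the same lock (force remove, restart) piles up behind it:

import threading
import time

lock = threading.Lock()            # stands in for the mutex at 0xc000dc9000

def holder():
    with lock:                     # held by the pre-existing goroutine
        time.sleep(3600)           # stands in for a Copy that never returns

def waiter(name):
    with lock:                     # the two new goroutines trying to lock
        print('%s got the lock' % name)

threading.Thread(target=holder).start()
time.sleep(0.1)
for name in ('force remove', 'restart'):
    threading.Thread(target=waiter, args=(name,)).start()
# the waiters never print anything: the script hangs, much like the daemon does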
@dannyk81 it looks like https://github.com/moby/moby/issues/37182#issuecomment-509681877 covers most of what I was about to say; hope it helps, and please feel free to follow up with any questions. That repo's "overlay-runner" does not suffer from this defect on any version of Docker.