moby: docker events 'hangs' / stops working on unix socket (fifo)

[UPDATE] This problem occurs when reading from a FIFO mounted inside the docker container. For some reason the unix pipe is not working; it does work with a plain txt file / log file.

I am piping the output of the docker events command into a while read … loop, via a FIFO (named pipe). docker events is meant to print a new line of info to stdout whenever a container starts.

My code is here:

https://github.com/dreamcat4/docker-images/blob/master/pipework/entrypoint.sh#L368
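
For reference, the pattern is roughly like this (a minimal sketch only, not the exact code from my entrypoint script; the FIFO path here is just a placeholder):

# Minimal sketch of the docker events -> FIFO -> while read pattern.
# "/tmp/docker_events" is a hypothetical path, not the one used in entrypoint.sh.
pipe="/tmp/docker_events"
mkfifo "$pipe"

docker events > "$pipe" &
events_pid="$!"

while read -r event_line; do
    # React to each event line here (e.g. look for container "start" events).
    echo "got event: $event_line"
done < "$pipe"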

To begin with, it works as it should. If I start one container, the while loop is triggered once. Soon after that, I may restart the container (to generate a second start event), and docker events prints out another line, triggering my while read … loop correctly once again. That is great. Well done.

However, if I go away and leave the docker containers alone for “a while” (such as overnight, or just an hour, or perhaps less than an hour), then come back and start a container, nothing is received on the unix pipe. It seems almost as if docker events is hung, or is no longer flushing its stdout, or else has not printed the newline \n needed to trigger the next read.

If I kill the $PID of the docker events process with a regular kill (which sends SIGTERM), then sometimes that prompts the hung docker events process to finally spit out whatever missing line(s) it was holding, like a flush, presumably during its cleanup, before quitting. Other times it prints nothing and just dies / goes away. Sorry, that sounds very vague, but it is difficult to understand this inconsistent behaviour.

My main point is that I really need some help here, because I am not sure how to make progress from this point and debug the problem further. It is an important bug for me to solve, one way or another (and without dropping the functionality!).

Are there any new tests I can run to catch it red-handed? What else can I do? Perhaps write a small test script (not my ‘real’ code) with a loop that prints a ‘SUCCESS’ / ‘FAIL’ record, and a constantly restarting container in another loop, e.g. with the date / elapsed time recorded? A sketch of that idea is shown below.
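
Something along these lines is what I have in mind (purely a hypothetical sketch; the container name, file paths, and sleep interval are placeholders):

#!/bin/sh
# Hypothetical test harness: restart a container on a timer and log whether
# the FIFO reader actually received each start event.
pipe="/tmp/events_test.fifo"
log="/tmp/events_test.log"
mkfifo "$pipe"

docker events > "$pipe" &

# Reader loop: record every line that arrives through the FIFO, with a timestamp.
( while read -r line; do
    echo "$(date) SUCCESS received: $line" >> "$log"
  done < "$pipe" ) &

# Writer loop: keep restarting a test container so start events are generated.
while true; do
    docker restart my-test-container > /dev/null
    echo "$(date) restarted container" >> "$log"
    sleep 3600   # wait long enough to hit the reported time window
done

Comparing the "restarted container" timestamps against the "SUCCESS received" lines in the log would show exactly when events stop coming through.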

How else can I get more info about the bug? What about running strace on docker events? Or how about attaching gdb and getting a live stack trace when it stops responding? What should be homed in on first of all?
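
For the strace idea, attaching to the already-running docker events process would look something like this (the pgrep lookup is just one way of finding the PID):

# Attach strace to the running `docker events` client and log its syscalls.
pid="$(pgrep -f 'docker events' | head -n1)"
sudo strace -f -tt -p "$pid" -o /tmp/docker-events.strace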

$ docker info
Containers: 4
Images: 70
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 78
Execution Driver: native-0.2
Kernel Version: 3.16.0-33-generic
Operating System: Ubuntu 14.10
CPUs: 2
Total Memory: 3.792 GiB

About this issue

  • State: closed
  • Created 9 years ago
  • Comments: 25 (8 by maintainers)

Most upvoted comments

I am guessing the relevant place in the docker source code is here:

https://github.com/docker/docker/blob/v1.6.0-rc5/api/client/commands.go#L1877

Sorry, I don’t understand that code once it goes into the cli.stream function. That is below the surface, probably relying on whatever standard Go I/O libraries to achieve it.

In addition, when I add a unix pipe on the command line (to the end of that chain), somehow we end up in trouble and something breaks, as per the strace output, where the futex(0x134e1b8, FUTEX_WAIT, 0, NULL line never finishes printing.

I guess this is the relevant part of the (new) strace output:

socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, 23) = 0
epoll_create1(EPOLL_CLOEXEC) = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1040633704, u64=140442176245608}}) = 0
getsockname(3, {sa_family=AF_LOCAL, NULL}, [2]) = 0
getpeername(3, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
read(3, 0xc2080cd000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
write(3, "GET /v1.18/events HTTP/1.1\r\nHost"..., 95) = 95
epoll_wait(4, {{EPOLLIN|EPOLLOUT, {u32=1040633704, u64=140442176245608}}}, 128, 0) = 1
read(3, "HTTP/1.1 200 OK\r\nContent-Type: a"..., 4096) = 116
read(3, 0xc2080cd000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(4, {}, 128, 0) = 0
epoll_wait(4, {{EPOLLIN|EPOLLOUT, {u32=1040633704, u64=140442176245608}}}, 128, -1) = 1
futex(0x134d480, FUTEX_WAKE, 1) = 1
read(3, "96\r\n{\"status\":\"start\",\"id\":\"fbc9"..., 4096) = 156
write(1, "2015-04-09T11:12:07.000000000+01"..., 36) = 36
write(1, "fbc922fbb30bb7c8eaf84cb87daec1a8"..., 66) = 66
write(1, "(from dreamcat4/tvheadend.debug:"..., 40) = 40
write(1, "start\n", 6) = 6
futex(0x134d480, FUTEX_WAKE, 1) = 1
futex(0x134d400, FUTEX_WAKE, 1) = 1
read(3, 0xc2080cd000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(4, {}, 128, 0) = 0
futex(0x134e1b8, FUTEX_WAIT, 0, NULL <---------- hangs here
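
Since that futex wait is a Go runtime primitive, one thing I could try (this assumes the docker client is an ordinary Go binary that has not installed its own SIGQUIT handler) is to send SIGQUIT to the hung client, which by default makes the Go runtime dump all goroutine stacks before exiting:

# Hypothetical debugging step: dump goroutine stacks of the hung client.
# Assumes default Go signal handling (no custom SIGQUIT handler installed).
pid="$(pgrep -f 'docker events' | head -n1)"
kill -QUIT "$pid"
# The goroutine stack traces are printed to the client's stderr.

That would show which goroutine is actually blocked when the hang happens, rather than just the futex syscall visible from strace.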