moby: Container hangs when using fluentd logger

Description: The Docker container hangs when using the fluentd logger with fluentd-async-connect=true and an unreachable (turned off) fluentd server.

Steps to reproduce the issue:

  1. Create a docker-compose.yml with any container and its logging section set to:
logging:
      driver: fluentd
      options:
        fluentd-address: 10.147.49.79:24224
        fluentd-async-connect: 'true'
        fluentd-retry-wait: '1s'
        fluentd-max-retries: '30'
        tag: your_app
  2. Run docker-compose up --build -d

Describe the results you received: The Docker container hangs and does not respond to docker-compose down -f, docker stop, or docker kill.

Describe the results you expected: The container exits with an error message.

Additional information you deem important (e.g. issue happens only occasionally): Issue happens every time the remote fluentd server is down or unreachable.

Output of docker version:

Client: Docker Engine - Community
 Version:           19.03.2
 API version:       1.40
 Go version:        go1.12.8
 Git commit:        6a30dfc
 Built:             Thu Aug 29 05:28:55 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.2
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.8
  Git commit:       6a30dfc
  Built:            Thu Aug 29 05:27:34 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.6
  GitCommit:        894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc:
  Version:          1.0.0-rc8
  GitCommit:        425e105d5a03fabd737a126ad93d62a9eeede87f
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 30
  Running: 29
  Paused: 0
  Stopped: 1
 Images: 47
 Server Version: 19.03.2
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-957.27.2.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 64
 Total Memory: 125.3GiB
 Name: server55
 ID: SHMQ:YUKH:MZOS:EEYR:WZI6:JYVH:NPPJ:CEEQ:ILXV:6JNU:BVBG:2NTE
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

Additional environment details (AWS, VirtualBox, physical, etc.):
physical

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 30
  • Comments: 35 (11 by maintainers)

Most upvoted comments

+1 on this issue

I have the same problem too

Anything new on this? Thanks

+1 When will this be resolved? It looks like the PR has not been merged yet.

Is there a plan to bump the fluent-logger-golang version to 1.5.0 so that this issue is resolved in Docker version 19.03? Or is there any workaround to make this work in Docker 19.03?

I was going to open a new issue to post my full bug report, but I think we are experiencing the same problem, so I'm going to post it here.

Some weeks ago, a teammate experienced Docker freezes when using the fluentd logger while fluentd was down at the time the containers were created, despite the mode=non-blocking and fluentd-async-connect=true options, with Docker v19.03. To reproduce the bug you have to:

# Don't start any fluentd server
$ docker run --name repro -d -p 80:80 --log-driver fluentd --log-opt=fluentd-async-connect=true nginx

$ curl http://localhost
# This will generate some logs

$ docker stop repro
# Hangs indefinitely

After generating and analysing the goroutine stack traces, it appears that there are some goroutines blocked in the semacquire state, in the fluent-logger-golang package:

goroutine 1908 [semacquire, 23 minutes]:
sync.runtime_Semacquire(0xc005964560)
	/usr/lib/go/src/runtime/sema.go:56 +0x44
sync.(*WaitGroup).Wait(0xc005964558)
	/usr/lib/go/src/sync/waitgroup.go:130 +0x66
github.com/docker/docker/vendor/github.com/fluent/fluent-logger-golang/fluent.(*Fluent).Close(0xc0059644d0, 0xc006326140, 0x55618f040db0)
	/build/docker/src/src/github.com/docker/docker/vendor/github.com/fluent/fluent-logger-golang/fluent/fluent.go:297 +0x6a
github.com/docker/docker/daemon/logger/fluentd.(*fluentd).Close(0xc0063260c0, 0x55618f040db0, 0x0)
	/build/docker/src/src/github.com/docker/docker/daemon/logger/fluentd/fluentd.go:181 +0x31
github.com/docker/docker/daemon/logger.(*RingLogger).Close(0xc000e57a40, 0xc005880e70, 0x2)
	/build/docker/src/src/github.com/docker/docker/daemon/logger/ring.go:103 +0x3fc
github.com/docker/docker/container.(*Container).Reset(0xc0053cdd40, 0x55618deb6d00)
	/build/docker/src/src/github.com/docker/docker/container/monitor.go:42 +0x109
github.com/docker/docker/daemon.(*Daemon).ProcessEvent(0xc0008a01e0, 0xc000d69380, 0x40, 0x55618cfd1515, 0x4, 0xc000d69380, 0x40, 0xc000d693c0, 0x40, 0x1000318eb, ...)
	/build/docker/src/src/github.com/docker/docker/daemon/monitor.go:61 +0x4d1
github.com/docker/docker/libcontainerd/remote.(*client).processEvent.func1()
	/build/docker/src/src/github.com/docker/docker/libcontainerd/remote/client.go:634 +0x109
github.com/docker/docker/libcontainerd/queue.(*Queue).Append.func1(0xc000d69300, 0x0, 0xc0007da090, 0xc000cb8ea0, 0xc000765d80, 0xc000d69380, 0x40)
	/build/docker/src/src/github.com/docker/docker/libcontainerd/queue/queue.go:28 +0x3a
created by github.com/docker/docker/libcontainerd/queue.(*Queue).Append
	/build/docker/src/src/github.com/docker/docker/libcontainerd/queue/queue.go:24 +0x1cf

After a more thorough analysis, I found out that:

In 688e67e1d3, @thaJeztah bumped fluent/fluent-logger-golang from v1.3.0 to v1.4.0. Looking at the diff between v1.3.0 and v1.4.0, it appears that the (*Fluent).Close() method changed: there was no call to wg.Wait() in previous versions.

Since we run with the async option, the wait group is incremented at this line. The HTTP call shown above triggers that select, and even though the channel f.pending is closed there, because the first log already triggered the select, the goroutine is still blocked on its call to f.write(), which keeps retrying to connect to the fluentd server (using an exponential back-off strategy) for as long as it is down. This is the reason for the freezes we experience.
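
To make the hang concrete, here is a heavily simplified Go sketch of the pattern described above. It is not the actual fluent-logger-golang code; pendingLogger, connectAndWrite and send are illustrative names. Only the shape of the bug matters: Close() closes the pending channel and then waits on the wait group, while the worker goroutine is stuck inside an uninterruptible connect-and-retry loop, so wg.Wait() never returns and the daemon's shutdown path hangs with it.

package main

import (
	"errors"
	"sync"
	"time"
)

// pendingLogger is a made-up, stripped-down stand-in for the async logger:
// one background goroutine drains pending and writes each entry upstream.
type pendingLogger struct {
	pending chan []byte
	wg      sync.WaitGroup
}

func newPendingLogger() *pendingLogger {
	l := &pendingLogger{pending: make(chan []byte, 1000)}
	l.wg.Add(1)
	go func() {
		defer l.wg.Done()
		for msg := range l.pending {
			l.send(msg) // blocks "forever" while the server is unreachable
		}
	}()
	return l
}

// send mimics the lazy connect with exponential back-off: it never gives up
// and never consults a stop signal, so it cannot be interrupted.
func (l *pendingLogger) send(msg []byte) {
	delay := time.Second
	for connectAndWrite(msg) != nil {
		time.Sleep(delay)
		if delay < time.Minute {
			delay *= 2
		}
	}
}

// connectAndWrite stands in for dialing fluentd and writing one record;
// it always fails here, as if the server were turned off.
func connectAndWrite(msg []byte) error {
	return errors.New("fluentd is down")
}

// Close is the v1.4.0-style shutdown: close the channel, then wait for the
// worker. Because the worker is stuck inside send(), wg.Wait() never returns,
// and the caller (dockerd's container monitor) hangs with it.
func (l *pendingLogger) Close() {
	close(l.pending)
	l.wg.Wait() // <- blocks forever: this is the hang visible in the stack trace
}

func main() {
	l := newPendingLogger()
	l.pending <- []byte("hello") // the first log kicks the worker into send()
	time.Sleep(100 * time.Millisecond)
	l.Close() // hangs, just like docker stop in the report
}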

Finally, we had to downgrade to v18.09 to get a working combination of Docker and the fluentd logger.

EDIT: Actually, it looks like this PR on fluent-logger-golang and PR #40293 on this repo are trying to solve this exact issue.

When can we expect this issue to be fixed? It is a blocker for me.

Yes, it will require a new docker release. I still feel like the solution provided in fluentd is not great. I’m not sure how we can expose the new functionality w/o requiring the user to explicitly set the option… which also means no backport to 19.03.

Alternatively someone could make a plugin that uses the latest fluentd release.

Also note that the version of fluentd has not been updated here, and this option is not exposed in any way yet.

Agreed that it will not be backward compatible. However, there is currently no solution or workaround in sight. Even simple commands like stopping containers lead to a hung state. Please consider prioritising this for the major release.

How do you create a plugin with the latest fluentd for Docker 19.03? Would this mean that we don't need a new Docker release?

The only solution that works for me in this case is:

systemctl stop docker
mount | grep /containers/ | cut -d' ' -f3 | xargs umount   # unmount leftover container mounts (field 3 of the mount output is the mountpoint)
rm -rf /var/lib/docker/containers                          # WARNING: this discards all containers' metadata and logs
systemctl start docker

Then start the containers as usual.

Any updates or workarounds, guys? I don't think I can wait until 19.04, and neither can my penguin 🐧

Thanks for the update @akerouanton, I can see that the tests are now passing thanks to the changes you made to CI.

PR under review: https://github.com/fluent/fluent-logger-golang/pull/82

@NiR How do I swap the broken fluentd logging driver currently packaged in Docker for your v0.1 of akerouanton/fluentd-async-logger? I'm hoping Docker will fix this soon, but until then I'm happy to use your fix.

I added some test cases (TestCloseOnFailingAsyncConnect and TestCloseOnFailingAsyncReconnect) to fluent-logger-golang. My two PRs are still in review, but I published a v0.1 of akerouanton/fluentd-async-logger based on my fixes. Feel free to use it if you face the issue described here.
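
As a rough illustration of what a test along those lines can assert, here is a sketch against the public fluent-logger-golang API (this is not the actual test code; the port, tag and timeout are arbitrary choices): point an async logger at an address nothing listens on, post one record, and require that Close() returns within a bounded time instead of hanging.

package fluent_test

import (
	"testing"
	"time"

	"github.com/fluent/fluent-logger-golang/fluent"
)

// Illustrative sketch only: an async logger pointed at an unreachable fluentd
// must still be closeable. On the buggy versions, Close() blocks forever and
// this test fails with a timeout.
func TestCloseOnFailingAsyncConnectSketch(t *testing.T) {
	logger, err := fluent.New(fluent.Config{
		FluentHost: "127.0.0.1",
		FluentPort: 24199, // assumed free port: nothing is listening here
		Async:      true,  // connect lazily, in the background
	})
	if err != nil {
		t.Fatalf("fluent.New: %v", err)
	}

	// Posting one record forces the background goroutine into its
	// connect-with-back-off loop, reproducing the situation in this issue.
	_ = logger.Post("test.tag", map[string]string{"message": "hello"})

	done := make(chan struct{})
	go func() {
		logger.Close()
		close(done)
	}()

	select {
	case <-done:
		// Close() returned promptly: shutdown is no longer blocked.
	case <-time.After(5 * time.Second):
		t.Fatal("Close() still blocked after 5s: async shutdown hangs")
	}
}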

I’m currently working on a logging driver based on the fixed version of fluent-logger-golang and I’m wondering if ForceStopAsyncSend should be set to true whenever RequestAck is false (as this is already somehow encoding delivery expectations).

I think the behaviour described here is to be expected in the case of RequestAck=true + Async=true whenever the fluentd server is down. On the other hand, with RequestAck=false (the default) and Async=true, people actually expect the logger to stop as soon as possible.

Moreover, it means no additional option has to be exposed to Docker users, so, I guess, this could be backported to v19.03. If these assumptions seem reasonable to the Docker maintainers, I'm willing to send a patch.

EDIT: Actually, ForceStopAsyncSend does not work, because it is consumed by a select that runs after the select that tries to write the log (and the write method is the one using an exponential back-off retry to lazily and asynchronously connect to fluentd). I'm going to send a patch to fluent-logger-golang to get it working properly.
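
A minimal sketch of the select-ordering problem described in this edit, assuming a simplified worker loop (runBroken, runFixed, connectAndWrite and the stop channel are illustrative names, not the library's actual code): when the stop signal is only consulted between messages, a write that is already blocked in its back-off loop can never observe it; the check has to live inside the retry loop itself for a ForceStopAsyncSend-style flag to take effect.

package sketch

import (
	"errors"
	"time"
)

// connectAndWrite stands in for dialing fluentd and writing one record;
// it always fails here, as if the server were unreachable.
func connectAndWrite(msg []byte) error { return errors.New("fluentd is down") }

// Broken shape: the stop channel is only checked between messages. If the
// inner write is already stuck reconnecting to a dead fluentd, closing the
// stop channel is never observed, so Close() keeps hanging.
func runBroken(pending <-chan []byte, stop <-chan struct{}) {
	for msg := range pending {
		// write with back-off, never looking at stop
		delay := time.Second
		for connectAndWrite(msg) != nil {
			time.Sleep(delay)
			delay *= 2
		}
		select {
		case <-stop: // too late: only reached once the write finally returns
			return
		default:
		}
	}
}

// Fixed shape: the retry loop itself selects on the stop channel, so an
// in-flight write can be abandoned as soon as shutdown is requested.
func runFixed(pending <-chan []byte, stop <-chan struct{}) {
	for msg := range pending {
		delay := time.Second
		for connectAndWrite(msg) != nil {
			select {
			case <-stop:
				return // drop remaining logs and let Close() return
			case <-time.After(delay):
				delay *= 2 // keep the exponential back-off between attempts
			}
		}
	}
}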

I’ve commented on that issue. Sorry for missing it.