fluent-logger-golang: Docker daemon dies when the fluentd driver cannot send messages to a saturated fluentd server
We run a very chatty 25-node dockerised Elasticsearch cluster.
We use the fluentd log driver for the Elasticsearch containers.
Saturation/slowness on the fluentd server (which stores the data on a standalone Elasticsearch node in logstash format for viewing with Kibana) leads to panics and the death of random Elasticsearch containers.
From /var/log/syslog:
Mar 4 01:41:02 es-100-pem-300-data-1 docker[21750]: panic: runtime error: invalid memory address or nil pointer dereference
Mar 4 01:41:02 es-100-pem-300-data-1 docker[21750]: [signal 0xb code=0x1 addr=0x0 pc=0x70808c]
Mar 4 01:41:02 es-100-pem-300-data-1 docker[21750]: goroutine 125885 [running]:
Mar 4 01:41:02 es-100-pem-300-data-1 docker[21750]: net.(*TCPConn).Write(0x0, 0xc8235a4000, 0x11857, 0x30000, 0x0, 0x0, 0x0)
Mar 4 01:41:02 es-100-pem-300-data-1 docker[21750]: <autogenerated>:75 +0x2c
Mar 4 01:41:02 es-100-pem-300-data-1 docker[21750]: github.com/fluent/fluent-logger-golang/fluent.(*Fluent).send(0xc820eecb00, 0x0, 0x0)
Mar 4 01:41:02 es-100-pem-300-data-1 docker[21750]: /usr/src/docker/vendor/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:242 +0x1d6
Mar 4 01:41:02 es-100-pem-300-data-1 docker[21750]: github.com/fluent/fluent-logger-golang/fluent.(*Fluent).PostRawData(0xc820eecb00, 0xc8230d6000, 0x118, 0x146)
Mar 4 01:41:02 es-100-pem-300-data-1 docker[21750]: /usr/src/docker/vendor/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:154 +0x171
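Reading the trace, the receiver passed to net.(*TCPConn).Write is 0x0, which suggests fluent.(*Fluent).send ends up writing to a connection that is nil (presumably after a failed connect/reconnect to the saturated fluentd). As a minimal, hypothetical sketch of that pattern and the guard I would expect — all field and function names here are assumptions, not the library's actual code:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// Hypothetical stand-in for fluent.Fluent: it keeps a single TCP connection
// to fluentd and reuses it across sends. Field names are assumptions.
type fluentLogger struct {
	addr string
	conn net.Conn // nil until connect() succeeds, and again after a failed reconnect
}

func (f *fluentLogger) connect() error {
	c, err := net.DialTimeout("tcp", f.addr, 3*time.Second)
	if err != nil {
		f.conn = nil // a saturated/unreachable fluentd leaves the connection nil
		return err
	}
	f.conn = c
	return nil
}

// send mirrors the call site in the stack trace: without the nil check,
// writing to a nil connection is exactly the "invalid memory address or
// nil pointer dereference" panic that takes the whole daemon down.
func (f *fluentLogger) send(buf []byte) error {
	if f.conn == nil {
		if err := f.connect(); err != nil {
			return fmt.Errorf("fluent: not connected: %v", err)
		}
	}
	if _, err := f.conn.Write(buf); err != nil {
		f.conn = nil // force a reconnect on the next send instead of reusing a dead conn
		return err
	}
	return nil
}

func main() {
	f := &fluentLogger{addr: "127.0.0.1:24224"} // assumed fluentd forward port
	if err := f.send([]byte("hello")); err != nil {
		fmt.Println("send failed (daemon keeps running):", err)
	}
}
```

The point is simply that a dead or missing connection should surface as a send error, not as a panic inside the Docker daemon.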
From the fluentd forwarder log:
2016-03-04 00:54:49 +0000 [warn]: emit transaction failed: error_class=Fluent::BufferQueueLimitError error="queue size exceeds limit" tag="docker.49c656b5827d"
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/buffer.rb:198:in `block in emit'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/2.2.0/monitor.rb:211:in `mon_synchronize'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/buffer.rb:187:in `emit'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/output.rb:448:in `emit'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/event_router.rb:88:in `emit_stream'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/event_router.rb:79:in `emit'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:186:in `on_message'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:276:in `call'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:276:in `block in on_read_msgpack'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:275:in `feed_each'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:275:in `on_read_msgpack'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:261:in `call'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:261:in `on_read'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/cool.io-1.4.3/lib/cool.io/io.rb:123:in `on_readable'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/cool.io-1.4.3/lib/cool.io/io.rb:186:in `on_readable'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/cool.io-1.4.3/lib/cool.io/loop.rb:88:in `run_once'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/cool.io-1.4.3/lib/cool.io/loop.rb:88:in `run'
2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:98:in `run'
2016-03-04 00:54:49 +0000 [error]: forward error error=#<Fluent::BufferQueueLimitError: queue size exceeds limit> error_class=Fluent::BufferQueueLimitError
2016-03-04 00:54:49 +0000 [error]: suppressed same stacktrace
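For completeness, the forwarder side shows why the connection degrades in the first place: in_forward hits Fluent::BufferQueueLimitError, fluentd pushes back, and the Docker-side logger keeps reconnecting until it trips the nil-connection path above. What I would hope for from the log driver under that back-pressure is to buffer locally and eventually drop, rather than take the daemon down. A rough sketch of that idea in Go (illustrative names only, not Docker's implementation):

```go
package main

import (
	"fmt"
	"time"
)

// boundedLogQueue decouples container writes from the fluentd connection:
// if the remote is saturated, messages are dropped (and counted) instead of
// blocking the caller or crashing the process.
type boundedLogQueue struct {
	ch      chan []byte
	dropped int
}

func newBoundedLogQueue(size int) *boundedLogQueue {
	return &boundedLogQueue{ch: make(chan []byte, size)}
}

// Enqueue never blocks: when the buffer is full (fluentd is not draining it),
// the message is dropped and accounted for.
func (q *boundedLogQueue) Enqueue(msg []byte) {
	select {
	case q.ch <- msg:
	default:
		q.dropped++
	}
}

// drain simulates the sender goroutine that forwards to fluentd; here it just
// sleeps to stand in for a slow/saturated remote.
func (q *boundedLogQueue) drain() {
	for msg := range q.ch {
		time.Sleep(10 * time.Millisecond) // pretend fluentd is slow
		_ = msg                           // would be written to the fluentd connection here
	}
}

func main() {
	q := newBoundedLogQueue(100)
	go q.drain()
	for i := 0; i < 1000; i++ {
		q.Enqueue([]byte(fmt.Sprintf("log line %d", i)))
	}
	fmt.Println("dropped under back-pressure:", q.dropped)
}
```

Dropping logs during a spike is obviously not ideal, but it is far preferable to losing Elasticsearch data nodes.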
Let me know if you need the full stack trace or more details. We are really eager to fix this.
Even if I provision a bigger fluentd server and standalone Elasticsearch node, I want to make sure we will not start losing data nodes due to panics during log spikes.
Related docker ticket: https://github.com/docker/docker/issues/20960
About this issue
- State: closed
- Created 8 years ago
- Comments: 31 (13 by maintainers)
I have synced with the R&D team and we plan to use Logstash with the gelf Docker logging driver as an interim solution (long term is the kafka driver, https://github.com/docker/docker/issues/18604).
So I will stick with the gelf driver for now.
I hope you or another contributor will find time to fix this production-critical bug.
Otherwise people will go the same route I did.