fluentd: fluentd is stuck in a loop when overflow_action block occurs
Describe the bug
I am running fluentd in a Kubernetes cluster with one forward input plugin and two buffered output plugins. When I set overflow_action block on the logging output plugin, fluentd gets stuck in a loop and never recovers.
To Reproduce
This is my config:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-log-aggregator-config
  namespace: kube-system
data:
  fluent.conf: |
    <source>
      @type forward
      port 24225
      @log_level debug
    </source>
    <source>
      @type monitor_agent
      bind 0.0.0.0
      port 24220
    </source>
    <match **>
      @type copy
      <store ignore_error>
        @type logging
        @log_level debug
        url xxxxxx
        <buffer>
          @type memory
          chunk_limit_size 8m
          queued_chunks_limit_size 256
          flush_thread_count 5
          flush_interval 10s
          flush_mode interval
          retry_max_times 8
          retry_wait 1s
          retry_max_interval 10s
          overflow_action block
        </buffer>
      </store>
      <store ignore_error>
        @type metrics_collector
        publish_metrics_interval 30
        flush_interval 20s
        monitor logging
      </store>
    </match>
```
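For context on why writes start failing with this config: fluentd's buffer refuses new data once its total size limit is reached, and under overflow_action block the writer then waits for the buffer to become storable again. A minimal sketch of that bookkeeping follows; `TinyBuffer` is hypothetical, not the fluentd source, and the 512 MB default total_limit_size for the memory buffer is an assumption based on the buffer docs.

```ruby
# Hypothetical TinyBuffer illustrating fluentd's buffer-full bookkeeping
# (not the real source). Assumption: the memory buffer's default
# total_limit_size is 512 MB per the buffer documentation.
class TinyBuffer
  def initialize(total_limit_size = 512 * 1024 * 1024)
    @total_limit_size = total_limit_size
    @size = 0 # staged + queued bytes
  end

  # Roughly Buffer#storable?: is there room for more data?
  def storable?
    @size < @total_limit_size
  end

  # Raises on overflow; overflow_action decides what the caller does
  # next (:block means wait for storable? and retry the write).
  def write(bytes)
    raise "buffer overflow" unless storable?
    @size += bytes
  end

  # Called when a flush succeeds. If flushes never succeed, storable?
  # stays false and blocked writers never resume.
  def purge(bytes)
    @size -= bytes
  end
end
```

The key point is that only `purge` (a successful flush) can make `storable?` true again, which is why a stalled downstream leaves blocked writers waiting indefinitely.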
This is the error I get:

```
2019-08-19 15:38:46 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=59.66392836498562 slow_flush_log_threshold=20.0 plugin_id="object:3ffff7d7e114"
2019-08-19 15:39:20 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=33.92090093600564 slow_flush_log_threshold=20.0 plugin_id="object:3ffff7d7e114"
2019-08-19 15:39:27 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:block
2019-08-19 15:39:27 +0000 [debug]: #0 buffer.write is now blocking
2019-08-19 15:39:27 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:block
2019-08-19 15:39:27 +0000 [debug]: #0 buffer.write is now blocking
2019-08-19 15:39:58 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:block
2019-08-19 15:39:58 +0000 [debug]: #0 buffer.write is now blocking
```
When I enable `@log_level trace`, I get the following logs:
```
2019-08-19 17:06:33 +0000 [trace]: #0 adding metadata instance=70368625388440 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=nil, variables=nil>
2019-08-19 17:06:33 +0000 [trace]: #0 adding metadata instance=70368625388440 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=nil, variables=nil>
2019-08-19 17:06:33 +0000 [trace]: #0 writing events into buffer instance=70368625388440 metadata_size=1
2019-08-19 17:06:33 +0000 [trace]: #0 writing events into buffer instance=70368625388440 metadata_size=1
2019-08-19 17:06:34 +0000 [trace]: #0 enqueueing all chunks in buffer instance=70368625388440
2019-08-19 17:06:34 +0000 [trace]: #0 enqueueing chunk instance=70368625388440 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=nil, variables=nil>
2019-08-19 17:06:34 +0000 [trace]: #0 adding metadata instance=70368625388440 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=nil, variables=nil>
2019-08-19 17:06:34 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:block
2019-08-19 17:06:34 +0000 [debug]: #0 buffer.write is now blocking
2019-08-19 17:06:34 +0000 [trace]: #0 sleeping until buffer can store more data
2019-08-19 17:06:35 +0000 [trace]: #0 adding metadata instance=70368625388440 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=nil, variables=nil>
2019-08-19 17:06:35 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:block
2019-08-19 17:06:35 +0000 [debug]: #0 buffer.write is now blocking
2019-08-19 17:06:35 +0000 [trace]: #0 sleeping until buffer can store more data
2019-08-19 17:06:35 +0000 [trace]: #0 enqueueing all chunks in buffer instance=70368625388440
2019-08-19 17:06:35 +0000 [trace]: #0 sleeping until buffer can store more data
2019-08-19 17:06:36 +0000 [trace]: #0 sleeping until buffer can store more data
...
2019-08-19 17:09:51 +0000 [trace]: #0 enqueueing all chunks in buffer instance=70368625388440
2019-08-19 17:09:51 +0000 [trace]: #0 sleeping until buffer can store more data
2019-08-19 17:09:52 +0000 [trace]: #0 sleeping until buffer can store more data
2019-08-19 17:09:52 +0000 [trace]: #0 sleeping until buffer can store more data
2019-08-19 17:09:52 +0000 [trace]: #0 enqueueing all chunks in buffer instance=70368625388440
2019-08-19 17:09:52 +0000 [trace]: #0 sleeping until buffer can store more data
2019-08-19 17:09:53 +0000 [trace]: #0 sleeping until buffer can store more data
2019-08-19 17:09:53 +0000 [trace]: #0 sleeping until buffer can store more data
2019-08-19 17:09:53 +0000 [trace]: #0 enqueueing all chunks in buffer instance=70368625388440
```
From the logs I can see that it is stuck in an infinite loop and never exits: https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/output.rb#L885
Expected behavior
The expected behaviour is that fluentd still calls the write function while blocking. If write is never called, the buffer is never drained, so fluentd stays stuck in the loop.
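For reference, the code at the linked line behaves roughly like the sketch below. This is an approximate paraphrase of the `overflow_action :block` path in `lib/fluent/plugin/output.rb`, not a verbatim copy; names and the logging helper are simplified. On overflow the writer logs, sleeps until `buffer.storable?` turns true, then retries the write, so if no flush ever frees space the loop never exits, matching the repeated "sleeping until buffer can store more data" trace lines.

```ruby
# Approximate sketch of the :block overflow path (not the real source).
class BufferOverflowError < StandardError; end

def write_guard(buffer, log)
  begin
    yield
  rescue BufferOverflowError
    log << "failed to write data into buffer by buffer overflow action=:block"
    log << "buffer.write is now blocking"
    until buffer.storable?
      log << "sleeping until buffer can store more data"
      sleep 0.01 # the real loop waits about 1s per iteration
    end
    retry
  end
end

# Test double: becomes storable after a few checks, simulating a flush
# thread that eventually frees space. In the reported bug, storable?
# never turns true, so the until-loop above spins forever.
class FakeBuffer
  def initialize(free_after_checks)
    @checks = 0
    @free_after = free_after_checks
  end

  def storable?
    @checks += 1
    @checks > @free_after
  end
end
```

With a `FakeBuffer` that frees up after three checks, `write_guard` emits three "sleeping" lines and then the retried write succeeds; with a buffer that never frees, it would log "sleeping" indefinitely.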
Your Environment
- Fluentd v1.6.3, Ruby v2.6.3
- using the fluentd Debian image
- running fluentd in Kubernetes
Someone also reported a similar issue: https://github.com/uken/fluent-plugin-elasticsearch/issues/609
@repeatedly wondering if you have seen the same thing? If so I can create a PR to fix the issue.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 16 (15 by maintainers)
Yep, you are right. The flush thread won't be blocked.
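That point can be illustrated with a toy producer/consumer sketch. This is an assumption-laden analogy, not fluentd code: `overflow_action :block` stalls only the emitting writer thread, while a separate flush thread keeps draining, so a downstream that accepts data eventually unblocks the writer.

```ruby
# Toy sketch (not fluentd code): :block stalls only the writer; a flush
# thread keeps draining, so a healthy downstream unblocks the writer.
buffer = Queue.new # stand-in for the output buffer's queue
LIMIT = 3          # pretend total capacity, in chunks

# Flush thread: a downstream with capacity keeps accepting chunks.
flusher = Thread.new do
  5.times { buffer.pop }
end

# Writer (here, the main thread): waits while the buffer is "full",
# mimicking overflow_action :block, then writes and moves on.
8.times do |i|
  sleep 0.005 while buffer.size >= LIMIT
  buffer << i
end
flusher.join
```

When the downstream is frozen (the flush thread never pops), the writer's wait loop never ends, which is exactly the stuck state in the logs above.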
Here is another diff and log:
If the storage that receives records from Fluentd has capacity, it won't be blocked forever:
Maybe, yes. The previous log was caused by index writes to a frozen Elasticsearch cluster. For example, an exhausted Elasticsearch cluster stopped accepting incoming bulk requests, so Fluentd's flush threads could not make progress. Flushing stalled, and `buffer#storable?` always returned `false`. I used the following configuration to reproduce this issue.