fluent-bit: Multiline filter crashes after buffer limit is reached, causing high CPU and memory usage
Bug Report
Describe the bug
Hello
The multiline filter crashes on pods that generate a large amount of logs, after reaching Emitter_Mem_Buf_Limit. On pods with a normal/low volume of logs it works without problems.
To Reproduce
- Example log messages:
[2022/02/25 15:40:06] [ info] [input:tail:tail.0] inode=4540032 handle rotation(): /var/log/containers/xxxxx.log => /var/lib/docker/containers/xxxxx/xxxxx-json.log.5
[2022/02/25 15:40:06] [ info] [input:tail:tail.0] inotify_fs_add(): inode=4540032 watch_fd=113 name=/var/lib/docker/containers/xxxxx/xxxxx-json.log.5
[2022/02/25 15:40:07] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4592513 watch_fd=111
[2022/02/25 15:40:12] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=3353782 watch_fd=112
[2022/02/25 15:40:12] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4540032 watch_fd=113
[2022/02/25 15:40:27] [ info] [input:tail:tail.0] inode=3353789 handle rotation(): /var/log/containers/xxxxx.log => /var/lib/docker/containers/xxxxx/xxxxx-json.log.7
[2022/02/25 15:40:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=3353789 watch_fd=114 name=/var/lib/docker/containers/xxxxx/xxxxx-json.log.7
[2022/02/25 15:40:27] [ warn] [input] emitter.3 paused (mem buf overlimit)
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
- Steps to reproduce the problem:
This is my configuration (I left only the relevant parts):
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    DB                /var/log/containers/fluentbit_db.db
    Parser            docker
    Tag               kube.*
    Mem_Buf_Limit     10MB
    Buffer_Chunk_Size 256k
    Buffer_Max_Size   256k
    Skip_Long_Lines   On
    Refresh_Interval  1
    multiline.parser  docker,cri
......

[FILTER]
    name                  multiline
    Emitter_Mem_Buf_Limit 2.4GB
    match                 kube.*
    multiline.key_content log
    multiline.parser      java,go,python

[FILTER]
    Name                kubernetes
    Buffer_Size         512k
    Match               kube.*
    Merge_Log           On
    Merge_Log_Key       log_json
    Merge_Log_Trim      On
    Keep_Log            On
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
....
Your Environment
- Version used: Fluent Bit v1.8.12
- Environment name and version (e.g. Kubernetes? What version?): v1.20.4-eks
- CPU and memory load spike after enabling the multiline filter. I tried increasing the container memory limit and the emitter buffer limit to a few GB and the processes were still crashing.
Additional context
The Fluent Bit container keeps crashing after it reaches the memory limit configured for that container. Also, a lot of log lines like
[error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag:
are flooding the Fluent Bit logs.
About this issue
- State: open
- Created 2 years ago
- Reactions: 24
- Comments: 47 (13 by maintainers)
I must say it is strange that no one seems to be looking into this. Anyway, I suggest you either return to your original version for now, or upgrade to 2.0.6. I think this is the last version that doesn't have this issue.
I found there is a new buffer config parameter; setting it to off seems to avoid this issue. Doc: https://docs.fluentbit.io/manual/pipeline/filters/multiline-stacktrace#configuration-parameters

#8473 is the solution that should be used for this. I've closed my PR in favour of it.
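For anyone who wants to try the buffer workaround mentioned above, here is a minimal sketch of how it could look with the filter from the original report (the only change is the buffer parameter; per the linked docs, non-buffered mode mainly works with the tail input, so whether it is acceptable depends on your pipeline):

[FILTER]
    name                  multiline
    match                 kube.*
    multiline.key_content log
    multiline.parser      java,go,python
    # Non-buffered mode skips the in_emitter re-emit step, so the
    # emitter buffer limit (and the "error registering chunk" spam)
    # no longer applies.
    buffer                off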
I took a look at this issue today.

The root problem: the internal in_emitter plugin that the multiline filter uses to emit the joined logs does not respect the FLB_INPUT_PAUSED state. This state is set and checked automatically in flb_input_chunk.c, and those routines will defensively refuse to append to chunks if the plugin is paused. This is not a problem for most input plugins, as they simply pause and resume their collectors when the overall input instance is paused. That is not the case for the emitter, where any plugin can append records to it at any time. So the multiline filter indiscriminately slams the emitter with new logs, causing the emitter plugin to fail to append most of them and drop the logs. Worth noting that this is different from the behaviour documented for the emitter_mem_buf_limit field; I actually have no idea how to implement that documented behaviour, since I couldn't find a way for a filter to wait for an input plugin to resume without blocking the entire program (essentially deadlocking it, because the emitter would never get to process the logs it needs to drain in order to resume). This problem exists for both multiline and rewrite_tag, as their usage of the emitter (just adding records directly to it) is the same.

I opened a PR, shown above, that cuts out the middleman and checks whether the emitter is paused before even trying to write to it. That will at least stop the spamming error log. Technically this issue is about high CPU, and the PR does not solve that: the multiline processor is very CPU intensive in general, so when it runs as a filter on the main thread it's no wonder it takes a whole CPU. I'm not sure what the solution to that is.
As a user, with the design of this plugin it is basically necessary to take steps to make it resilient to higher throughputs. This is actually much the same as what you'd need to do for any input plugin; it's just a bit odd because you wouldn't think of needing those kinds of strategies on a filter plugin. However, to use this plugin effectively, it is essentially necessary to understand the under-the-hood detail of the emitter also being an input plugin, which can suffer the same problems as any other input plugin.

Here are the solutions I would recommend, which are not dissimilar to any input plugin: tune the emitter's buffer limit and the storage.max_chunks_up amount. These numbers all depend on what you set for the global storage settings in [SERVICE] (side note: it would be great to be able to set these for the emitter). Read more at https://docs.fluentbit.io/manual/administration/buffering-and-storage
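To make those recommendations concrete, here is a rough sketch of how the global storage settings and the multiline filter's emitter options could be combined. The parameter names (storage.path, storage.max_chunks_up, emitter_name, emitter_storage.type, emitter_mem_buf_limit) come from the Fluent Bit docs, but the values are placeholders for illustration, not tested recommendations:

[SERVICE]
    # Filesystem buffering location; chunks beyond the in-memory limits
    # are kept here instead of only in RAM.
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.backlog.mem_limit 50M
    # Caps how many chunks are kept "up" in memory at any one time.
    storage.max_chunks_up     128

[FILTER]
    name                  multiline
    match                 kube.*
    multiline.key_content log
    multiline.parser      java,go,python
    # Name the emitter so it is easier to spot in logs/metrics, let it
    # spill to filesystem storage, and give it an explicit memory cap.
    emitter_name          multiline_emitter
    emitter_storage.type  filesystem
    emitter_mem_buf_limit 64M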
We are experiencing exactly the same issue, with the main problem being that we actually need the functionality of the previously mentioned PR for the multiline logs. We use amazon/aws-for-fluent-bit and run it next to our own containers using the built-in FireLens integration:
[2022/03/07 10:10:52] [error] [input:emitter:emitter_for_multiline.2] error registering chunk with tag: xxx-firelens-GUID
The container has 50MB of allocated memory, which used to be more than enough, but now it uses up that memory in 3 days and then just returns errors.

@PettitWesley could you take a look to see whether there is an issue in your implementation, or whether there are any workarounds?
Hi Ryan, I confirm that version 2.0.9 doesn't have the CPU issue.
Hi Luciano. From what I tested, the problem started after 2.0.9, so if you like you can try that version. I wish someone would look into this already.
Hi Terence. Thanks for the update. Indeed you are right, I just checked 2.0.9 with my application and CPU is OK. So it seems the regression started after 2.0.9. I'm really surprised no one is looking at this yet. I would say this is a critical bug.
I have just opened a related issue; I found that somehow we have a regression after upgrading to 2.1.2 from 1.9.7: https://github.com/fluent/fluent-bit/issues/7782

I am wondering if others have noticed a similar issue, and if downgrading to 1.9.x fixes it for you.
For anyone facing similar issues: we tried a lot of different configurations, including increasing memory limits, but it didn't work. In the end we had to replace Fluent Bit with Fluentd to make our logging work with the multiline parser.
I know it's ugly, but since I really need this working, I set up a cronjob that does a rollout restart of the fluent-bit pods every day at 03:00 am.
I have added this to my personal backlog, which unfortunately is very long right now.
My recommendation is to take a look at past issues involving the rewrite_tag filter; with the new code, the multiline filter with buffer is very similar to rewrite_tag. The same issue should have arisen with that filter as well.