fluent-bit: Multiline filter crashes after buffer limit is reached, causing high CPU and memory usage

Bug Report

Describe the bug

Hello

The multiline filter crashes on pods that generate a large volume of logs, once Emitter_Mem_Buf_Limit is reached. On pods with a normal/low volume of logs it works without problems.

To Reproduce

[2022/02/25 15:40:06] [ info] [input:tail:tail.0] inode=4540032 handle rotation(): /var/log/containers/xxxxx.log => /var/lib/docker/containers/xxxxx/xxxxx-json.log.5
[2022/02/25 15:40:06] [ info] [input:tail:tail.0] inotify_fs_add(): inode=4540032 watch_fd=113 name=/var/lib/docker/containers/xxxxx/xxxxx-json.log.5
[2022/02/25 15:40:07] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4592513 watch_fd=111
[2022/02/25 15:40:12] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=3353782 watch_fd=112
[2022/02/25 15:40:12] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4540032 watch_fd=113
[2022/02/25 15:40:27] [ info] [input:tail:tail.0] inode=3353789 handle rotation(): /var/log/containers/xxxxx.log => /var/lib/docker/containers/xxxxx/xxxxx-json.log.7
[2022/02/25 15:40:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=3353789 watch_fd=114 name=/var/lib/docker/containers/xxxxx/xxxxx-json.log.7
[2022/02/25 15:40:27] [ warn] [input] emitter.3 paused (mem buf overlimit)
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log

  • Steps to reproduce the problem:

This is my configuration (I left only the relevant parts):

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        DB /var/log/containers/fluentbit_db.db
        Parser docker
        Tag kube.*
        Mem_Buf_Limit 10MB
        Buffer_Chunk_Size 256k
        Buffer_Max_Size 256k
        Skip_Long_Lines On
        Refresh_Interval 1
        multiline.parser docker,cri
......


    [FILTER]
        name multiline
        Emitter_Mem_Buf_Limit 2.4GB
        match kube.*
        multiline.key_content log
        multiline.parser java,go,python
    [FILTER]
        Name kubernetes
        Buffer_Size 512k
        Match kube.*
        Merge_Log On
        Merge_Log_Key log_json
        Merge_Log_Trim On
        Keep_Log On
        K8S-Logging.Parser On
        K8S-Logging.Exclude On

....

Your Environment

  • Version used: Fluent Bit v1.8.12
  • Environment name and version (e.g. Kubernetes? What version?): v1.20.4-eks
  • CPU and memory load after enabling the multiline filter: I tried increasing the memory limit and the emitter buffer limit to a few GB and the processes were still crashing.

Additional context

The Fluent Bit container keeps crashing once it reaches the memory limit configured for that container. In addition, a lot of log lines like

[error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag:

are flooding the Fluent Bit logs.

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 24
  • Comments: 47 (13 by maintainers)

Most upvoted comments

Hello everyone. I still get an error on version 2.1.9. I request assistance in resolving this issue. I’ve tried various configurations, and the problem occurred after upgrading from 1.8.6 to 2.1.x.

I must say it's strange that no one seems to be looking into this. Anyway, I suggest you either return to your original version for now, or upgrade to 2.0.6. I think that is the last version that doesn't have this issue.

I found there is a new buffer config parameter; setting it to off seems to avoid this issue. Doc: https://docs.fluentbit.io/manual/pipeline/filters/multiline-stacktrace#configuration-parameters

[FILTER]
    Name                  multiline
    match                 kube.*
    multiline.key_content log
    multiline.parser      java
    buffer                off

#8473 is the solution that should be used for this. I’ve closed my PR in favour of it.

I took a look at this issue today.

The root problem: the internal in_emitter plugin that the multiline filter uses to emit the joined logs does not respect the FLB_INPUT_PAUSED state. This state is set and checked automatically in flb_input_chunk.c, and those routines will defensively refuse to append to chunks if the plugin is paused. This is not a problem for most input plugins, as they simply pause and resume their collectors when the overall input instance is paused. That is not the case for the emitter, where any plugin can append records to it at any time. So the multiline filter indiscriminately slams the emitter with new logs, the emitter fails to append most of them, and those logs are dropped.

Worth noting that this is different from the behaviour documented for the emitter_mem_buf_limit field (see its description in the link above). I actually have no idea how to implement that documented behaviour; I couldn't find a way for a filter to wait for an input plugin to resume without blocking the entire program (essentially deadlocking it, since the emitter would never be able to process the logs it already has in order to make room, so it could never resume).

This problem exists for both multiline and rewrite_tag, as their usage of the emitter (just adding records directly to it) is the same.

I opened the PR shown above, which cuts out the middleman and checks whether the emitter is paused before even trying to write to it. That will at least stop the error-log spam. Technically this issue is about high CPU, and the PR does not solve that. The multiline processor is very CPU intensive in general, so when it runs as a filter on the main thread it's no wonder it takes a whole CPU. I'm not sure what the solution to that is.


As a user, given the design of this plugin, it is basically necessary to take steps to make it resilient to higher throughput. This is essentially the same as what you'd need to do for any input plugin; it's just a bit odd because you wouldn't expect to need those kinds of strategies for a filter plugin. However, to use this plugin effectively, you essentially have to understand the under-the-hood detail that the emitter is itself an input plugin, and can suffer the same problems as any other input plugin.

Here are the solutions I would recommend, which are not dissimilar to what you would do for any input plugin:

  • As suggested above, increasing the memory buffer size for the emitter is a good first step. With a larger memory buffer, it will be less likely to get paused and should therefore drop fewer logs.
  • You could switch the emitter to filesystem buffering instead of memory. Filesystem buffering is designed for the scenario where you want to make sure you keep extra logs while the input plugin is busy. In this case the emitter is even less likely to pause; it pauses at the storage.max_chunks_up amount. These numbers all depend on what you set for the global storage settings in [SERVICE] (side note: it would be great to be able to set these for the emitter); a config sketch follows this list. Read more at https://docs.fluentbit.io/manual/administration/buffering-and-storage
  • If you are on a later version of Fluent Bit, you can run the input plugin threaded and run the multiline filter as a processor. That way the multiline filter doesn't take up all the air in the room by clogging the main engine thread. This is only possible with YAML configs, I think, and it might not be possible depending on how you've architected your pipeline; a YAML sketch also follows this list. Read more at https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/yaml/configuration-file#config_processors
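
To make the second suggestion concrete, here is a minimal sketch of filesystem buffering for the emitter. The storage.path, limits, and parsers are illustrative assumptions, not taken from this issue; tune them to your environment:

    [SERVICE]
        # global storage settings; the emitter's filesystem buffering relies on these
        storage.path          /var/log/flb-storage/
        storage.max_chunks_up 128

    [FILTER]
        name                  multiline
        match                 kube.*
        multiline.key_content log
        multiline.parser      java,go,python
        # buffer the emitter's records on disk instead of memory only
        emitter_storage.type  filesystem
        emitter_mem_buf_limit 64M

And here is a rough YAML sketch of the third suggestion (a threaded tail input with multiline run as a processor), assuming a Fluent Bit version where in_tail supports the generic threaded option and processors are available; again, paths and parsers are illustrative:

    pipeline:
      inputs:
        - name: tail
          path: /var/log/containers/*.log
          multiline.parser: docker,cri
          threaded: true              # run this input in its own thread
          processors:
            logs:
              - name: multiline       # runs inside the input's thread instead of the main engine thread
                multiline.key_content: log
                multiline.parser: java,go,python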

We are experiencing exactly the same issue; the main problem is that we actually need the functionality of the previously mentioned PR for the multiline logs. We use amazon/aws-for-fluent-bit and run it next to our own containers using the built-in FireLens integration: [2022/03/07 10:10:52] [error] [input:emitter:emitter_for_multiline.2] error registering chunk with tag: xxx-firelens-GUID. The container has 50MB of allocated memory, which used to be more than enough, but now it uses up that memory in 3 days and starts just returning errors.

@PettitWesley could you have a look if there are any issues in your implementation or if there are any workarounds?

I installed version 2.1.10 but I still have the problem of high CPU consumption. I downgraded to 1.9.10.

Hi Luciano. From what I tested, the problem started after 2.0.9, so if you like you can try that version. I wish someone would look into this already.

Hi Ryan, I confirm that version 2.0.9 doesn't have the CPU issue.

I have just opened a related issue: I found that somehow we have a regression after upgrading from 1.9.7 to 2.1.2 (#7782). I am wondering if others have noticed a similar issue, and if downgrading to 1.9.x fixes it for you.

Same issue: upgrading from 2.0.9 to 2.1.2 immediately causes high CPU, pod restarts, and "error registering chunk with tag" messages. The issue is gone once I downgrade back to 2.0.9. I have just one multiline [FILTER] with a custom [MULTILINE_PARSER] in the config. "buffer off" eliminates the errors on 2.1.2, but the multiline filter stops working with "buffer off" configured in the [FILTER], even though it is used with a tail input.

Hi terence. Thanks for the update. Indeed you are right, I just checked 2.0.9 with my application and CPU is OK. So it seems the regression started after 2.0.9. I'm really surprised no one is looking at this yet; I would say this is a critical bug. https://github.com/fluent/fluent-bit/issues/7782

For anyone facing similar issues: we tried a lot of different configurations, including increasing memory limits, but it didn't work. In the end, we had to replace Fluent Bit with Fluentd to make our logging work with the multiline parser.

I know it's ugly, but since I really need this working, I set up a CronJob to rollout-restart the fluent-bit pods every day at 03:00.
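
For anyone who wants to copy that stopgap, here is a rough sketch of such a CronJob, assuming Fluent Bit runs as a DaemonSet named fluent-bit in a logging namespace and that a fluent-bit-restarter service account with RBAC rights to restart that DaemonSet already exists; all of these names are illustrative, not from the original comment:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: fluent-bit-nightly-restart
      namespace: logging
    spec:
      schedule: "0 3 * * *"                            # every day at 03:00
      jobTemplate:
        spec:
          template:
            spec:
              serviceAccountName: fluent-bit-restarter # assumed account with permission to restart the DaemonSet
              restartPolicy: OnFailure
              containers:
                - name: kubectl
                  image: bitnami/kubectl:latest        # assumed image; any image providing kubectl works
                  command: ["kubectl", "rollout", "restart", "daemonset/fluent-bit", "-n", "logging"]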

I have added this to my personal backlog, which unfortunately is very long right now.

My recommendation is to take a look at past issues involving the rewrite_tag filter; with the new code, the multiline filter with buffering is very similar to rewrite_tag. The same issue should have arisen with that filter as well.