fluent-bit: Fluent Bit consumes 100% CPU and logs "error registering chunk" errors with the multiline parser on versions 2.1.x (regression from v1.9.7)
I have a Fluent Bit / Fluentd configuration that has been working well on version 1.9.7. The input uses tail with no memory limits. After upgrading from 1.9.7 to 2.1.2, we noticed that at high log rates (around 1000 log lines per second) Fluent Bit reaches 100% CPU after several minutes and starts logging many errors. Once this happens it never stops; it logs many errors every second (it is easily reproducible locally): "[input:emitter:emitter_for_multiline.4] error registering chunk". I also tried 2.1.8 and hit the same problem. After returning to the original version 1.9.7, CPU was around 5% and there were no errors. I tested this several times locally and it is consistent.
At first I got errors related to the rewrite_tag filter I was using, so I thought the rewrite might have an issue and removed that filter. Now I get errors related to multiline instead, so I assume this is a more fundamental issue that is not tied to a specific filter. Once Fluent Bit starts writing these errors there is no stopping it; it constantly logs many lines like these:
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
I would guess this issue was introduced in the 2.1.x line.
To reproduce, I ran Fluent Bit / Fluentd locally with the multiline parser filters and many different kinds of mock components producing logs at a high rate.
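For anyone reproducing this without the real workloads, below is a minimal sketch of a synthetic high-rate input using Fluent Bit's dummy input plugin; the tag and message contents are made up and only meant to drive the multiline filter at roughly the rates described above.

# Hypothetical input that emits ~1000 records per second for local reproduction.
# The tag is an example chosen to match the multiline filter's match pattern below.
[INPUT]
    Name   dummy
    Tag    kubelrc.srl-tdigest.qa3.mock-pod
    Dummy  {"log": "2023-08-02 10:31:45 INFO mock log line"}
    Rate   1000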
I have several multiline parsers for different components, but they all look more or less like the one below. I assume, though, that any parser will do.
[INPUT]
    Name              tail
    Path              /var/log/containers/*srl-*.log
    Key               log
    Refresh_Interval  10
    Tag               kubelrc.<container_name>.<namespace_name>.<pod_name>
    Tag_Regex         (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Parser            cri
[FILTER]
    name                   multiline
    match                  kubelrc.srl-tdigest*.*
    # Must set the field that contains the message to parse
    multiline.key_content  log
    multiline.parser       multiline-tdigest
[MULTILINE_PARSER]
    name  multiline-tdigest
    type  regex
    # TDigest state machine:
    # Lines starting with dddd-dd-dd are assumed to begin with a timestamp.
    # Any subsequent lines that do NOT match dddd-dd-dd are treated as part of the previous line.
    # NOTE: this can cause a problem, e.g. after a TDigest restart the first log lines without a
    # timestamp will be appended to the last timestamped line from before the restart.
    # We could consider being more strict with the cont rule and specify what to expect after a newline.
    # rules | state name    | regex pattern                                     | next state name
    # ------|---------------|---------------------------------------------------|----------------
    rule      "start_state"   "/^(([0-9]{4}-[0-9]{2}-[0-9]{2})|IMAGE_TAG)/"        "cont"
    rule      "cont"          "/^(?!(([0-9]{4}-[0-9]{2}-[0-9]{2})|IMAGE_TAG))/"    "cont"
    # rule "cont" "/^(\s+at).*/"   (Java exceptions only)
About this issue
- State: closed
- Created a year ago
- Reactions: 5
- Comments: 15 (2 by maintainers)
Update: it seems the issue started after 2.0.9, because 2.0.9 does not have this problem and we know that 2.1.2 does. I know I am not the only one with this issue, as reported by others here: https://github.com/fluent/fluent-bit/issues/4940
Hi Leonardo, I actually already checked that before, but just to make sure I checked again right now: adding "buffer off" to the multiline parsers does not solve the 100% CPU issue. The problem remains.
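For reference, this is where that setting goes, assuming the documented buffer and emitter_mem_buf_limit options of the multiline filter; the limit value below is only an example, and as noted above, turning buffering off did not change the behaviour for me.

# Sketch: the same multiline filter with buffering disabled; the emitter memory
# limit is shown for completeness with an example value, not a recommendation.
[FILTER]
    name                   multiline
    match                  kubelrc.srl-tdigest*.*
    multiline.key_content  log
    multiline.parser       multiline-tdigest
    buffer                 off
    emitter_mem_buf_limit  10M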
Hi, can someone please look into this? It is definitely an issue that requires attention, as it is a regression starting in 2.1.x (at least I know for sure it is present in 2.1.2 and continues in 2.1.8). Last week we returned to version 1.9.7 and everything is working fine, so the problem we had was definitely caused by the upgrade.