fluent-bit: Fluent Bit consumes 100% CPU and logs "error registering chunk" errors with the multiline parser on versions 2.1.x (regression from v1.9.7)
I have a Fluent Bit / Fluentd configuration that has been working well on version 1.9.7. The input uses tail with no memory limits. After upgrading from 1.9.7 to 2.1.2, we noticed that at high log rates (around 1000 log lines per second) Fluent Bit reaches 100% CPU after several minutes and starts logging many errors. Once this happens it never stops; it logs many errors every second (it is easily reproducible locally): "[input:emitter:emitter_for_multiline.4] error registering chunk". I also tried 2.1.8 and hit the same problem. After returning to the original version 1.9.7, CPU was around 5% and there were no errors. I tested this several times locally and it is consistent.
At first I got errors related to the rewrite_tag filter I was using, so I thought the rewrite might have an issue and removed that filter. Now I get errors related to multiline instead, so I assume this is a more fundamental issue that is not tied to a specific filter. Once Fluent Bit starts writing these errors there is no stopping it; it constantly logs many lines like these:
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
I would guess this issue was introduced in the 2.1.x line.
To reproduce, I ran Fluent Bit / Fluentd locally with the multiline parser filters and many different kinds of mock components producing logs at a high rate.
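For anyone reproducing this without the real workloads, below is a minimal sketch of a synthetic high-rate input using Fluent Bit's dummy input plugin; the tag and message contents are made up and only meant to drive the multiline filter at roughly the rates described above.

# Hypothetical input that emits ~1000 records per second for local reproduction.
# The tag is an example chosen to match the multiline filter's match pattern below.
[INPUT]
    Name   dummy
    Tag    kubelrc.srl-tdigest.qa3.mock-pod
    Dummy  {"log": "2023-08-02 10:31:45 INFO mock log line"}
    Rate   1000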
I have several multiline parsers for different components, but they all look more or less like the one below. I assume, though, that any parser will do.
[INPUT]
    Name              tail
    Path              /var/log/containers/*srl-*.log
    Key               log
    Refresh_Interval  10
    Tag               kubelrc.<container_name>.<namespace_name>.<pod_name>
    Tag_Regex         (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Parser            cri
[FILTER]
    name                   multiline
    match                  kubelrc.srl-tdigest*.*
    # Must set the field that contains the message to parse
    multiline.key_content  log
    multiline.parser       multiline-tdigest
[MULTILINE_PARSER]
    name  multiline-tdigest
    type  regex
    # TDigest state machine:
    # Lines starting with dddd-dd-dd are assumed to begin with a timestamp.
    # Any subsequent lines that do NOT match dddd-dd-dd are treated as part of the previous line.
    # NOTE: this can cause a problem, e.g. after a TDigest restart the first log lines without a
    # timestamp will be appended to the last timestamped line from before the restart.
    # We could consider being more strict with the cont rule and specify what to expect after a newline.
    # rules | state name    | regex pattern                                     | next state name
    # ------|---------------|---------------------------------------------------|----------------
    rule      "start_state"   "/^(([0-9]{4}-[0-9]{2}-[0-9]{2})|IMAGE_TAG)/"        "cont"
    rule      "cont"          "/^(?!(([0-9]{4}-[0-9]{2}-[0-9]{2})|IMAGE_TAG))/"    "cont"
    # rule "cont" "/^(\s+at).*/"   (Java exceptions only)
About this issue
- State: closed
- Created a year ago
- Reactions: 5
- Comments: 15 (2 by maintainers)
Update: it seems the issue started after 2.0.9, because 2.0.9 does not have this problem and we know that 2.1.2 does. I know I am not the only one with this issue, as reported by others here: https://github.com/fluent/fluent-bit/issues/4940
Hi Leonardo, I actually already checked that before, but just to make sure I checked again right now: adding "buffer off" to the multiline parsers does not solve the 100% CPU issue. The problem remains.
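For reference, this is where that setting goes, assuming the documented buffer and emitter_mem_buf_limit options of the multiline filter; the limit value below is only an example, and as noted above, turning buffering off did not change the behaviour for me.

# Sketch: the same multiline filter with buffering disabled; the emitter memory
# limit is shown for completeness with an example value, not a recommendation.
[FILTER]
    name                   multiline
    match                  kubelrc.srl-tdigest*.*
    multiline.key_content  log
    multiline.parser       multiline-tdigest
    buffer                 off
    emitter_mem_buf_limit  10M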
Hi, can someone please look into this? It is definitely an issue that requires attention, as it is a regression starting in 2.1.x (at least I know for sure it is present in 2.1.2 and continues in 2.1.8). Last week we returned to version 1.9.7 and everything is working fine, so the problem we had was definitely caused by the upgrade.