fluent-bit: FluentBit spams itself with "error registering chunk with tag"
Bug Report
I see these errors in the aggregator a few seconds after it starts. Usually, I see this error after reaching Emitter_Mem_Buf_Limit.
Our forwarder also tails the Fluent Bit logs, which exacerbates the problem: these error lines get tailed and re-forwarded to the aggregator, which then generates the same error again, and the cycle continues. I don't see how to repro this, but is there any feature to suppress error messages, for example, don't generate this error more than x times per minute?
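To illustrate that feedback loop, a minimal, hypothetical forwarder sketch (paths, tag, and host are placeholders, not our actual configuration) in which the tail input's Path glob also matches Fluent Bit's own container log, so every error line Fluent Bit prints becomes a new input record and is shipped back to the aggregator:

[INPUT]
    Name           tail
    # This glob also matches the fluent-bit container's own log file,
    # so the [error] lines it prints are read back in as new records.
    Path           /var/log/containers/*.log
    Tag            kube.*
    Mem_Buf_Limit  50MB

[OUTPUT]
    Name   forward
    Match  kube.*
    # Aggregator address is a placeholder.
    Host   aggregator.example.com
    Port   24224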
To Reproduce
2022-07-20 18:43:21.7122020 | [2022/07/20 18:43:21] [ info] [fluent bit] version=1.9.4, commit=08de43e474, pid=1
2022-07-20 18:43:21.7122250 | [2022/07/20 18:43:21] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
2022-07-20 18:43:21.7122330 | [2022/07/20 18:43:21] [ info] [cmetrics] version=0.3.1
2022-07-20 18:43:21.7123510 | [2022/07/20 18:43:21] [ info] [input:forward:input.forward] listening on 0.0.0.0:24224
2022-07-20 18:43:21.7141520 | [2022/07/20 18:43:21] [ info] [output:forward:forward.mdsd] worker #0 started
2022-07-20 18:43:21.7357780 | [2022/07/20 18:43:21] [ info] [output:forward:forward.mdsd] worker #1 started
2022-07-20 18:43:21.7358770 | [2022/07/20 18:43:21] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
2022-07-20 18:43:21.7358880 | [2022/07/20 18:43:21] [ info] [sp] stream processor started
2022-07-20 18:43:56.7986370 | [2022/07/20 18:43:56] [error] [input:emitter:re_emitted.container_log] error registering chunk with tag: mdsd.container.log
2022-07-20 18:43:56.7986420 | [2022/07/20 18:43:56] [error] [input:emitter:re_emitted.container_log] error registering chunk with tag: mdsd.container.log
2022-07-20 18:43:56.7986500 | [2022/07/20 18:43:56] [error] [input:emitter:re_emitted.container_log] error registering chunk with tag: mdsd.container.log
2022-07-20 18:43:56.7986540 | [2022/07/20 18:43:56] [error] [input:emitter:re_emitted.container_log] error registering chunk with tag: mdsd.container.log
...
- Steps to reproduce the problem:
- N/A
Expected behavior: FluentBit should not generate the same error message more than n times per second/minute.
Screenshots
Your Environment
- Version used: 1.9.4
- Configuration:
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.22.11
- Server type and version:
- Operating System and version:
- Filters and plugins: tail, kubernetes, rewrite_tag, lua, forward (a rough pipeline sketch follows below)
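For context, a rough sketch of the pipeline implied by the plugin list above (tail → kubernetes → rewrite_tag → lua → forward). Only the plugin names and the emitter name visible in the log output (re_emitted.container_log) come from this report; the rule, tags, script name, and limits are illustrative guesses:

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Tag               kube.*
    multiline.parser  docker, cri
    Mem_Buf_Limit     50MB

[FILTER]
    Name       kubernetes
    Match      kube.*
    Merge_Log  On
    Keep_Log   Off

[FILTER]
    Name                   rewrite_tag
    Match                  kube.*
    # The rule is a guess; the emitter name matches the
    # [input:emitter:re_emitted.container_log] instance seen in the logs.
    Rule                   $log ^.+$ mdsd.container.log false
    Emitter_Name           re_emitted.container_log
    Emitter_Mem_Buf_Limit  10M

[FILTER]
    Name    lua
    Match   mdsd.*
    # Hypothetical script and function names.
    script  transform.lua
    call    transform_record

[OUTPUT]
    Name   forward
    Match  mdsd.*
    Host   aggregator.example.com
    Port   24224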
Additional context
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 6
- Comments: 48 (10 by maintainers)
We ran into the same issue with systemd multiline parsing using 1.9.10:
It was perhaps caused by a large batch of unsent logs due to an output misconfiguration. It is extremely concerning that Fluent Bit fails in this manner.
Hey folks, an update here: we have merged a "Log Suppression" feature (https://github.com/fluent/fluent-bit/pull/6435), which should be released soon. This should help with errors that keep showing up.
I’ve been hitting this issue as well in k8s where the fluent-bit pod frequently becomes OOM and crashes.
On a 30 node cluster, with fluent-bit deployed as a daemonset, there was only one fluent-bit pod repeatably crashing. That pod was on the same node as a very log spammy pod. To test it was the spammy pod, I excluded it and the fluent-bit pod from the path in the tail config. That stopped the crashing, but that’s hardly a fix, as I want those logs as well.
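A minimal sketch of that exclusion, with hypothetical pod name patterns (Exclude_Path takes a comma-separated list of globs):

[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    # Exclude both the log-spammy workload and fluent-bit itself
    # (pod name patterns here are placeholders).
    Exclude_Path  /var/log/containers/spammy-app-*.log,/var/log/containers/fluent-bit-*.log
    Tag           kube.*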
I managed to replicate the errors in a testing environment, where I had to
Then I got the error logs right away, by the tens of thousands.
Although I didn't manage to crash it in the testing environment, I was able to reproduce the error messages.
The problem I had on the large k8s cluster seems to have fixed itself once the components got redeployed and there wasn't a massive backlog of logs for the fluent-bit pods to process.
Hope this can help anyone else.
@vwbusguy recognized the death-spiral description as being caused by Fluent Bit both writing its own logs to the systemd journal and reading back from it… which happens to be my configuration. In practice, that configuration works most of the time, but once Fluent Bit starts spamming its own logs, a threshold is eventually crossed and it can no longer keep up with processing, as input, the logs it is generating as output…
So I’ll update my configuration to break that loop as a workaround, while this issue can stay focused on preventing Fluent Bit from generating the same log entries repeatedly.
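As a sketch of that workaround (the option names are real, but the log path and the journal field value are assumptions): either write Fluent Bit's own log to a file so it never reaches the journal, or drop Fluent Bit's own journal entries before they re-enter the pipeline.

[SERVICE]
    # Keep Fluent Bit's own log out of journald entirely.
    Log_File  /var/log/fluent-bit.log

[INPUT]
    Name  systemd
    Tag   host.journal

[FILTER]
    # Alternative: drop Fluent Bit's own entries when reading the journal back.
    # The SYSLOG_IDENTIFIER value is an assumption; check your journal fields.
    Name     grep
    Match    host.journal
    Exclude  SYSLOG_IDENTIFIER fluent-bit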
I built from master and deployed it to a k8s cluster, and it did not seem to have any effect on the "error registering chunk with tag" errors.
How does one go about estimating the precise value for emitter_mem_buf_limit? For mem_buf_limit and chunk size there are corresponding Prometheus metrics that can help estimate those values. Is there something similar for the emitter? If not, is something like this possible?
Adjusting the emitter_mem_buf_limit worked in my case.
@srikanth-burra I have an estimation calculation here: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention
I had the same problem; errors were issued when emitter_mem_buf_limit was exceeded. I've changed it,
and also added workers to the output.
I've seen this happen when the emitter_mem_buf_limit is exceeded.
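A sketch of the two adjustments mentioned above, applied to a rewrite_tag filter and a forward output. The option names are real, but the rule, limit, and worker count are arbitrary examples, not recommended values:

[FILTER]
    Name                   rewrite_tag
    Match                  kube.*
    Rule                   $log ^.+$ mdsd.container.log false
    Emitter_Name           re_emitted.container_log
    # Raise the emitter's memory buffer limit (value is an example).
    Emitter_Mem_Buf_Limit  100M
    # Or buffer the emitter on disk instead (requires storage.path in [SERVICE]).
    Emitter_Storage.type   filesystem

[OUTPUT]
    Name     forward
    Match    mdsd.*
    Host     aggregator.example.com
    Port     24224
    # Multiple workers help drain a backlog faster.
    Workers  2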
Hello @agup006 as @stebbib has mentioned, this option only acts on messages from output plugins that look similar within an interval of time.
We have created a public FR https://github.com/fluent/fluent-bit/issues/6873 to extend this functionality to input plugins, and other fluent-bit components such as storage, engine, etc.
@markstos Thanks for assisting in that issue, and understand we all have to keep our services up and running 😃. All the best and appreciate your contributions
I’m going to be evaluating Vector instead. This is a critical issue.
https://vector.dev/
It would be good to have a metric exposed when memory limits like Emitter_Mem_Buf_Limit are reached, so people can alert or act on that.
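In the meantime, the built-in HTTP server (already visible in the startup log above on port 2020) exposes metrics in Prometheus format at /api/v1/metrics/prometheus:

[SERVICE]
    # Expose built-in metrics; Prometheus format at /api/v1/metrics/prometheus.
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

Since the emitter shows up as an input instance (the logs above report input:emitter:re_emitted.container_log), its fluentbit_input_records_total and fluentbit_input_bytes_total series may already serve as a rough proxy, but a metric that reports when Emitter_Mem_Buf_Limit itself is hit would still be needed.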