fluent-bit: Fluent Bit is stuck, flb chunks are not flushed, and data is not being sent to Elasticsearch
Bug Report
Describe the bug Fluent Bit sends data to Elasticsearch and works fine for some time, then it gets stuck and stops sending data. On checking the Fluent Bit pod logs, I briefly see the three issues below (full logs attached):
- [ info] [task] re-schedule retry=0x7f6a1ecc7b68 135 in the next 8 seconds
- [ warn] [engine] failed to flush chunk '1-1648141166.449162875.flb', retry in 8 seconds: task_id=517, input=storage_backlog.1 > output=es.0 (out_id=0)
- [debug] [input:tail:tail.0] 0 new files found on path '/var/log/kube-apiserver-audit.log'
- (even though the file '/var/log/kube-apiserver-audit.log' exists inside the pod)
I have referred to similar issues, still no luck:
- As per https://github.com/fluent/fluent-bit/issues/3581#issuecomment-1081107592 I have also tried the 1.9.1-debug version, still seeing the same results
- https://github.com/fluent/fluent-bit/issues/4336
To Reproduce
- Start Fluent Bit as a DaemonSet on Kubernetes and add node affinity so it runs only on the master node
- Use the tail plugin to collect API server audit logs (generated at /var/log/kube-apiserver-audit.log); please refer to the configs in the Configuration section below
After that, I frequently see the logs below in the Fluent Bit pod:
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351188.130831321.flb mapped OK
[2022/03/30 10:18:38] [ info] [input:storage_backlog:storage_backlog.1] queueing tail.0:1-1648351188.130831321.flb
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecc7b68 135 in the next 8 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecd7478 191 in the next 10 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecb0b48 104 in the next 10 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ed02588 85 in the next 7 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ed02498 64 in the next 11 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ed022e0 30 in the next 11 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1eccae08 147 in the next 9 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecb6c28 28 in the next 9 seconds
....
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] scanning path /var/log/kube-apiserver-audit.log
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/kube-apiserver-audit.log, inode 1882
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] 0 new files found on path '/var/log/kube-apiserver-audit.log'
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] inode=1882 events: IN_MODIFY
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648635501.574187623.flb mapped OK
[2022/03/30 10:18:38] [debug] [storage] [cio file] alloc_size from 1114112 to 1245184
[2022/03/30 10:18:38] [debug] [storage] [cio file] synced at: tail.0/1-1648635501.574187623.flb
[2022/03/30 10:18:38] [debug] [input chunk] update output instances with new chunk size diff=128565
[2022/03/30 10:18:38] [debug] [retry] new retry created for task_id=517 attempts=1
[2022/03/30 10:18:38] [ warn] [engine] failed to flush chunk '1-1648141166.449162875.flb', retry in 8 seconds: task_id=517, input=storage_backlog.1 > output=es.0 (out_id=0)
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351224.176425521.flb mapped OK
[2022/03/30 10:18:38] [ info] [input:storage_backlog:storage_backlog.1] queueing tail.0:1-1648351224.176425521.flb
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351264.88207341.flb mapped OK
[2022/03/30 10:18:38] [ info] [input:storage_backlog:storage_backlog.1] queueing tail.0:1-1648351264.88207341.flb
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351301.408233326.flb mapped OK
On checking further inside the Fluent Bit pod, I see that chunks are not cleared and are filling up the disk:
root@fluent-bit-audit-drbxj:/var/log# du -sh flb-storage/tail.0/
26G flb-storage/tail.0/ <------
root@fluent-bit-audit-qbkw8:/var/log# ls -la kube-apiserver-audit*
-rw------- 1 root root 104857010 Mar 31 08:19 kube-apiserver-audit-2022-03-31T08-19-39.831.log
-rw------- 1 root root 95179025 Mar 31 08:46 kube-apiserver-audit.log
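(Side note: my understanding is that the filesystem buffer can be capped per output so it cannot grow unbounded like this. A minimal sketch of what could be added to the es output, assuming storage.total_limit_size behaves as documented; the 5G value is only an illustrative choice, not a recommendation:)
[OUTPUT]
    Name   es
    Match  kube-audit.*
    # Illustrative cap on the filesystem buffer kept for this output;
    # my understanding is that the oldest chunks are discarded once the limit is hit.
    storage.total_limit_size 5G
    # ... remaining es output options as in the Configuration section below ...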
Expected behavior
- Keep sending data to Elasticsearch; it should not get stuck
- Should be able to flush chunk files (*.flb files) regularly (in my case the directory is /var/log/flb-storage/tail.0/)
Your Environment
- Version used: 1.8.15 and 1.9.1-debug (tested on both)
- Configuration (Fluent Bit configuration):
# Configuration files: server, input, filters and output
# ======================================================
fluent-bit.conf: |
[SERVICE]
Flush 1
Log_Level debug
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
storage.path /var/log/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 50M
@INCLUDE input-audit.conf
@INCLUDE output-elasticsearch.conf
input-audit.conf: |
[INPUT]
Name tail
Tag kube-audit.*
Path /var/log/kube-apiserver-audit.log
Parser json
DB /var/log/flb_kube_audit.db
storage.type filesystem
Mem_Buf_Limit 800MB
Skip_Long_Lines On
Refresh_Interval 10
Buffer_Chunk_Size 4096k
Buffer_Max_Size 9000k
output-elasticsearch.conf: |
[OUTPUT]
Name es
Match kube-audit.*
Host ${KUBE_ELASTICSEARCH_HOST}
Port ${KUBE_ELASTICSEARCH_PORT}
Index kube-audit-dev
Buffer_Size False
Generate_ID on
AWS_Auth On
AWS_Region us-east-1
Replace_Dots On
Retry_Limit False
tls On
Trace_Error On
Logstash_Format On
Logstash_Prefix kube-audit-dev
parsers.conf: |
[PARSER]
Name json
Format json
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.22.4
- Server type and version: Running as a Docker image on the Kubernetes master node (using Docker image 1.9.1-debug)
- Operating System and version: Linux
- Filters and plugins: tail, es
Additional context
- I am running a separate copy of Fluent Bit as a DaemonSet (which runs only on the master node) to collect API server audit logs (generated at /var/log/kube-apiserver-audit.log)
- Initially it works fine. After some time, chunks pile up (failed to flush), I see retry messages in the log, and Fluent Bit stops sending data to Elasticsearch.
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 14
- Comments: 57 (10 by maintainers)
Not stale
A mappings conflict is one potential cause of this issue, but it is not the only one. If you set fluent-bit to a limited number of retries (we set it to 10), the chunks that fail due to mapping conflicts are eventually dropped and are not retried perpetually. The scenario I described in my previous post, where Elasticsearch stops accepting ingestion, still occurs.
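For reference, a minimal sketch of the setting we use for this (the standard Retry_Limit option on the output; 10 is just the value we chose, and the other output options are omitted):
[OUTPUT]
    Name        es
    Match       *
    # Retry each failed chunk at most 10 times, then drop it
    # (instead of Retry_Limit False, which keeps retrying indefinitely).
    Retry_Limit 10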
Hi, I got the same issue and resolved the problem by deleting the *.flb files.
To explain my situation in detail: I was using aws-for-fluent-bit version 2.10.0 (fluent-bit 1.6.x) and crashes kept occurring, so I upgraded to 2.25.1 (fluent-bit 1.9.3). I wanted to use the stable version of aws-for-fluent-bit, but in the stable version an error such as "invalid backlog" kept occurring, so I upgraded to the 1.9.x version, which avoids that error.
In my case, the pod with the error had unusually high memory usage (6xx ~ 8xx), and the number of flb files was unusually high (hundreds of thousands to millions).
My guess is that so many flb files could not be flushed because of the crashes in 1.6.x; the problem seems to occur once there are that many flb files.
The issue has not been resolved for us with 2.0.8. What I haven't figured out is how to reliably reproduce this. What happens is:
Last night Elasticsearch was stuck in read-only for about 5 hours due to a full filesystem (non-critical system, so we don't have any paging expectations on weekends/overnight).
After fixing the Elasticsearch problem, the existing fluent-bit pods were all stuck on retry with no output records being processed. After restarting fluent-bit it works through the backlog, but one of the issues is that our fluent-bit throughput is considerably faster than Elasticsearch can process (6000+ records per second), which results in Elasticsearch throttling (HTTP 429 response codes). This can also eventually get fluent-bit stuck, as it's unable to send requests (see step 3 above; I have not been able to identify a trigger: is it time based? some other criteria?), and the only fix I've found is to restart the fluent-bit pods.
The symptom on the fluent-bit side is that it is stuck, with logging like this
We've been running fluent-bit 2.0.8 for about a month now and things seem more stable. We haven't had a catastrophic outage of Elasticsearch of the kind that seemed to trigger the problem before (Elasticsearch being down for hours at a time), but we have had Elasticsearch down for 20-30 minutes or so at a time, and so far we haven't seen fluent-bit 2.0.8 get stuck. We're planning to implement a failure situation (shut down Elasticsearch, or block its port from the client) for a few hours at a time to see if the issue still occurs, but it's a little lower on our priority list.
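One thing we have been considering, though I have not validated it, is spreading the retries out so a recovering Elasticsearch is not immediately hammered. If I recall the service-level options correctly, the retry backoff can be tuned with scheduler.base and scheduler.cap; a sketch with arbitrary values:
[SERVICE]
    # Assumed scheduler options (available in recent versions, as far as I know):
    # exponential backoff for retries starting around 5 seconds, capped at 300 seconds.
    scheduler.base 5
    scheduler.cap  300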
Hello @awangptc,
Once the retry limit is reached, the flb files are no longer retried and are deleted from storage.
Are you all using the latest image? We've recently (last month) merged some fixes related to handling irrecoverable chunks.
I'm having a hard time collecting the logs when this occurs with debug on. The issue is that when the error condition occurs, our fluent-bit logs are no longer being sent to Elasticsearch, and the sheer volume of debug logs causes our k8s nodes to rotate the logs more quickly than I can pull them. I'm sure they're in the flb files, but I don't know how to extract them.
The scenario that causes the problem for us is that something causes Elasticsearch to start rejecting requests; we have seen two conditions that cause this.
When these occur, fluent-bit stops sending logs to Elasticsearch but continues to read from tail into new flb files.
If we restart fluent-bit it attempts to reprocess the flb files, but if there are multiple nodes, restarting all of the fluent-bit pods very quickly overwhelms Elasticsearch, which starts to throttle (429s).
This results in a high retry/failure rate. We have our retry limit set to 10. If the limit is reached, what happens to the flb files? Do they get removed for that transaction? If not, that might be part of our issue. I've discovered that if I rotate each fluent-bit pod ONE at a time, Elasticsearch does not get overloaded and is able to work through the backlog of flb files. But that's a time-consuming process, of course.
@lecaros
One of the problems I have with this is that it occurs only when Elasticsearch has issues, and we have not been able to figure out how to force the issue. It can run for days or weeks without the problem appearing, and then when Elasticsearch has issues that cause rejections, this occurs. It's difficult to run fluent-bit with debug logging enabled for the length of time necessary to reproduce this. It does appear to occur when Elasticsearch prevents ingestion for some significant period of time (at least a few hours). Is there anything you can tell us about how fluent-bit handles retrying flb chunks? It's interesting that the workarounds above of deleting older flb files seem to unblock things. Based on that, is there anything that would make sense for us to gather for more targeted debugging?