fluent-bit: Fluent Bit is stuck, flb chunks are not flushed and no data is sent to Elasticsearch

Bug Report

Describe the bug

Fluent Bit sends data to Elasticsearch and works fine for some time, then it gets stuck and stops sending data to Elastic. On checking the Fluent Bit pod logs, I see the three issues below in brief (full logs attached further down):

- [ info] [task] re-schedule retry=0x7f6a1ecc7b68 135 in the next 8 seconds
- [ warn] [engine] failed to flush chunk '1-1648141166.449162875.flb', retry in 8 seconds: task_id=517, input=storage_backlog.1 > output=es.0 (out_id=0)
- [debug] [input:tail:tail.0] 0 new files found on path '/var/log/kube-apiserver-audit.log'  
- (even though the file '/var/log/kube-apiserver-audit.log' exists inside the pod)

I referred to similar issues, still no luck.

To Reproduce

  • Start Fluent Bit as a DaemonSet on Kubernetes, with node affinity so it runs only on the master node
  • Use the tail plugin to collect API server audit logs (generated at /var/log/kube-apiserver-audit.log)
  • Refer to the configs in the Configuration section below

After that, I see the logs below frequently in the Fluent Bit pod:

[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351188.130831321.flb mapped OK
[2022/03/30 10:18:38] [ info] [input:storage_backlog:storage_backlog.1] queueing tail.0:1-1648351188.130831321.flb
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecc7b68 135 in the next 8 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecd7478 191 in the next 10 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecb0b48 104 in the next 10 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ed02588 85 in the next 7 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ed02498 64 in the next 11 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ed022e0 30 in the next 11 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1eccae08 147 in the next 9 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecb6c28 28 in the next 9 seconds

....

[2022/03/30 10:18:38] [debug] [input:tail:tail.0] scanning path /var/log/kube-apiserver-audit.log
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/kube-apiserver-audit.log, inode 1882
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] 0 new files found on path '/var/log/kube-apiserver-audit.log'
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] inode=1882 events: IN_MODIFY 
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648635501.574187623.flb mapped OK
[2022/03/30 10:18:38] [debug] [storage] [cio file] alloc_size from 1114112 to 1245184
[2022/03/30 10:18:38] [debug] [storage] [cio file] synced at: tail.0/1-1648635501.574187623.flb
[2022/03/30 10:18:38] [debug] [input chunk] update output instances with new chunk size diff=128565
[2022/03/30 10:18:38] [debug] [retry] new retry created for task_id=517 attempts=1
[2022/03/30 10:18:38] [ warn] [engine] failed to flush chunk '1-1648141166.449162875.flb', retry in 8 seconds: task_id=517, input=storage_backlog.1 > output=es.0 (out_id=0)
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351224.176425521.flb mapped OK
[2022/03/30 10:18:38] [ info] [input:storage_backlog:storage_backlog.1] queueing tail.0:1-1648351224.176425521.flb
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351264.88207341.flb mapped OK
[2022/03/30 10:18:38] [ info] [input:storage_backlog:storage_backlog.1] queueing tail.0:1-1648351264.88207341.flb
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351301.408233326.flb mapped OK

On checking further inside the Fluent Bit pod, I see that chunks are not cleared and are filling up the disk:

root@fluent-bit-audit-drbxj:/var/log# du -sh flb-storage/tail.0/
26G     flb-storage/tail.0/ <------

root@fluent-bit-audit-qbkw8:/var/log# ls -la kube-apiserver-audit*
-rw------- 1 root root 104857010 Mar 31 08:19 kube-apiserver-audit-2022-03-31T08-19-39.831.log
-rw------- 1 root root  95179025 Mar 31 08:46 kube-apiserver-audit.log
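
(Side note: since HTTP_Server is already enabled on port 2020 in the SERVICE section, the chunk counters of the storage layer can also be checked from inside the pod via the built-in monitoring endpoint. This assumes storage.metrics is additionally turned on, which is not in the config below:)

# assumes 'storage.metrics On' has been added under [SERVICE]
curl -s http://127.0.0.1:2020/api/v1/storage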

Expected behavior

  • Keep sending data to Elasticsearch; Fluent Bit should not get stuck
  • Chunk files (*.flb files) should be flushed regularly (in my case the directory is /var/log/flb-storage/tail.0/)

Your Environment

  • Version used: 1.8.15 and 1.9.1-debug (tested on both)
  • Configuration (fluentbit configuration):
  # Configuration files: server, input, filters and output
  # ======================================================
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     debug
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020
        storage.path              /var/log/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 50M
    @INCLUDE input-audit.conf
    @INCLUDE output-elasticsearch.conf

  input-audit.conf: |
    [INPUT]
        Name              tail
        Tag               kube-audit.*
        Path              /var/log/kube-apiserver-audit.log
        Parser            json
        DB                /var/log/flb_kube_audit.db
        storage.type      filesystem
        Mem_Buf_Limit     800MB
        Skip_Long_Lines   On
        Refresh_Interval  10
        Buffer_Chunk_Size 4096k
        Buffer_Max_Size   9000k

  output-elasticsearch.conf: |
    [OUTPUT]
        Name            es
        Match           kube-audit.*
        Host            ${KUBE_ELASTICSEARCH_HOST}
        Port            ${KUBE_ELASTICSEARCH_PORT}
        Index           kube-audit-dev
        Buffer_Size     False
        Generate_ID     on
        AWS_Auth        On
        AWS_Region      us-east-1
        Replace_Dots    On
        Retry_Limit     False
        tls             On
        Trace_Error     On
        Logstash_Format   On
        Logstash_Prefix   kube-audit-dev

  parsers.conf: |
    [PARSER]
        Name   json
        Format json
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.22.4
  • Server type and version: Running as a Docker image on the Kubernetes master node (using docker image 1.9.1-debug)
  • Operating System and version: Linux
  • Filters and plugins: tail (input), es (output)

Additional context

  • I am running a separate copy of Fluent Bit as a DaemonSet (which runs only on the master node) to collect API server audit logs (generated at /var/log/kube-apiserver-audit.log)
  • Initially it works fine. After some time, chunks pile up (failed to flush), I see retry messages in the log, and Fluent Bit stops sending data to Elasticsearch.
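
Not a fix for the root cause, but a mitigation sketch: the filesystem buffer can be capped so that failing chunks do not fill the disk indefinitely. The option names below come from the Fluent Bit storage documentation; the values are illustrative placeholders, not tested recommendations.

[SERVICE]
    # limit how many chunks are kept up in memory at once; the rest stay on disk
    storage.max_chunks_up               128
    # delete chunks the engine marks as irrecoverable instead of keeping them forever
    storage.delete_irrecoverable_chunks On

[OUTPUT]
    Name                     es
    Match                    kube-audit.*
    # cap the total size of filesystem chunks queued for this output
    storage.total_limit_size 5G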

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 14
  • Comments: 57 (10 by maintainers)

Most upvoted comments

Not stale

The issue is conflicting mappings: during earlier experiments I think text got uploaded as source, and now it’s trying to treat source as an object due to source.ip.

You should be able to repro by using the configs I provided and changing the dummy inputs so that they generate conflicting dynamic mappings, e.g.

[INPUT]
  Name dummy
  Tag api-firelens
  Samples 1
  Dummy {"log": "{\"foobar\": 123}"}

[INPUT]
  Name dummy
  Tag api-firelens
  Samples 1
  Dummy {"log": "{\"foobar\": \"text\"}"}

So this will first map “foobar” as an integer, then send text, which should fail.

A mapping conflict is one potential cause of this issue, but it is not the only one. If you set Fluent Bit to a limited number of retries (we set it to 10), the mapping-conflict chunks eventually get dropped and are not retried perpetually. The scenario I described in my previous post, where Elastic stops accepting ingestion, still occurs.
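
(For reference, that retry cap is just the output-level Retry_Limit scheduler option; a minimal sketch of the relevant lines, not a full output config:)

[OUTPUT]
  Name        es
  Match       kube-audit.*
  # finite retry count, so chunks that can never be ingested (e.g. mapping conflicts) are eventually dropped
  Retry_Limit 10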

Hi, I got the same issue and resolved the problem by deleting the *.flb files.

To explain my situation in detail: I was using aws-for-fluent-bit version 2.10.0 (fluent-bit 1.6.x) and crashes kept occurring, so I upgraded to 2.25.1 (fluent-bit 1.9.3). I wanted to use the stable version of aws-for-fluent-bit, but on the stable version an error such as invalid backlog kept occurring, so I upgraded to the 1.9.x version, which avoids that error.

And my fluent-bit situation is like the screenshots below. The pod with the error has unusually high memory usage (6xx ~ 8xx). [Screenshot: pod memory usage]

And the number of flb files is unusually high (hundreds of thousands to millions). [Screenshot: flb file count]

My guess as to why so many flb files could not be flushed is that the crashes occurred on 1.6.x. So it seems to be a problem that occurs when there are many flb files.
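
(For anyone attempting the same workaround, a rough sketch of how the backlog can be inspected and trimmed. The path matches the config in this issue; stop Fluent Bit first, and note the delete is destructive, since those buffered logs will never reach Elasticsearch:)

# count buffered chunks
find /var/log/flb-storage/tail.0/ -name '*.flb' | wc -l
# remove chunks older than 7 days (destructive)
find /var/log/flb-storage/tail.0/ -name '*.flb' -mtime +7 -delete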

The issue for us has not been resolved with 2.0.8. What I haven’t figured out is how to reliably reproduce this. What happens is:

  1. something causes elastic to stop accepting ingestion
  2. fluent bit buffers to storage unable to send to elastic
  3. over some period of time … this is the part I haven’t been able to determine
  4. we fix the elastic problem
  5. fluent-bit is stuck in “retry”

Last night Elastic was stuck in read-only for about 5 hours due to a full filesystem (it's a non-critical system, so we don’t have any weekend/overnight paging expectations).

After fixing the Elastic problem, existing fluent-bit pods were all stuck on retry, with no output records being processed. After restarting fluent-bit it will work through the backlog, but one of the issues is that our fluent-bit throughput is considerably faster than Elastic can process (6000+ records per second), which results in Elastic throttling (HTTP 429 response codes). This can also eventually get fluent-bit stuck, unable to send requests (see step 3 above; I have not been able to identify a trigger. Is it time based? Some other criteria?), and the only fix I’ve found is to restart the fluent-bit pods.

The symptom is that fluent-bit gets stuck, with logging like this:

[2023/03/26 17:36:59] [ info] [task] re-schedule retry=0x7f41f0bce430 1204 in the next 1 seconds
[2023/03/26 17:36:59] [ info] [task] re-schedule retry=0x7f41f0bcf4a0 1205 in the next 1 seconds
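
(Something we have not verified yet, but the retry back-off window is tunable via the scheduler options in [SERVICE], which might spread out the retry storm after an outage instead of hammering Elastic; the values below are illustrative only:)

[SERVICE]
  # retry wait grows from roughly scheduler.base seconds up to scheduler.cap seconds
  scheduler.base 5
  scheduler.cap  300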

We’ve been running fluent-bit 2.0.8 for about a month now and things seem more stable. We haven’t had a catastrophic outage of Elastic like the ones that seemed to trigger the problem before (Elastic being down for hours at a time), but we have had Elastic down for 20-30 minutes at a time, and so far we haven’t seen fluent-bit 2.0.8 get stuck. We’re planning to implement a failure situation - shut down Elastic, or block its port from the client - for a few hours at a time to see if the issue still occurs, but it’s a little lower on our priority list.

Hello @awangptc,

If a limit is reached what happens to the flb files?

flb files are no longer retried and are deleted from the storage.

Are you all using the latest image? We’ve recently (last month) merged some fixes related to irrecoverable chunks handling.

I’m having a hard time collecting the logs when this occurs with debug on. The issue is that when the error condition occurs, our fluent-bit logs are no longer being sent to Elastic, but the sheer volume of debug logs causes our k8s nodes to rotate the logs more quickly than I can pull them. I’m sure they’re in flb files, but I don’t know how to extract them.
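
(One possible way around the rotation problem, untested here: Fluent Bit can write its own log to a file instead of stderr via the service-level Log_File option. The path below is a placeholder and would need a hostPath mount to survive pod restarts:)

[SERVICE]
  Log_Level debug
  # write fluent-bit's own log to a file so it is not lost to container log rotation
  Log_File  /var/log/fluent-bit-debug.log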

The scenario that causes the problem for us is that something causes Elasticsearch to start rejecting requests - we have seen two conditions that cause this:

  1. excessive load resulting in elastic sending back 429s to requests
  2. elastic ran out of disk space

When these occur, fluent-bit stops sending logs to Elastic but continues to read from tail into new flb files.

If we restart fluent-bit it attempts to reprocess the flb files, but if there are multiple nodes, restarting all of the fluent-bit pods will very quickly overwhelm Elastic and Elastic will start to throttle… 429s.

This results in a high retry/failure rate. We have our retry limit set to 10. If the limit is reached, what happens to the flb files? Do they get removed for that transaction? If not, that might be part of our issue. I’ve discovered that if I rotate each fluent-bit pod ONE at a time, Elastic does not get overloaded and it is able to work through the backlog of flb files. But that’s a time-consuming process, of course.
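
(If it helps anyone: rotating the pods one node at a time can be done with a rolling restart of the DaemonSet, since the default RollingUpdate strategy with maxUnavailable=1 only takes down one pod at a time. The namespace and DaemonSet name below are placeholders:)

# rolling restart, one pod at a time under the default update strategy
kubectl -n logging rollout restart daemonset/fluent-bit
kubectl -n logging rollout status daemonset/fluent-bit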

@lecaros

Hi everyone, in order to address this, we need a reproducible scenario. If any of you have repro steps and can answer other questions posted to troubleshoot this, please provide them and we’ll gladly assist.

One of the problems I have with this is that it occurs only when Elastic has issues, and we have not been able to figure out how to force the issue. It could run for days or weeks without seeing the problem, and then when Elastic has issues that cause rejections, this occurs. It’s difficult to run fluent-bit with debug logging enabled for the length of time necessary to reproduce this. It does appear to occur when Elastic prevents ingestion for some significant period of time (at least a few hours). Is there anything you can tell us about how fluent-bit handles retrying flb chunks? It’s interesting that the workarounds above of deleting older flb files seem to unblock things. Is there anything from that that would help us gather more targeted debugging?