fluent-bit: rewrite_tag with filesystem storage takes huge amount of space on disk, causing disk pressure and eviction

Bug Report

Describe the bug

I have fluent-bit configured like this:

    [SERVICE]
        Flush                     5
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               On
        HTTP_Listen               0.0.0.0
        HTTP_Port                 2020
        coro_stack_size           245760
        storage.path              /flb-logs/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 50M
  • rewrite_tag filter
    [FILTER]
        Name                  rewrite_tag
        Match                 stackdriver.tcp.all.*
        Emitter_Storage.type  filesystem
        Rule                  $app ^(.*)$  stackdriver.tcp.$app.$version.$podName false
        Emitter_Name          app-version-pod-emitted

After a few days (2-3) with a log-ingestion rate of 100 MB/sec, I can see that the storage occupied on disk by the emitter is huge: 30-50 GB.

/ # du -h /flb-logs
55.7G	/flb-logs/flb-storage/emitter.13
55.7G	/flb-logs/flb-storage
55.7G	/flb-logs
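
As a diagnostic (a sketch I would try, not something I have run yet): checking the timestamps of the chunk files under the emitter directory should show whether this is recent backlog or old chunks that are simply never purged, for example:

    / # ls -lt /flb-logs/flb-storage/emitter.13 | tail -n 5   # oldest chunk files
    / # ls /flb-logs/flb-storage/emitter.13 | wc -l           # number of chunks on disk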

I also checked a pod that had been running for 26 hours, and the size used there was much smaller.

/ # du -h /flb-logs
293.0M	/flb-logs/flb-storage/emitter.13
293.0M	/flb-logs/flb-storage
293.0M	/flb-logs

At some point, pods on k8s are evicted because of disk pressure, which causes log loss.

It seems that in some scenarios the emitter storage is not cleaned up. Even if the output cannot process records for some time, 55 GB at a 100 MB/s ingestion rate corresponds to only about 10 minutes of data, so the backlog on disk is far larger than any plausible outage window.
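
Since HTTP_Server is already on (port 2020), one way to observe this (a sketch, assuming the storage metrics option behaves as documented; it is not enabled in the config above) is to turn on storage.metrics in the SERVICE section and poll the storage endpoint, which should report chunk counts per input, including the emitter:

    [SERVICE]
        # assumption: exposes storage-layer metrics on the built-in HTTP server
        storage.metrics           on

    / # curl -s http://127.0.0.1:2020/api/v1/storage

That would make it easier to tell whether the emitter's chunks pile up because the downstream output is paused/retrying, or because they are never released at all.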

To Reproduce

I don't have steps to reproduce just yet, because this is observed on the production environment under high load.

Expected behavior

My expectation is that the storage used by the emitter will be reasonably small (not more than a few GBs), and if there is any problem cleaning it up, some logs should indicate what is causing it. Disk pressure causes immediate pod eviction on k8s and the logs are lost. This is not acceptable for a production environment; please advise if this is a configuration issue.
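
The only storage-size knob I'm aware of is the per-output storage.total_limit_size property. A minimal sketch, assuming the stackdriver output section looks roughly like this (the real output configuration is not shown above, and I'm not sure whether this limit also bounds the emitter.* directory):

    [OUTPUT]
        Name                      stackdriver
        Match                     stackdriver.tcp.*
        # assumption: caps the filesystem chunks queued for this output;
        # once the limit is hit, the oldest chunks are discarded (data loss).
        storage.total_limit_size  2G

If that is the intended way to bound disk usage here, it would at least trade unbounded growth for an explicit, visible drop policy.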

Your Environment

  • Fluent-bit version used: 1.7.7
  • K8s version: 1.17.17 on GKE
  • Configuration: fluent-bit pods requesting 0.4 cpu and 200Mi RAM
  • Filters and plugins: tcp and forward inputs + lua and rewrite_tag filters + stackdriver output.

Additional context

We ran a very old customised version (1.0.1) for 2 years; we have now moved to 1.7.7 and added the tag rewriting. The environment is not very stable: pods are being evicted, we experience some log drops, and the stackdriver output plugin loses its connection much more frequently. This is too risky for the production environment.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 20 (16 by maintainers)

Most upvoted comments

This GitHub Actions bot really doesn’t help with closing these issues that are still valid and actively occurring…