fluent-bit: chunks are getting stuck
Bug Report
Describe the bug: When I configure retry forever in the output plugins and the output destination is unreliable, chunks sometimes get stuck.
I see messages like
[engine] failed to flush chunk '1-1612396545.855856569.flb', retry in 1485 seconds: task_id=143, input=forward.0 > output=tcp.0
but sometimes this is the last I see of the chunk. If I send the CONT signal to fluent-bit, I can see that it still holds those chunks.
To Reproduce: I currently have a Java dummy application (if required, I can try to trim it down so I can share it easily) which starts 25 threads. Once all threads are started, each thread sends 2570 log messages (no ack, each around 4500 bytes) to fluent-bit via https://github.com/komamitsu/fluency. Between messages, each thread sleeps 10 ms. The application is a webserver which logs a couple of lines on startup; the actual log messages I trigger via curl. This application runs on my laptop.
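For orientation, here is a rough Python stand-in for that traffic pattern. This is a hypothetical sketch, not the actual Java/fluency app: it assumes the msgpack package is installed and fluent-bit's forward input is listening on localhost:24225, and the tag, record keys, and payload are made up (the "log" and "logdest" keys only mirror the output config below).
#!/usr/bin/env python3
# Hypothetical stand-in for the Java/fluency load generator described above.
# Sends forward-protocol "message mode" events: [tag, timestamp, record].
import socket
import threading
import time

import msgpack  # pip install msgpack

HOST, PORT = "127.0.0.1", 24225       # fluent-bit forward input
THREADS, MESSAGES, SLEEP_S = 25, 2570, 0.01
PAYLOAD = "x" * 4500                  # roughly the message size used in the test

def worker(thread_id):
    with socket.create_connection((HOST, PORT)) as sock:
        for seq in range(MESSAGES):
            record = {"log": PAYLOAD, "logdest": "test", "thread": thread_id, "seq": seq}
            sock.sendall(msgpack.packb(["dummy.app", int(time.time()), record]))
            time.sleep(SLEEP_S)       # 10 ms between messages, no ack requested

threads = [threading.Thread(target=worker, args=(t,)) for t in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()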
My fluent-bit config
[SERVICE]
    HTTP_Server     On
    HTTP_Listen     0.0.0.0
    HTTP_PORT       2020
    Flush           2
    log_level       TRACE
    storage.path    /fluent-bit/log
    storage.metrics on

[INPUT]
    Name              forward
    Listen            0.0.0.0
    Port              24225
    Buffer_Chunk_Size 64k
    Buffer_Max_Size   2M

[OUTPUT]
    Name                       syslog
    Match                      *
    host                       TARGET_IP
    port                       514
    mode                       tcp
    syslog_format              rfc5424
    syslog_message_key         log
    Syslog_maxsize             65534
    syslog_appname_key         logdest
    # retry forever
    retry_Limit                False
    net.connect_timeout        30
    net.keepalive              on
    net.keepalive_idle_timeout 10
    storage.total_limit_size   1G
My dockerfile
FROM fluent/fluent-bit:1.6.10
ADD fluent-bit.conf /fluent-bit/etc/fluent-bit.conf
Started via
docker build --no-cache -f Dockerfile -t ruwen-fluent-bit-bug .
docker run -p 24225:24225 -it ruwen-fluent-bit-bug:latest
on my laptop.
I connect to TARGET_IP via a VPN. TARGET_IP is a t2.micro with rsyslog.
Now the following is happening:
- I connect to the VPN
- rsyslog is already running on TARGET_IP
- I start fluentbit on my laptop
- I start the webserver on my laptop
- I disconnect the vpn
- I trigger the endpoint to send all the log messages
- Once all messages are sent (the last few might still be buffered and sent a second later), I reconnect to the VPN
- Now fluent-bit is successfully sending some of the messages. The little rsyslog box is sometimes a bit overwhelmed, so more retries are needed.
- The VPN sometimes dies during this process, so I have to reconnect.
- I keep an eye on the logs. Since the logs tell me when a retry is expected, I roughly know when to wait and when not, and I always wait some additional time.
This way I can reproduce the issue reliably. It produces the following log: fluentbit_syslog.log
I tried to reproduce it by running syslog locally and using iptables to sometimes drop packets and sometimes not, without success. But I managed to reproduce the same problem with the tcp output plugin; I tried that to make sure the problem is unrelated to the syslog output plugin.
This is the config I used for that:
[SERVICE]
    HTTP_Server     On
    HTTP_Listen     0.0.0.0
    HTTP_PORT       2020
    Flush           2
    log_level       TRACE
    storage.path    /fluent-bit/log
    storage.metrics on

[INPUT]
    Name              forward
    Listen            0.0.0.0
    Port              24225
    Buffer_Chunk_Size 64k
    Buffer_Max_Size   2M

[OUTPUT]
    Name                       tcp
    Match                      *
    host                       TARGET_IP
    port                       514
    format                     json
    # retry forever
    retry_Limit                False
    net.connect_timeout        30
    net.keepalive              on
    net.keepalive_idle_timeout 10
    storage.total_limit_size   1G
I followed the same procedure as described. On my AWS box I initially used nc, which stopped after each connection. Then I used it with the -k option, but that didn't help. I ended up installing ncat and running
ncat -l -k 514 > /dev/null
All the nc restarting caused even more havoc: with the syslog test I normally get 1 stuck chunk, but now I got even more. The resulting logfile: fluentbit_tcp.log
To see which chunks are missing, I wrote a little Python script which parses those logs. It might be helpful.
#!/usr/bin/env python3
# List the chunks that failed to flush and never succeeded afterwards.
outstanding_chunks_to_error_msgs = dict()
with open('/tmp/fluentbit.log', 'r') as f:
    for line in f:
        line = line.strip()
        # drop the ANSI bold bracket sequences from fluent-bit's colored output
        line = line.replace("\x1b[1m[\x1b[0m", "").replace("\x1b[1m]\x1b[0m", "")
        if 'failed to flush chunk' in line:
            chunk = line.split("'")[1]
            errors = outstanding_chunks_to_error_msgs.get(chunk)
            parts = line.split(" ")
            # keep the timestamp and the announced retry delay
            error_msg = f"{parts[0]} {parts[1]}: {parts[12]} {parts[13][:-1]}"
            if errors:
                errors.append(error_msg)
            else:
                outstanding_chunks_to_error_msgs[chunk] = [error_msg]
        elif 'flush chunk' in line and 'succeeded at retry' in line:
            chunk = line.split("'")[1]
            # pop instead of del, so a success without a recorded failure doesn't crash
            outstanding_chunks_to_error_msgs.pop(chunk, None)

for chunk, errors in outstanding_chunks_to_error_msgs.items():
    print(f"{chunk}: {errors}")
print(f"{len(outstanding_chunks_to_error_msgs)} missing")
Expected behavior: Chunks should be retried until they succeed.
Your Environment
- Version used: Docker fluent/fluent-bit:1.6.10. I also tried Docker fluent/fluent-bit:1.7.0-rc3, without success; since some log messages are now missing there, tracing the issue is harder (see https://github.com/fluent/fluent-bit/issues/3001)
- Configuration: see reproduce section
- Environment name and version: local laptop
- Server type and version: 16 GB RAM, 8 Cores (Intel® Core™ i7-8650U CPU @ 1.90GHz), SSD
- Operating System and version:
uname -a
Linux ruwenwork 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 GNU/Linux
- Filters and plugins: see reproduce section
Additional context: How has this issue affected you? What are you trying to accomplish?
I am currently building a logging pipeline using fluent-bit. My goal is to be able to ingest 256 MB/minute (that's peak; I normally don't need that much) and to be able to buffer data for a while when upstream is having issues. I am currently not imposing a memory limit; I didn't get that far in my tests.
Please let me know if you need any further details. I am more than happy to build any other branch and try my luck with it.
Cheers Ruwen
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 76 (32 by maintainers)
I don't understand why nobody is taking care of this problem, which is quite evidently affecting a lot of users.
@edsiper @fujimotos @koleini could you please take a look?
Hey folks, an update: @elruwen gave us a really nice repro and we will be freeing up after fixing some of the odd DNS timeout issues in 1.8.7. Realistically, we will have some more details after FluentCon NA (Oct 12). I'll also bring this issue up in the community meeting on Thursday, Sept 30 @ 9:00 AM PST.
The same happens with 1.8.3; we are stuck with 1.6.10, the last working version.
Still seeing the same issue with 1.8.1.
Amazing job @leonardo-albertovich, both branches pass the tests!
The log messages to notify that the retry was successful are still missing (see https://github.com/fluent/fluent-bit/issues/3001).
And are you guys still interested in the tests? https://github.com/fluent/fluent-bit/pull/4162
@elruwen thanks for running that test. When the chunk couldn't be flushed, were you seeing any error codes from the end output (syslog server / Splunk or Elasticsearch), such as a 400 bad message error? Also, do you have some information regarding the I/O on the machine as the chunks get stuck? I'm looking for patterns.
@pierluigilenoci I am looking at this 😃. A few folks are working on DNS TCP/UDP resolution improvements, I/O improvements, and chunk optimizations within the 1.8.x series. We do need to do a better job of showcasing what folks are working on, and we welcome ideas from the community on how best to demonstrate that. I would ask you to join our next community meeting (last Thursday of each month) with suggestions on how best to view what folks are working on, as well as ways you could help contribute to that process.
Is there anything else I can do so that this gets addressed? I am happy to create an end-to-end runnable test in Java which can run as part of the build, but since I don't know C, I would write it in Java. Not sure how keen you are on having Java in your project. I could try my luck in Go, but I have never written something like that in Go before… 😉
I used 1.7.6 with and without the workers statement. In both cases the result is the same and way worse than before: before, only a few chunks got stuck; now none of them are retried.
Previously a log message was emitted which said something like
[engine] failed to flush chunk '1-1613013266.34830441.flb', retry in 9 seconds: task_id=0, input=tcp.0 > output=tcp.0
(see https://github.com/fluent/fluent-bit/blob/master/src/flb_engine.c#L281). This log message is completely missing. I attached the logs from the run without workers being set: log-1.7.6.zip
@agup006 any update?
I just tried 1.8.3.
As said before, happy to help out further if I can somehow.
I did that and my results are:
I ran tcpdump for 20 minutes after the last chunk arrived for 1.7.7, but no further traffic. I always got 26 or 27 chunks missing.
I was wondering if it was a problem with nc, so I wrote a small Java server (see the sketch of such a sink below). Exactly the same issue.
I removed storage, same problem exists.
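For reference, a minimal sketch of such a discard-everything sink, written in Python here rather than Java and purely illustrative (not the server mentioned above): it listens on port 514 like the rsyslog/ncat targets earlier and throws away whatever it receives.
#!/usr/bin/env python3
# Illustrative TCP sink, roughly equivalent to `ncat -l -k 514 > /dev/null`:
# accept connections on port 514 and discard everything that arrives.
import socket
import threading

def drain(conn):
    with conn:
        while conn.recv(65536):   # recv returns b'' when the peer closes
            pass

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", 514))        # ports below 1024 need root
srv.listen()
while True:
    conn, _ = srv.accept()
    threading.Thread(target=drain, args=(conn,), daemon=True).start()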
I encountered similar issues here: https://github.com/fluent/fluent-bit/issues/3299. But even after I disable retry forever, the logs (old or new chunks) are not being sent anymore and just gather in memory until the process crashes due to OOM. The problem persists even with the latest version. My only workaround was to increase Emitter_Mem_Buf_Limit for each filter to very high values (like a few GB) and pray that the buffer doesn't fill up, otherwise no logs get sent to ES anymore.
Problem still exists with 1.7.4. My other ticket (the one about the log message) is also still happening.
https://github.com/fluent/fluent-bit/blob/master/src/flb_engine.c#L198
I never see that message, even though I used to see it. Maybe that helps point in the right direction.
I tested it the last time with 1.7.3. Still an issue.
@pierluigilenoci Sorry, my bad. The code has been merged into master by a maintainer: https://github.com/fluent/fluent-bit/commit/1938ec845255dd27a1b0ad39e1654b3b351277be