fluent-bit: chunks are getting stuck
Bug Report
Describe the bug: When I configure retry forever in the output plugins and the output destination is unreliable, chunks sometimes get stuck.
I see messages like
[engine] failed to flush chunk '1-1612396545.855856569.flb', retry in 1485 seconds: task_id=143, input=forward.0 > output=tcp.0
but sometimes this is the last I see of the chunk. If I send the CONT signal to fluent-bit, I can see that it still holds those chunks.
To Reproduce: I currently have a Java dummy application (if required, I can try to trim it down so I can share it easily) which starts 25 threads. Once all threads are started, each thread sends 2570 log messages (no ack, each around 4500 bytes) to fluent-bit via https://github.com/komamitsu/fluency. Between messages, each thread sleeps 10 ms. The application is a webserver which logs a couple of lines on startup; the actual log messages I trigger via curl. This application runs on my laptop.
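For orientation, here is a rough Python stand-in for that traffic pattern. This is a hypothetical sketch, not the actual Java/fluency app: it assumes the msgpack package is installed and fluent-bit's forward input is listening on localhost:24225, and the tag, record keys, and payload are made up (the "log" and "logdest" keys only mirror the output config below).
#!/usr/bin/env python3
# Hypothetical stand-in for the Java/fluency load generator described above.
# Sends forward-protocol "message mode" events: [tag, timestamp, record].
import socket
import threading
import time

import msgpack  # pip install msgpack

HOST, PORT = "127.0.0.1", 24225       # fluent-bit forward input
THREADS, MESSAGES, SLEEP_S = 25, 2570, 0.01
PAYLOAD = "x" * 4500                  # roughly the message size used in the test

def worker(thread_id):
    with socket.create_connection((HOST, PORT)) as sock:
        for seq in range(MESSAGES):
            record = {"log": PAYLOAD, "logdest": "test", "thread": thread_id, "seq": seq}
            sock.sendall(msgpack.packb(["dummy.app", int(time.time()), record]))
            time.sleep(SLEEP_S)       # 10 ms between messages, no ack requested

threads = [threading.Thread(target=worker, args=(t,)) for t in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()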
My fluent-bit config
[SERVICE]
    HTTP_Server     On
    HTTP_Listen     0.0.0.0
    HTTP_PORT       2020
    Flush           2
    log_level       TRACE
    storage.path    /fluent-bit/log
    storage.metrics on

[INPUT]
    Name              forward
    Listen            0.0.0.0
    Port              24225
    Buffer_Chunk_Size 64k
    Buffer_Max_Size   2M

[OUTPUT]
    Name                       syslog
    Match                      *
    host                       TARGET_IP
    port                       514
    mode                       tcp
    syslog_format              rfc5424
    syslog_message_key         log
    Syslog_maxsize             65534
    syslog_appname_key         logdest
    # retry forever
    retry_Limit                False
    net.connect_timeout        30
    net.keepalive              on
    net.keepalive_idle_timeout 10
    storage.total_limit_size   1G
My dockerfile
FROM fluent/fluent-bit:1.6.10
ADD fluent-bit.conf /fluent-bit/etc/fluent-bit.conf
Started via
docker build --no-cache -f Dockerfile -t ruwen-fluent-bit-bug .
docker run -p 24225:24225 -it ruwen-fluent-bit-bug:latest
on my laptop.
I connect to TARGET_IP via a VPN. TARGET_IP is a t2.micro with rsyslog.
Now the following is happening:
- I connect to the VPN
- rsyslog is already running on TARGET_IP
- I start fluentbit on my laptop
- I start the webserver on my laptop
- I disconnect the vpn
- I trigger the endpoint to send all the log messages
- Once all messages are sent (the last few might still be buffered and sent a second later), I reconnect to the VPN
- Now fluent-bit is successfully sending some of the messages. The little rsyslog box is sometimes a bit overwhelmed, so more retries are needed.
- The VPN sometimes dies during this process, so I have to reconnect.
- I keep an eye on the logs. Since the logs tell me when a retry is expected, I roughly know when to wait and when not, and I always wait some additional time.
This way I can reproduce the issue reliably. It produces the following log: fluentbit_syslog.log
I tried to reproduce it by running syslog locally and using iptables to sometimes drop packets and sometimes not, without success. But I managed to reproduce the same problem with the tcp output plugin; I tried that to make sure the problem is unrelated to the syslog output plugin.
This is the config I used for that:
[SERVICE]
    HTTP_Server     On
    HTTP_Listen     0.0.0.0
    HTTP_PORT       2020
    Flush           2
    log_level       TRACE
    storage.path    /fluent-bit/log
    storage.metrics on

[INPUT]
    Name              forward
    Listen            0.0.0.0
    Port              24225
    Buffer_Chunk_Size 64k
    Buffer_Max_Size   2M

[OUTPUT]
    Name                       tcp
    Match                      *
    host                       TARGET_IP
    port                       514
    format                     json
    # retry forever
    retry_Limit                False
    net.connect_timeout        30
    net.keepalive              on
    net.keepalive_idle_timeout 10
    storage.total_limit_size   1G
I followed the same procedure as described. On my AWS box I initially used nc, which stopped after each connection. Then I used it with the -k option, but that didn't help. I ended up installing ncat and running
ncat -l -k 514 > /dev/null
All the nc restarting caused even more havoc: with the syslog test I normally get 1 stuck chunk, but now I got even more. The resulting logfile: fluentbit_tcp.log
To see which chunks are missing, I wrote a little Python script which parses those logs. It might be helpful.
#!/usr/bin/env python3
# List the chunks that failed to flush and never succeeded afterwards.
outstanding_chunks_to_error_msgs = dict()
with open('/tmp/fluentbit.log', 'r') as f:
    for line in f:
        line = line.strip()
        # drop the ANSI bold bracket sequences from fluent-bit's colored output
        line = line.replace("\x1b[1m[\x1b[0m", "").replace("\x1b[1m]\x1b[0m", "")
        if 'failed to flush chunk' in line:
            chunk = line.split("'")[1]
            errors = outstanding_chunks_to_error_msgs.get(chunk)
            parts = line.split(" ")
            # keep the timestamp and the announced retry delay
            error_msg = f"{parts[0]} {parts[1]}: {parts[12]} {parts[13][:-1]}"
            if errors:
                errors.append(error_msg)
            else:
                outstanding_chunks_to_error_msgs[chunk] = [error_msg]
        elif 'flush chunk' in line and 'succeeded at retry' in line:
            chunk = line.split("'")[1]
            # pop instead of del, so a success without a recorded failure doesn't crash
            outstanding_chunks_to_error_msgs.pop(chunk, None)

for chunk, errors in outstanding_chunks_to_error_msgs.items():
    print(f"{chunk}: {errors}")
print(f"{len(outstanding_chunks_to_error_msgs)} missing")
Expected behavior: Chunks should be retried until they succeed.
Your Environment
- Version used: Docker fluent/fluent-bit:1.6.10. I also tried Docker fluent/fluent-bit:1.7.0-rc3, without success; since some log messages are now missing there, tracing the issue is harder (see https://github.com/fluent/fluent-bit/issues/3001)
- Configuration: see reproduce section
- Environment name and version: local laptop
- Server type and version: 16 GB RAM, 8 Cores (Intel® Core™ i7-8650U CPU @ 1.90GHz), SSD
- Operating System and version:
uname -a
Linux ruwenwork 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 GNU/Linux
- Filters and plugins: see reproduce section
Additional context: How has this issue affected you? What are you trying to accomplish?
I am currently building a logging pipeline using fluent-bit. My goal is to be able to ingest 256 MB/minute (that's peak; I normally don't need that much) and to be able to buffer data for a while when upstream is having issues. I am currently not imposing a memory limit; I didn't get that far in my tests.
Please let me know if you need any further details. I am more than happy to build any other branch and try my luck with it.
Cheers Ruwen
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 76 (32 by maintainers)
I don't understand why nobody is taking care of this problem, which is quite evidently affecting a lot of users.
@edsiper @fujimotos @koleini could you please take a look?
Hey folks, an update: @elruwen gave us a really nice repro and we will be freeing up after fixing some of the odd DNS timeout issues in 1.8.7. Realistically, we will have some more details after FluentCon NA (Oct 12). I'll also bring this issue up in the community meeting on Thursday, Sept 30 @ 9:00 AM PST.
The same happens with 1.8.3; we are stuck with 1.6.10, the last working version.
Still seeing the same issue with 1.8.1.
Amazing job @leonardo-albertovich, both branches pass the tests!
The log messages to notify that the retry was successful are still missing (see https://github.com/fluent/fluent-bit/issues/3001).
And are you guys still interested in the tests? https://github.com/fluent/fluent-bit/pull/4162
@elruwen thanks for running that test. When the chunk couldn't be flushed, were you seeing any error codes from the end output (syslog server / Splunk or Elasticsearch), such as a 400 bad message error? Also, do you have some information regarding the I/O on the machine as the chunks get stuck? I'm looking for patterns.
@pierluigilenoci I am looking at this 😃. A few folks are working on DNS TCP/UDP resolution improvements, I/O improvements, and chunk optimizations within the 1.8.x series. We do need to do a better job of showcasing what folks are working on, and we welcome ideas from the community on how best to demonstrate that. I would ask you to join our next community meeting (last Thursday of each month) with suggestions on how best to view what folks are working on, as well as ways you could help contribute to that process.
Is there anything else I can do so that this gets addressed? I am happy to create an end-to-end runnable test in Java which can run as part of the build, but since I don't know C, I would write it in Java. Not sure how keen you are on having Java in your project. I could try my luck in Go, but I have never written something like that in Go before… 😉
I used 1.7.6 with and without the workers statement. In both cases the result is the same and way worse than before: before, only a few chunks got stuck; now none of them are retried.
Previously a log message was emitted which said something like
[engine] failed to flush chunk '1-1613013266.34830441.flb', retry in 9 seconds: task_id=0, input=tcp.0 > output=tcp.0
(see https://github.com/fluent/fluent-bit/blob/master/src/flb_engine.c#L281). This log message is completely missing. I attached the logs from the run without workers being set: log-1.7.6.zip
@agup006 any update?
I just tried 1.8.3.
As said before, happy to help out further if I can somehow.
I did that and my results are:
I ran tcpdump for 20 minutes after the last chunk arrived for 1.7.7, but no further traffic. I always got 26 or 27 chunks missing.
I was wondering if it was a problem with nc, so I wrote a small Java server (see the sketch of such a sink below). Exactly the same issue.
I removed storage, same problem exists.
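For reference, a minimal sketch of such a discard-everything sink, written in Python here rather than Java and purely illustrative (not the server mentioned above): it listens on port 514 like the rsyslog/ncat targets earlier and throws away whatever it receives.
#!/usr/bin/env python3
# Illustrative TCP sink, roughly equivalent to `ncat -l -k 514 > /dev/null`:
# accept connections on port 514 and discard everything that arrives.
import socket
import threading

def drain(conn):
    with conn:
        while conn.recv(65536):   # recv returns b'' when the peer closes
            pass

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", 514))        # ports below 1024 need root
srv.listen()
while True:
    conn, _ = srv.accept()
    threading.Thread(target=drain, args=(conn,), daemon=True).start()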
I encountered similar issues here: https://github.com/fluent/fluent-bit/issues/3299. But even after I disable retry forever, the logs (old or new chunks) are not being sent anymore and just gather in memory until the process crashes due to OOM. The problem persists even with the latest version. My only workaround was to increase Emitter_Mem_Buf_Limit for each filter to very high values (like a few GB) and pray that the buffer doesn't fill up, otherwise no logs get sent to ES anymore.
Problem still exists with 1.7.4. My other ticket (the one about the log message) is also still happening.
https://github.com/fluent/fluent-bit/blob/master/src/flb_engine.c#L198
I never see that message, even though I used to see it. Maybe that helps point in the right direction.
I tested it the last time with 1.7.3. Still an issue.
@pierluigilenoci Sorry, my bad. The code has been merged into master by a maintainer: https://github.com/fluent/fluent-bit/commit/1938ec845255dd27a1b0ad39e1654b3b351277be