fluent-bit: Fluent-bit stops processing logs under high load, blocks writing to the ch_manager pipe
Bug Report
My fluent-bit 1.5.7 runs in a container in a Kubernetes (AKS) environment. It is configured to collect Docker logs (33 tail inputs), run them through a few filters, and send them to Elasticsearch (33 outputs). Recently, as the volume of logs per node increased, fluent-bit started sporadically freezing: the process keeps running, consumes 0% CPU, and processes neither new logs nor the filesystem-storage backlog, yet it still responds to monitoring queries on its HTTP port. After some debugging with strace and gdb, I found that it hangs while attempting to write to the ch_manager pipe. Here is the stack trace:
0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1 0x000055a127ecad01 in flb_output_return (ret=1, th=0x7fac548cd240) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:545
#2 0x000055a127ecade9 in flb_output_return_do (x=1) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:576
#3 0x000055a127eccca1 in cb_es_flush (data=0x7fac5991e0b2, bytes=3842,
tag=0x7fac5437fd20 "kube.gws.gws-platform-datacollector-blue-6c474d7c84-wsdcw.gws-platform-datacollector-blue.2db22adbd3daaaf836a3f6311f4b3e5ad9ec7727280458ac68868419fb758ab9", tag_len=154,
ins=0x7fac5949e480, out_context=0x7fac56c6bc00, config=0x7fac59432c80) at /tmp/fluent-bit/plugins/out_es/es.c:748
#4 0x000055a127e72649 in output_pre_cb_flush () at /tmp/fluent-bit/include/fluent-bit/flb_output.h:449
#5 0x000055a1282a6907 in co_init () at /tmp/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6 0x3039663238613165 in ?? ()
strace output (filtered to reads from fd 13 and writes to fd 14, which are the ch_manager pipe fds):
...
write(14, "\0\200\1\20\2\0\0\0", 8) = 8
read(13, "\0\0\36\20\2\0\0\0", 8) = 8
read(13, "\0\0\27\20\2\0\0\0", 8) = 8
read(13, "\0\300\37\20\2\0\0\0", 8) = 8
write(14, "\0\200\25\20\2\0\0\0", 8) = 8
read(13, "\0\200\17\20\2\0\0\0", 8) = 8
write(14, "\0\200\17\20\2\0\0\0", 8) = 8
write(14, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0\300\37\20\2\0\0\0", 8) = 8
read(13, "\0\0\33\20\2\0\0\0", 8) = 8
read(13, "\0\300\r\20\2\0\0\0", 8) = 8
write(14, "\0\0\36\20\2\0\0\0", 8) = 8
read(13, "\0\300\24\20\2\0\0\0", 8) = 8
write(14, "\0\300\r\20\2\0\0\0", 8) = 8
write(14, "\0\300\24\20\2\0\0\0", 8) = 8
read(13, "\0\200\1\20\2\0\0\0", 8) = 8
write(14, "\0\0\33\20\2\0\0\0", 8) = 8
read(13, "\0\200\25\20\2\0\0\0", 8) = 8
read(13, "\0\200\17\20\2\0\0\0", 8) = 8
write(14, "\0\200\17\20\2\0\0\0", 8) = 8
read(13, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0@\16\20\2\0\0\0", 8) = 8
write(14, "\0\200\1\20\2\0\0\0", 8) = 8
read(13, "\0\300\37\20\2\0\0\0", 8) = 8
read(13, "\0\0\36\20\2\0\0\0", 8) = 8
write(14, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0\0\36\20\2\0\0\0", 8) = 8
write(14, "\0\300\37\20\2\0\0\0", 8) = 8
read(13, "\0\300\r\20\2\0\0\0", 8) = 8
read(13, "\0\300\24\20\2\0\0\0", 8) = 8
read(13, "\0\0\33\20\2\0\0\0", 8) = 8
write(14, "\0\300\r\20\2\0\0\0", 8) = 8
read(13, "\0\200\17\20\2\0\0\0", 8) = 8
read(13, "\0@\16\20\2\0\0\0", 8) = 8
read(13, "\0\200\1\20\2\0\0\0", 8) = 8
write(14, "\0\300\24\20\2\0\0\0", 8) = 8
write(14, "\0\200\1\20\2\0\0\0", 8) = 8
write(14, "\0@\16\20\2\0\0\0", 8) = 8
read(13, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0\0\33\20\2\0\0\0", 8) = 8
read(13, "\0\0\36\20\2\0\0\0", 8) = 8
read(13, "\0\300\37\20\2\0\0\0", 8) = 8
write(14, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0\0\36\20\2\0\0\0", 8) = 8
read(13, "\0\300\r\20\2\0\0\0", 8) = 8
write(14, "\0\300\37\20\2\0\0\0", 8) = 8
read(13, "\0\300\24\20\2\0\0\0", 8) = 8
read(13, "\0\200\1\20\2\0\0\0", 8) = 8
read(13, "\0@\16\20\2\0\0\0", 8) = 8
write(14, "\0\200\1\20\2\0\0\0", 8) = 8
write(14, "\0\300\r\20\2\0\0\0", 8) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
It looks like the Elasticsearch outputs may post so many responses for the inputs at the same time that the pipe fills up and write() blocks. But the inputs run in the same thread, so they can never read those responses back out of the pipe, and fluent-bit locks up.
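To make the failure mode concrete, here is a minimal standalone sketch (not fluent-bit code) of a single thread that is both the only reader and a writer of a pipe carrying 8-byte event messages: once the pipe buffer is full, a blocking write() can never return, because the only thread that could drain the pipe is the one stuck inside write(). The demo uses O_NONBLOCK so it reports the capacity instead of actually hanging:

    /* Standalone demo, not fluent-bit code: one thread is both the only
     * reader and a writer of a pipe carrying 8-byte event messages. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        uint64_t event = 0;
        long queued = 0;

        if (pipe(fds) != 0) {
            return 1;
        }

        /* Probe with O_NONBLOCK so the demo reports the capacity instead of
         * hanging; with the default blocking fd this loop would simply stop
         * returning once the buffer is full, exactly like the stuck write()
         * in the stack trace above. */
        fcntl(fds[1], F_SETFL, O_NONBLOCK);

        while (write(fds[1], &event, sizeof(event)) == (ssize_t) sizeof(event)) {
            queued++;
            event++;
        }
        if (errno == EAGAIN) {
            printf("pipe filled after %ld 8-byte events\n", queued);
        }
        return 0;
    }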
I produced a dirty fix by making the ch_manager pipe non-blocking. I also tried enlarging the pipe (or at least querying its size), but ioctl failed to get or set the pipe size. See the snippet below; I added the last line.
flb_engine.c:

    /*
     * Create a communication channel: this routine creates a channel to
     * signal the Engine event loop. It's useful to stop the event loop
     * or to instruct anything else without break.
     */
    ret = mk_event_channel_create(config->evl,
                                  &config->ch_manager[0],
                                  &config->ch_manager[1],
                                  config);
    if (ret != 0) {
        flb_error("[engine] could not create manager channels");
        return -1;
    }

    flb_pipe_set_nonblocking(&config->ch_manager[1]); /* <----- I made it non-blocking ------- */
There is probably a cleaner way to fix it, but this one-liner worked for me. I now get an occasional “resource not available” error in the log, but fluent-bit survives and keeps crunching logs.
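For reference, a minimal sketch of what the workaround looks like in plain POSIX/Linux terms, outside of fluent-bit's flb_pipe wrappers (the tune_pipe helper and its name are just for illustration). Note that on Linux the pipe buffer is queried and resized with fcntl(F_GETPIPE_SZ / F_SETPIPE_SZ) rather than ioctl, and enlarging it may require CAP_SYS_RESOURCE once the per-user soft limit is exceeded, which is likely why resizing failed inside the unprivileged container:

    #define _GNU_SOURCE          /* for F_GETPIPE_SZ / F_SETPIPE_SZ */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* fd would be the write end of the ch_manager pipe (config->ch_manager[1]) */
    static int tune_pipe(int fd)
    {
        /* 1. The one-liner workaround: never block the engine thread in write().
         *    Writers must then be prepared to handle EAGAIN. */
        int flags = fcntl(fd, F_GETFL);
        if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
            perror("O_NONBLOCK");
            return -1;
        }

        /* 2. Querying/growing the pipe: on Linux this is fcntl, not ioctl. */
        printf("current pipe buffer: %d bytes\n", fcntl(fd, F_GETPIPE_SZ));

        /* May fail with EPERM in an unprivileged container once the per-user
         * soft limit is exceeded, unless CAP_SYS_RESOURCE is granted. */
        if (fcntl(fd, F_SETPIPE_SZ, 65536) < 0) {
            perror("F_SETPIPE_SZ");
        }
        return 0;
    }

    int main(void)
    {
        int fds[2];
        if (pipe(fds) != 0) {
            return 1;
        }
        return tune_pipe(fds[1]);
    }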
Environment
- Version used: 1.5.7 (container fluent/fluent-bit:1.5.7)
- Kubernetes: 1.16.13
- Docker: 3.0.10+azure
- OS: Ubuntu 16.04.1
About this issue
- State: open
- Created 4 years ago
- Reactions: 20
- Comments: 62 (18 by maintainers)
Commits related to this issue
- output: plugins now uses an independent channel for events (#2661) Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> — committed to fluent/fluent-bit by edsiper 4 years ago
- engine: manage output envents from indepentent channel (#2661) Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> — committed to fluent/fluent-bit by edsiper 4 years ago
- engine: manage output events from independent channel (#2661) Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> — committed to fluent/fluent-bit by edsiper 4 years ago
The issue is fixed by upgrading fluent-bit from v1.8 to v1.8.1. Upgrading to v1.7.9 also worked.
Seeing this issue as well. Any updates on this? Is there any information we can capture and provide to help get this issue moving toward a fix? It looks like it has been around for several releases and makes fluent-bit unreliable under high load, and unpredictably so. We are seeing this in multiple Kubernetes environments, and there is no viable workaround for detecting it that does not risk missing logs.
I believe we have been hitting this too. On a deployment of roughly 700 nodes across around 40 clusters, we were seeing around 8% of pods getting stuck. Attaching a debugger to the stuck instances showed a similar stack trace:
If we detach, wait, and reattach, it is still stuck there.
We think this may only be affecting nodes with a large number of tracked files, a large log volume, or both.
When we terminate the stuck pod and it gets replaced, the replacement is very likely to get stuck again within the next ten to fifteen minutes. We assume that this is because it lands on a node hosting particularly busy pods, or particularly many pods, but we are not sure.
We have tested the CAP_SYS_RESOURCE workaround on a subset of affected clusters and it does appear to remove the problem.
This is our current configuration:
I’ll note that the Mem_Buf_Limit is set very high, but that was an earlier attempt to fix the problem before we understood what was going wrong. The newrelic output plugin is this: https://github.com/newrelic/newrelic-fluent-bit-output
Some of the relevant environment variables:
The stack trace above is from 1.8.9 with Amazon’s patches, which we had been trying in case they fixed anything, but we have been having this problem with 1.8.12 too.
I am currently running into this issue with file buffering turned on, and I might have a theory about why some folks are hitting it.
This issue may or may not manifest depending on the pipe buffer size. You can end up with one of three pipe buffer sizes, depending on which Linux kernel you are running and whether you are hitting the kernel’s pipe buffer soft limits:
- The default pipe buffer size on Linux is 65536 bytes. Since the maximum number of tasks is 2048, this size easily handles all task events without the pipe ever blocking.
- If you are hitting the kernel’s pipe buffer soft limits, the pipe size might instead be 4096 bytes (https://github.com/torvalds/linux/blob/v4.18/fs/pipe.c#L642) or 8192 bytes (https://github.com/torvalds/linux/blob/v5.15/fs/pipe.c#L797). On CentOS/RHEL 7/8 it would be 4096 bytes, while newer kernels use 8192 bytes. Since each task event is 8 bytes, such a pipe can only buffer 512 or 1024 pending task events and has no room for the current maximum of 2048 tasks, so adding too many tasks/coroutines at once blocks the pipe on writing.
As a temporary workaround, you could add the CAP_SYS_RESOURCE capability to the fluent-bit Docker container, as that ensures the pipe buffer size is at least 65536 bytes.
A proper solution would be to cap the number of tasks that can run based on the pipe’s buffer size.
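A rough sketch of that idea (this is not actual fluent-bit code; ENGINE_MAX_TASKS and the helper name are made up for the example): query the channel's real buffer size at startup and derive how many in-flight tasks it can absorb, given that each completion event written back through the pipe is 8 bytes:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define ENGINE_MAX_TASKS 2048   /* assumed name for the 2048-task limit */

    /* Hypothetical helper: cap concurrent tasks to what the ch_manager pipe
     * can buffer, so a burst of completion events can never block write(). */
    static int engine_task_capacity(int ch_manager_write_fd)
    {
        int pipe_sz = fcntl(ch_manager_write_fd, F_GETPIPE_SZ);
        if (pipe_sz <= 0) {
            return ENGINE_MAX_TASKS;                   /* fall back to the default */
        }
        int events = pipe_sz / (int) sizeof(uint64_t); /* 8 bytes per task event */
        return events < ENGINE_MAX_TASKS ? events : ENGINE_MAX_TASKS;
    }

    int main(void)
    {
        int ch[2];
        if (pipe(ch) != 0) {
            return 1;
        }
        /* 65536-byte pipe -> capped at 2048; 8192 -> 1024; 4096 -> 512 */
        printf("task capacity for this pipe: %d\n", engine_task_capacity(ch[1]));
        return 0;
    }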
Awesome, thanks @grep4error for all the help. @edsiper, this is affecting more people: https://github.com/fluent/fluent-bit/issues/2621 and https://github.com/fluent/fluent-bit/issues/2577. Please take a look; we can help validate a fix. cc @tomerleib
Hey folks, an update: 1.7.0 is out with higher performance; I will also look at some of the issues that @servo1x has mentioned.
Any updates?
Issues that could be related: #2830 #2950 #3006 #3039 #3014
So far, there are no more freezes on our side.
@edsiper We’re running into that same issue. I’m building the Docker image and deploying it here too and will try to report back. If anyone wants to quickly consume it, we’ve published it to our registry:
If you’re using the Helm charts, this should do:
@grep4error @avdhoot @tomerleib @mtparet @shirolimit @clebs
I’ve pushed a possible solution in the ch_manager branch; would you please build it and test it? If you confirm the fix is good to go, I can merge it into the 1.6 series this week, but I need your help with proper feedback.
About the solution: every output plugin instance now has its own channel to notify events through the event loop, so it no longer saturates the main engine ch_manager.
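For anyone curious how that maps to code, a rough sketch of the idea (not the actual patch; the ch_events field name and call sites are assumptions for illustration): each output instance registers its own channel in the engine event loop, and flb_output_return() writes its 8-byte status event to that per-instance pipe instead of config->ch_manager[1], so a burst of output completions can no longer fill the engine's control pipe.

    /* Sketch only: assumed per-instance ch_events[] field, modeled on the
     * existing mk_event_channel_create() call in flb_engine.c above. */
    struct mk_list *head;
    struct flb_output_instance *ins;
    int ret;

    mk_list_foreach(head, &config->outputs) {
        ins = mk_list_entry(head, struct flb_output_instance, _head);

        /* per-instance pipe, registered in the same engine event loop */
        ret = mk_event_channel_create(config->evl,
                                      &ins->ch_events[0],   /* read end  */
                                      &ins->ch_events[1],   /* write end */
                                      ins);
        if (ret != 0) {
            flb_error("[engine] could not create channel for output instance");
            return -1;
        }
    }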