fluent-bit: Fluent-bit stops processing logs under high load, blocks writing to the ch_manager pipe
Bug Report
My fluent-bit 1.5.7 runs in a container in a Kubernetes (AKS) environment. It is configured to collect Docker logs (33 tail inputs), run them through a few filters, and send them to Elasticsearch (33 outputs). Recently, as the volume of logs per node increased, fluent-bit started sporadically freezing: the process keeps running, consumes 0% CPU, and processes neither new logs nor the filesystem-storage backlog, yet it still responds to monitoring queries on its HTTP port. After some debugging with strace and gdb, I found that it hangs while attempting to write to the ch_manager pipe. Here is the stack trace:
0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1 0x000055a127ecad01 in flb_output_return (ret=1, th=0x7fac548cd240) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:545
#2 0x000055a127ecade9 in flb_output_return_do (x=1) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:576
#3 0x000055a127eccca1 in cb_es_flush (data=0x7fac5991e0b2, bytes=3842,
tag=0x7fac5437fd20 "kube.gws.gws-platform-datacollector-blue-6c474d7c84-wsdcw.gws-platform-datacollector-blue.2db22adbd3daaaf836a3f6311f4b3e5ad9ec7727280458ac68868419fb758ab9", tag_len=154,
ins=0x7fac5949e480, out_context=0x7fac56c6bc00, config=0x7fac59432c80) at /tmp/fluent-bit/plugins/out_es/es.c:748
#4 0x000055a127e72649 in output_pre_cb_flush () at /tmp/fluent-bit/include/fluent-bit/flb_output.h:449
#5 0x000055a1282a6907 in co_init () at /tmp/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6 0x3039663238613165 in ?? ()
strace output (filtered to reads from fd 13 and writes to fd 14, which are the ch_manager pipe fds):
...
write(14, "\0\200\1\20\2\0\0\0", 8) = 8
read(13, "\0\0\36\20\2\0\0\0", 8) = 8
read(13, "\0\0\27\20\2\0\0\0", 8) = 8
read(13, "\0\300\37\20\2\0\0\0", 8) = 8
write(14, "\0\200\25\20\2\0\0\0", 8) = 8
read(13, "\0\200\17\20\2\0\0\0", 8) = 8
write(14, "\0\200\17\20\2\0\0\0", 8) = 8
write(14, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0\300\37\20\2\0\0\0", 8) = 8
read(13, "\0\0\33\20\2\0\0\0", 8) = 8
read(13, "\0\300\r\20\2\0\0\0", 8) = 8
write(14, "\0\0\36\20\2\0\0\0", 8) = 8
read(13, "\0\300\24\20\2\0\0\0", 8) = 8
write(14, "\0\300\r\20\2\0\0\0", 8) = 8
write(14, "\0\300\24\20\2\0\0\0", 8) = 8
read(13, "\0\200\1\20\2\0\0\0", 8) = 8
write(14, "\0\0\33\20\2\0\0\0", 8) = 8
read(13, "\0\200\25\20\2\0\0\0", 8) = 8
read(13, "\0\200\17\20\2\0\0\0", 8) = 8
write(14, "\0\200\17\20\2\0\0\0", 8) = 8
read(13, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0@\16\20\2\0\0\0", 8) = 8
write(14, "\0\200\1\20\2\0\0\0", 8) = 8
read(13, "\0\300\37\20\2\0\0\0", 8) = 8
read(13, "\0\0\36\20\2\0\0\0", 8) = 8
write(14, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0\0\36\20\2\0\0\0", 8) = 8
write(14, "\0\300\37\20\2\0\0\0", 8) = 8
read(13, "\0\300\r\20\2\0\0\0", 8) = 8
read(13, "\0\300\24\20\2\0\0\0", 8) = 8
read(13, "\0\0\33\20\2\0\0\0", 8) = 8
write(14, "\0\300\r\20\2\0\0\0", 8) = 8
read(13, "\0\200\17\20\2\0\0\0", 8) = 8
read(13, "\0@\16\20\2\0\0\0", 8) = 8
read(13, "\0\200\1\20\2\0\0\0", 8) = 8
write(14, "\0\300\24\20\2\0\0\0", 8) = 8
write(14, "\0\200\1\20\2\0\0\0", 8) = 8
write(14, "\0@\16\20\2\0\0\0", 8) = 8
read(13, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0\0\33\20\2\0\0\0", 8) = 8
read(13, "\0\0\36\20\2\0\0\0", 8) = 8
read(13, "\0\300\37\20\2\0\0\0", 8) = 8
write(14, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0\0\36\20\2\0\0\0", 8) = 8
read(13, "\0\300\r\20\2\0\0\0", 8) = 8
write(14, "\0\300\37\20\2\0\0\0", 8) = 8
read(13, "\0\300\24\20\2\0\0\0", 8) = 8
read(13, "\0\200\1\20\2\0\0\0", 8) = 8
read(13, "\0@\16\20\2\0\0\0", 8) = 8
write(14, "\0\200\1\20\2\0\0\0", 8) = 8
write(14, "\0\300\r\20\2\0\0\0", 8) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
It looks like the Elasticsearch outputs may post so many responses for the inputs at the same time that the pipe fills up and write() blocks. But the inputs run in the same thread, so they can never read those responses back out of the pipe, and fluent-bit locks up.
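To make the failure mode concrete, here is a minimal standalone sketch (not fluent-bit code) of a single thread that is both the only reader and a writer of a pipe carrying 8-byte event messages: once the pipe buffer is full, a blocking write() can never return, because the only thread that could drain the pipe is the one stuck inside write(). The demo uses O_NONBLOCK so it reports the capacity instead of actually hanging:

    /* Standalone demo, not fluent-bit code: one thread is both the only
     * reader and a writer of a pipe carrying 8-byte event messages. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        uint64_t event = 0;
        long queued = 0;

        if (pipe(fds) != 0) {
            return 1;
        }

        /* Probe with O_NONBLOCK so the demo reports the capacity instead of
         * hanging; with the default blocking fd this loop would simply stop
         * returning once the buffer is full, exactly like the stuck write()
         * in the stack trace above. */
        fcntl(fds[1], F_SETFL, O_NONBLOCK);

        while (write(fds[1], &event, sizeof(event)) == (ssize_t) sizeof(event)) {
            queued++;
            event++;
        }
        if (errno == EAGAIN) {
            printf("pipe filled after %ld 8-byte events\n", queued);
        }
        return 0;
    }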
I produced a dirty fix by making the ch_manager pipe non-blocking. I also tried enlarging the pipe (or at least querying its size), but ioctl failed to get or set the pipe size. See the snippet below; I added the last line.
flb_engine.c:

    /*
     * Create a communication channel: this routine creates a channel to
     * signal the Engine event loop. It's useful to stop the event loop
     * or to instruct anything else without break.
     */
    ret = mk_event_channel_create(config->evl,
                                  &config->ch_manager[0],
                                  &config->ch_manager[1],
                                  config);
    if (ret != 0) {
        flb_error("[engine] could not create manager channels");
        return -1;
    }

    flb_pipe_set_nonblocking(&config->ch_manager[1]); /* <----- I made it non-blocking ------- */
There is probably a cleaner way to fix it, but this one-liner worked for me. I now get an occasional “resource not available” error in the log, but fluent-bit survives and keeps crunching logs.
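For reference, a minimal sketch of what the workaround looks like in plain POSIX/Linux terms, outside of fluent-bit's flb_pipe wrappers (the tune_pipe helper and its name are just for illustration). Note that on Linux the pipe buffer is queried and resized with fcntl(F_GETPIPE_SZ / F_SETPIPE_SZ) rather than ioctl, and enlarging it may require CAP_SYS_RESOURCE once the per-user soft limit is exceeded, which is likely why resizing failed inside the unprivileged container:

    #define _GNU_SOURCE          /* for F_GETPIPE_SZ / F_SETPIPE_SZ */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* fd would be the write end of the ch_manager pipe (config->ch_manager[1]) */
    static int tune_pipe(int fd)
    {
        /* 1. The one-liner workaround: never block the engine thread in write().
         *    Writers must then be prepared to handle EAGAIN. */
        int flags = fcntl(fd, F_GETFL);
        if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
            perror("O_NONBLOCK");
            return -1;
        }

        /* 2. Querying/growing the pipe: on Linux this is fcntl, not ioctl. */
        printf("current pipe buffer: %d bytes\n", fcntl(fd, F_GETPIPE_SZ));

        /* May fail with EPERM in an unprivileged container once the per-user
         * soft limit is exceeded, unless CAP_SYS_RESOURCE is granted. */
        if (fcntl(fd, F_SETPIPE_SZ, 65536) < 0) {
            perror("F_SETPIPE_SZ");
        }
        return 0;
    }

    int main(void)
    {
        int fds[2];
        if (pipe(fds) != 0) {
            return 1;
        }
        return tune_pipe(fds[1]);
    }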
Environment
- Version used: 1.5.7 (container fluent/fluent-bit:1.5.7)
- Kubernetes: 1.16.13
- Docker: 3.0.10+azure
- OS: Ubuntu 16.04.1
About this issue
- State: open
- Created 4 years ago
- Reactions: 20
- Comments: 62 (18 by maintainers)
Commits related to this issue
- output: plugins now uses an independent channel for events (#2661) Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> — committed to fluent/fluent-bit by edsiper 4 years ago
- engine: manage output envents from indepentent channel (#2661) Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> — committed to fluent/fluent-bit by edsiper 4 years ago
- engine: manage output events from independent channel (#2661) Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> — committed to fluent/fluent-bit by edsiper 4 years ago
The issue is fixed by upgrading fluent-bit from v1.8 to v1.8.1. Upgrading to v1.7.9 also worked.
Seeing this issue as well. Any updates on this? Is there any information we can capture and provide to help get this issue moving toward a fix? It looks like it has been around for several releases and makes fluent-bit unreliable under high load, and unpredictably so. We are seeing this in multiple Kubernetes environments, and there is no viable workaround for detecting it that does not risk missing logs.
I believe we have been hitting this too. On a deployment of roughly 700 nodes across around 40 clusters, we were seeing around 8% of pods getting stuck. Attaching a debugger to the stuck instances showed a similar stack trace:
If we detach, wait, and reattach, it is still stuck there.
We think this may only be affecting nodes with a large number of tracked files, a large log volume, or both.
When we terminate the stuck pod and it gets replaced, the replacement is very likely to get stuck again within the next ten to fifteen minutes. We assume that this is because it lands on a node hosting particularly busy pods, or particularly many pods, but we are not sure.
We have tested the CAP_SYS_RESOURCE workaround on a subset of affected clusters and it does appear to remove the problem.
This is our current configuration:
I’ll note that the Mem_Buf_Limit is set very high, but that was an earlier attempt to fix the problem before we understood what was going wrong. The newrelic output plugin is this: https://github.com/newrelic/newrelic-fluent-bit-output
Some of the relevant environment variables:
The stack trace above is from 1.8.9 with Amazon’s patches, which we had been trying in case they fixed anything, but we have been having this problem with 1.8.12 too.
I am currently running into this issue with file buffering turned on, and I might have a theory about why some folks are hitting it.
This issue may or may not manifest depending on the pipe buffer size. You can end up with one of three pipe buffer sizes, depending on which Linux kernel you are running and whether you are hitting the kernel’s pipe buffer soft limits:
- The default pipe buffer size on Linux is 65536 bytes. Since the maximum number of tasks is 2048, this size easily handles all task events without the pipe ever blocking.
- If you are hitting the kernel’s pipe buffer soft limits, the pipe size might instead be 4096 bytes (https://github.com/torvalds/linux/blob/v4.18/fs/pipe.c#L642) or 8192 bytes (https://github.com/torvalds/linux/blob/v5.15/fs/pipe.c#L797). On CentOS/RHEL 7/8 it would be 4096 bytes, while newer kernels use 8192 bytes. Since each task event is 8 bytes, such a pipe can only buffer 512 or 1024 pending task events and has no room for the current maximum of 2048 tasks, so adding too many tasks/coroutines at once blocks the pipe on writing.
As a temporary workaround, you could add the CAP_SYS_RESOURCE capability to the fluent-bit Docker container, as that ensures the pipe buffer size is at least 65536 bytes.
A proper solution would be to cap the number of tasks that can run based on the pipe’s buffer size.
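A rough sketch of that idea (this is not actual fluent-bit code; ENGINE_MAX_TASKS and the helper name are made up for the example): query the channel's real buffer size at startup and derive how many in-flight tasks it can absorb, given that each completion event written back through the pipe is 8 bytes:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define ENGINE_MAX_TASKS 2048   /* assumed name for the 2048-task limit */

    /* Hypothetical helper: cap concurrent tasks to what the ch_manager pipe
     * can buffer, so a burst of completion events can never block write(). */
    static int engine_task_capacity(int ch_manager_write_fd)
    {
        int pipe_sz = fcntl(ch_manager_write_fd, F_GETPIPE_SZ);
        if (pipe_sz <= 0) {
            return ENGINE_MAX_TASKS;                   /* fall back to the default */
        }
        int events = pipe_sz / (int) sizeof(uint64_t); /* 8 bytes per task event */
        return events < ENGINE_MAX_TASKS ? events : ENGINE_MAX_TASKS;
    }

    int main(void)
    {
        int ch[2];
        if (pipe(ch) != 0) {
            return 1;
        }
        /* 65536-byte pipe -> capped at 2048; 8192 -> 1024; 4096 -> 512 */
        printf("task capacity for this pipe: %d\n", engine_task_capacity(ch[1]));
        return 0;
    }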
Awesome, thanks @grep4error for all the help. @edsiper, this is affecting more people: https://github.com/fluent/fluent-bit/issues/2621 and https://github.com/fluent/fluent-bit/issues/2577. Please take a look; we can help validate a fix. cc @tomerleib
Hey folks, an update: 1.7.0 is out with higher performance; I will also look at some of the issues that @servo1x has mentioned.
Any updates?
Issues that could be related: #2830 #2950 #3006 #3039 #3014
So far, there are no more freezes on our side.
@edsiper We’re running into that same issue. I’m building the Docker image and deploying it here too and will try to report back. If anyone wants to quickly consume it, we’ve published it to our registry:
If you’re using the Helm charts, this should do:
@grep4error @avdhoot @tomerleib @mtparet @shirolimit @clebs
I’ve pushed a possible solution in the ch_manager branch; would you please build it and test it? If you confirm the fix is good to go, I can merge it into the 1.6 series this week, but I need your help with proper feedback.
About the solution: every output plugin instance now has its own channel to notify events through the event loop, so it no longer saturates the main engine ch_manager.
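For anyone curious how that maps to code, a rough sketch of the idea (not the actual patch; the ch_events field name and call sites are assumptions for illustration): each output instance registers its own channel in the engine event loop, and flb_output_return() writes its 8-byte status event to that per-instance pipe instead of config->ch_manager[1], so a burst of output completions can no longer fill the engine's control pipe.

    /* Sketch only: assumed per-instance ch_events[] field, modeled on the
     * existing mk_event_channel_create() call in flb_engine.c above. */
    struct mk_list *head;
    struct flb_output_instance *ins;
    int ret;

    mk_list_foreach(head, &config->outputs) {
        ins = mk_list_entry(head, struct flb_output_instance, _head);

        /* per-instance pipe, registered in the same engine event loop */
        ret = mk_event_channel_create(config->evl,
                                      &ins->ch_events[0],   /* read end  */
                                      &ins->ch_events[1],   /* write end */
                                      ins);
        if (ret != 0) {
            flb_error("[engine] could not create channel for output instance");
            return -1;
        }
    }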