fluent-bit: http_server stops listening shortly after start
Bug Report
Describe the bug fluent-bit stops listening on the http_server socket almost immediately after start when running on AWS EKS with Bottlerocket OS nodes. I was able to sample netstat output right after container start, with about 1 second between samples:
% kubectl -n logging exec -it fluent-bit-8gr46 -- sh
/ # netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:2020 0.0.0.0:* LISTEN 1/fluent-bit
/ # netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:2020 0.0.0.0:* LISTEN 1/fluent-bit
/ # netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
/ # ps aux
PID USER TIME COMMAND
1 root 0:00 /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf
19 root 0:00 sh
38 root 0:00 ps aux
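For completeness, the same transition can be caught with a small polling loop inside the container instead of manual sampling. This is only a sketch; the port (2020) and the roughly 1-second interval match the manual netstat samples above, and it uses the busybox netstat already available in the image:
# Poll the monitoring port once per second and print a timestamped status
while true; do
  if netstat -plnt 2>/dev/null | grep -q ':2020'; then
    echo "$(date -u +%H:%M:%S) listening"
  else
    echo "$(date -u +%H:%M:%S) NOT listening"
  fi
  sleep 1
done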
At the same time the rest of fluent-bit kept working: inputs, filters and outputs were processing just fine. I tried running with a dummy input and null output, with the same result. The logs don't say much about http_server:
Fluent Bit v1.7.5
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2021/05/20 14:26:29] [ info] Configuration:
[2021/05/20 14:26:29] [ info] flush time | 1.000000 seconds
[2021/05/20 14:26:29] [ info] grace | 5 seconds
[2021/05/20 14:26:29] [ info] daemon | 0
[2021/05/20 14:26:29] [ info] ___________
[2021/05/20 14:26:29] [ info] inputs:
[2021/05/20 14:26:29] [ info] dummy
[2021/05/20 14:26:29] [ info] ___________
[2021/05/20 14:26:29] [ info] filters:
[2021/05/20 14:26:29] [ info] ___________
[2021/05/20 14:26:29] [ info] outputs:
[2021/05/20 14:26:29] [ info] null.0
[2021/05/20 14:26:29] [ info] ___________
[2021/05/20 14:26:29] [ info] collectors:
[2021/05/20 14:26:29] [ info] [engine] started (pid=1)
[2021/05/20 14:26:29] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2021/05/20 14:26:29] [debug] [storage] [cio stream] new stream registered: dummy.0
[2021/05/20 14:26:29] [ info] [storage] version=1.1.1, initializing...
[2021/05/20 14:26:29] [ info] [storage] in-memory
[2021/05/20 14:26:29] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/05/20 14:26:29] [debug] [null:null.0] created event channels: read=18 write=19
[2021/05/20 14:26:29] [debug] [router] default match rule dummy.0:null.0
[2021/05/20 14:26:29] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/05/20 14:26:29] [ info] [sp] stream processor started
[2021/05/20 14:26:31] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:31] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:31] [debug] [out coro] cb_destroy coro_id=0
[2021/05/20 14:26:31] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:32] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:32] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:32] [debug] [out coro] cb_destroy coro_id=1
[2021/05/20 14:26:32] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:33] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:33] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:33] [debug] [out coro] cb_destroy coro_id=2
[2021/05/20 14:26:33] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:34] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:34] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:34] [debug] [out coro] cb_destroy coro_id=3
[2021/05/20 14:26:34] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:35] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:35] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:35] [debug] [out coro] cb_destroy coro_id=4
[2021/05/20 14:26:35] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:36] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:36] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:36] [debug] [out coro] cb_destroy coro_id=5
[2021/05/20 14:26:36] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:37] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:37] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:37] [debug] [out coro] cb_destroy coro_id=6
To Reproduce
- Launch AWS EKS with Bottlerocket OS nodes (https://docs.aws.amazon.com/eks/latest/userguide/launch-node-bottlerocket.html)
- Deploy the fluent-bit chart (https://github.com/fluent/helm-charts/tree/main/charts/fluent-bit); see the command sketch after this list
- Observe the fluent-bit pods go into CrashLoopBackOff because the liveness and readiness probes fail, since fluent-bit's http_server on port 2020 becomes unreachable.
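A minimal deployment along these lines might look like the following sketch; the release name, namespace, and label selector are assumptions rather than values taken from the report:
# Add the official fluent helm repository and install the chart with default values
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm upgrade --install fluent-bit fluent/fluent-bit --namespace logging --create-namespace

# Watch the pods; on Bottlerocket nodes they eventually enter CrashLoopBackOff
kubectl -n logging get pods -l app.kubernetes.io/name=fluent-bit -w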
Expected behavior http_server should remain available while fluent-bit is running
Your Environment
- Version used: fluent-bit 1.7.5
- Configuration: helm chart defaults, but using a dummy input and null output (an equivalent standalone config sketch follows this list)
- Environment name and version (e.g. Kubernetes? What version?): AWS EKS 1.20
- Server type and version: m5.xlarge
- Operating System and version: Latest Bottlerocket OS
- Filters and plugins: none
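For reference, an equivalent standalone configuration (dummy input, null output, http_server on port 2020) can be written out and run against the same image version. This is a sketch for comparison only; running it locally like this will not reproduce the Bottlerocket-specific behavior by itself:
# Write a minimal fluent-bit.conf matching the report: dummy input, null output, HTTP server on 2020
cat > fluent-bit.conf <<'EOF'
[SERVICE]
    Flush        1
    Log_Level    debug
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

[INPUT]
    Name  dummy

[OUTPUT]
    Name  null
    Match *
EOF

# Run the same image version with this config (the config path matches the ps output above)
docker run --rm -p 2020:2020 \
  -v "$(pwd)/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf" \
  fluent/fluent-bit:1.7.5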
Additional context As far as I can tell, the http_server issue happens only when running on Bottlerocket; nodes with Amazon Linux 2 run fluent-bit just fine.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 33 (7 by maintainers)
Is anyone looking into this issue?
We are also able to reproduce the issue, and we are also on Linux kernel 5.10.
But it is possible to work around it by moving the fluent-bit pods into the Guaranteed QoS class, i.e. setting the CPU and memory resource requests equal to the limits (see the sketch below).
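A sketch of that workaround via the chart's standard resources value; the specific CPU/memory numbers and the label selector are placeholders, not values from the thread:
# Equal requests and limits put the pod into the Guaranteed QoS class
helm upgrade --install fluent-bit fluent/fluent-bit -n logging \
  --set resources.requests.cpu=100m \
  --set resources.requests.memory=128Mi \
  --set resources.limits.cpu=100m \
  --set resources.limits.memory=128Mi

# Confirm the QoS class on a running pod
kubectl -n logging get pod -l app.kubernetes.io/name=fluent-bit \
  -o jsonpath='{.items[0].status.qosClass}{"\n"}'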
Fixed with https://github.com/bottlerocket-os/bottlerocket/releases/tag/v1.1.3
But this is kind of a workaround: it makes kubelet's cpuManagerPolicy: none the default. With cpuManagerPolicy: static, the issue still persists.
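To check which policy a given node's kubelet is actually running with, the kubelet's configz endpoint can be queried through the API server; the node name below is a placeholder:
# Replace NODE with an affected Bottlerocket node name (see: kubectl get nodes)
NODE=ip-192-168-0-1.eu-west-1.compute.internal
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/configz" | grep -o '"cpuManagerPolicy":"[^"]*"'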
Thanks @nokute78, when will v1.9.0 be released?
@TomasKohout Thank you for reporting.
I shared your comment. https://github.com/monkey/monkey/pull/354#issuecomment-1035563344
Can anyone test the branch below? It is v1.8.7 plus retrying epoll_wait when EINTR occurs. https://github.com/nokute78/fluent-bit/tree/epoll_debug
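Not authoritative, but one way to build and smoke-test that branch locally before rolling it into an image for an affected cluster, assuming the usual fluent-bit build dependencies (cmake, gcc, flex, bison) are installed:
# Build the patched branch from source
git clone -b epoll_debug https://github.com/nokute78/fluent-bit.git
cd fluent-bit/build
cmake ..
make

# Run it with the same minimal pipeline and the monitoring HTTP server enabled
./bin/fluent-bit -i dummy -o null -m '*' -H -P 2020 &

# Poll the HTTP endpoint and watch for failures
while true; do
  curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:2020/api/v1/metrics
  sleep 1
done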
This may not be related to fluent-bit itself; a related issue may be https://github.com/kubernetes/kubernetes/issues/104280.
Without https://github.com/kubernetes/kubernetes/pull/103746 we are currently not able to reproduce it anymore 😃
Looks to me like it is not related to the kernel, but to kubernetes and the runc dependency instead.
v1.21.3, v1.22.0
So I assume this issue was introduced into the kubernetes master branch by https://github.com/kubernetes/kubernetes/pull/103743
Looks like the Bottlerocket folks have done some investigation at https://github.com/bottlerocket-os/bottlerocket/issues/1628