fluent-bit: http_server stops listening shortly after start
Bug Report
Describe the bug fluent-bit stops listening on the http_server socket almost immediately after start when running on AWS EKS with Bottlerocket OS nodes. I was able to sample netstat output right after container start, with about 1 second between samples:
% kubectl -n logging exec -it fluent-bit-8gr46 -- sh
/ # netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:2020 0.0.0.0:* LISTEN 1/fluent-bit
/ # netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:2020 0.0.0.0:* LISTEN 1/fluent-bit
/ # netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
/ # ps aux
PID USER TIME COMMAND
1 root 0:00 /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf
19 root 0:00 sh
38 root 0:00 ps aux
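For completeness, the same transition can be caught with a small polling loop inside the container instead of manual sampling. This is only a sketch; the port (2020) and the roughly 1-second interval match the manual netstat samples above, and it uses the busybox netstat already available in the image:
# Poll the monitoring port once per second and print a timestamped status
while true; do
  if netstat -plnt 2>/dev/null | grep -q ':2020'; then
    echo "$(date -u +%H:%M:%S) listening"
  else
    echo "$(date -u +%H:%M:%S) NOT listening"
  fi
  sleep 1
done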
At the same time the rest of fluent-bit kept working: inputs, filters and outputs were processing just fine. I tried running with a dummy input and null output, with the same result. The logs don't say much about http_server:
Fluent Bit v1.7.5
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2021/05/20 14:26:29] [ info] Configuration:
[2021/05/20 14:26:29] [ info] flush time | 1.000000 seconds
[2021/05/20 14:26:29] [ info] grace | 5 seconds
[2021/05/20 14:26:29] [ info] daemon | 0
[2021/05/20 14:26:29] [ info] ___________
[2021/05/20 14:26:29] [ info] inputs:
[2021/05/20 14:26:29] [ info] dummy
[2021/05/20 14:26:29] [ info] ___________
[2021/05/20 14:26:29] [ info] filters:
[2021/05/20 14:26:29] [ info] ___________
[2021/05/20 14:26:29] [ info] outputs:
[2021/05/20 14:26:29] [ info] null.0
[2021/05/20 14:26:29] [ info] ___________
[2021/05/20 14:26:29] [ info] collectors:
[2021/05/20 14:26:29] [ info] [engine] started (pid=1)
[2021/05/20 14:26:29] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2021/05/20 14:26:29] [debug] [storage] [cio stream] new stream registered: dummy.0
[2021/05/20 14:26:29] [ info] [storage] version=1.1.1, initializing...
[2021/05/20 14:26:29] [ info] [storage] in-memory
[2021/05/20 14:26:29] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/05/20 14:26:29] [debug] [null:null.0] created event channels: read=18 write=19
[2021/05/20 14:26:29] [debug] [router] default match rule dummy.0:null.0
[2021/05/20 14:26:29] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/05/20 14:26:29] [ info] [sp] stream processor started
[2021/05/20 14:26:31] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:31] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:31] [debug] [out coro] cb_destroy coro_id=0
[2021/05/20 14:26:31] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:32] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:32] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:32] [debug] [out coro] cb_destroy coro_id=1
[2021/05/20 14:26:32] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:33] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:33] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:33] [debug] [out coro] cb_destroy coro_id=2
[2021/05/20 14:26:33] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:34] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:34] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:34] [debug] [out coro] cb_destroy coro_id=3
[2021/05/20 14:26:34] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:35] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:35] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:35] [debug] [out coro] cb_destroy coro_id=4
[2021/05/20 14:26:35] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:36] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:36] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:36] [debug] [out coro] cb_destroy coro_id=5
[2021/05/20 14:26:36] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:37] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:37] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:37] [debug] [out coro] cb_destroy coro_id=6
To Reproduce
- Launch AWS EKS with Bottlerocket OS nodes (https://docs.aws.amazon.com/eks/latest/userguide/launch-node-bottlerocket.html)
- Deploy the fluent-bit chart (https://github.com/fluent/helm-charts/tree/main/charts/fluent-bit); see the command sketch after this list
- Observe the fluent-bit pods go into CrashLoopBackOff because the liveness and readiness probes fail, since fluent-bit's http_server on port 2020 becomes unreachable.
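A minimal deployment along these lines might look like the following sketch; the release name, namespace, and label selector are assumptions rather than values taken from the report:
# Add the official fluent helm repository and install the chart with default values
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm upgrade --install fluent-bit fluent/fluent-bit --namespace logging --create-namespace

# Watch the pods; on Bottlerocket nodes they eventually enter CrashLoopBackOff
kubectl -n logging get pods -l app.kubernetes.io/name=fluent-bit -w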
Expected behavior http_server should remain available while fluent-bit is running
Your Environment
- Version used: fluent-bit 1.7.5
- Configuration: helm chart defaults, but using a dummy input and null output (an equivalent standalone config sketch follows this list)
- Environment name and version (e.g. Kubernetes? What version?): AWS EKS 1.20
- Server type and version: m5.xlarge
- Operating System and version: Latest Bottlerocket OS
- Filters and plugins: none
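For reference, an equivalent standalone configuration (dummy input, null output, http_server on port 2020) can be written out and run against the same image version. This is a sketch for comparison only; running it locally like this will not reproduce the Bottlerocket-specific behavior by itself:
# Write a minimal fluent-bit.conf matching the report: dummy input, null output, HTTP server on 2020
cat > fluent-bit.conf <<'EOF'
[SERVICE]
    Flush        1
    Log_Level    debug
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

[INPUT]
    Name  dummy

[OUTPUT]
    Name  null
    Match *
EOF

# Run the same image version with this config (the config path matches the ps output above)
docker run --rm -p 2020:2020 \
  -v "$(pwd)/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf" \
  fluent/fluent-bit:1.7.5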
Additional context As far as I can tell, the http_server issue happens only when running on Bottlerocket; nodes with Amazon Linux 2 run fluent-bit just fine.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 33 (7 by maintainers)
Is anyone looking into this issue?
We are also able to reproduce the issue, and we are also on Linux kernel 5.10.
But it is possible to work around it by moving the fluent-bit pods into the Guaranteed QoS class, i.e. setting the CPU and memory resource requests equal to the limits (see the sketch below).
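A sketch of that workaround via the chart's standard resources value; the specific CPU/memory numbers and the label selector are placeholders, not values from the thread:
# Equal requests and limits put the pod into the Guaranteed QoS class
helm upgrade --install fluent-bit fluent/fluent-bit -n logging \
  --set resources.requests.cpu=100m \
  --set resources.requests.memory=128Mi \
  --set resources.limits.cpu=100m \
  --set resources.limits.memory=128Mi

# Confirm the QoS class on a running pod
kubectl -n logging get pod -l app.kubernetes.io/name=fluent-bit \
  -o jsonpath='{.items[0].status.qosClass}{"\n"}'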
Fixed with https://github.com/bottlerocket-os/bottlerocket/releases/tag/v1.1.3
But this is kind of a workaround: it makes kubelet's cpuManagerPolicy: none the default. With cpuManagerPolicy: static, the issue still persists.
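To check which policy a given node's kubelet is actually running with, the kubelet's configz endpoint can be queried through the API server; the node name below is a placeholder:
# Replace NODE with an affected Bottlerocket node name (see: kubectl get nodes)
NODE=ip-192-168-0-1.eu-west-1.compute.internal
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/configz" | grep -o '"cpuManagerPolicy":"[^"]*"'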
Thanks @nokute78, when will v1.9.0 be released?
@TomasKohout Thank you for reporting.
I shared your comment. https://github.com/monkey/monkey/pull/354#issuecomment-1035563344
Can anyone test the branch below? It is v1.8.7 plus retrying epoll_wait when EINTR occurs. https://github.com/nokute78/fluent-bit/tree/epoll_debug
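Not authoritative, but one way to build and smoke-test that branch locally before rolling it into an image for an affected cluster, assuming the usual fluent-bit build dependencies (cmake, gcc, flex, bison) are installed:
# Build the patched branch from source
git clone -b epoll_debug https://github.com/nokute78/fluent-bit.git
cd fluent-bit/build
cmake ..
make

# Run it with the same minimal pipeline and the monitoring HTTP server enabled
./bin/fluent-bit -i dummy -o null -m '*' -H -P 2020 &

# Poll the HTTP endpoint and watch for failures
while true; do
  curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:2020/api/v1/metrics
  sleep 1
done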
This may not be related to fluent-bit itself; a related issue may be https://github.com/kubernetes/kubernetes/issues/104280.
Without https://github.com/kubernetes/kubernetes/pull/103746 we are currently not able to reproduce it anymore 😃
Looks to me like it is not related to the kernel, but to kubernetes and the runc dependency instead.
v1.21.3, v1.22.0
So I assume this issue was introduced into the kubernetes master branch by https://github.com/kubernetes/kubernetes/pull/103743
Looks like the Bottlerocket folks have done some investigation at https://github.com/bottlerocket-os/bottlerocket/issues/1628