fluent-bit: http_server stops listening almost immediately after start

Bug Report

Describe the bug fluent-bit stops listening on the http_server socket almost immediately after start when running on AWS EKS with Bottlerocket OS on the nodes. I was able to sample netstat output at container start; there is about 1 second between samplings:

% kubectl -n logging exec -it fluent-bit-8gr46 -- sh
/ # netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:2020            0.0.0.0:*               LISTEN      1/fluent-bit
/ # netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:2020            0.0.0.0:*               LISTEN      1/fluent-bit
/ # netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf
   19 root      0:00 sh
   38 root      0:00 ps aux
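
For completeness, the monitoring endpoint can also be probed directly from inside the pod instead of via netstat. This is only a sketch: it assumes busybox wget is available in the image (the sh/netstat/ps session above suggests it is) and uses fluent-bit's /api/v1/uptime endpoint:

/ # wget -qO- http://127.0.0.1:2020/api/v1/uptime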

At the same time the rest of fluent-bit was working okay: inputs, filters and outputs were processing just fine. I tried to run with a dummy input and a null output, and the result was the same. The logs don't say much about http_server:

Fluent Bit v1.7.5
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/05/20 14:26:29] [ info] Configuration:
[2021/05/20 14:26:29] [ info]  flush time     | 1.000000 seconds
[2021/05/20 14:26:29] [ info]  grace          | 5 seconds
[2021/05/20 14:26:29] [ info]  daemon         | 0
[2021/05/20 14:26:29] [ info] ___________
[2021/05/20 14:26:29] [ info]  inputs:
[2021/05/20 14:26:29] [ info]      dummy
[2021/05/20 14:26:29] [ info] ___________
[2021/05/20 14:26:29] [ info]  filters:
[2021/05/20 14:26:29] [ info] ___________
[2021/05/20 14:26:29] [ info]  outputs:
[2021/05/20 14:26:29] [ info]      null.0
[2021/05/20 14:26:29] [ info] ___________
[2021/05/20 14:26:29] [ info]  collectors:
[2021/05/20 14:26:29] [ info] [engine] started (pid=1)
[2021/05/20 14:26:29] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2021/05/20 14:26:29] [debug] [storage] [cio stream] new stream registered: dummy.0
[2021/05/20 14:26:29] [ info] [storage] version=1.1.1, initializing...
[2021/05/20 14:26:29] [ info] [storage] in-memory
[2021/05/20 14:26:29] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/05/20 14:26:29] [debug] [null:null.0] created event channels: read=18 write=19
[2021/05/20 14:26:29] [debug] [router] default match rule dummy.0:null.0
[2021/05/20 14:26:29] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/05/20 14:26:29] [ info] [sp] stream processor started
[2021/05/20 14:26:31] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:31] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:31] [debug] [out coro] cb_destroy coro_id=0
[2021/05/20 14:26:31] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:32] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:32] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:32] [debug] [out coro] cb_destroy coro_id=1
[2021/05/20 14:26:32] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:33] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:33] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:33] [debug] [out coro] cb_destroy coro_id=2
[2021/05/20 14:26:33] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:34] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:34] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:34] [debug] [out coro] cb_destroy coro_id=3
[2021/05/20 14:26:34] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:35] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:35] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:35] [debug] [out coro] cb_destroy coro_id=4
[2021/05/20 14:26:35] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:36] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:36] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:36] [debug] [out coro] cb_destroy coro_id=5
[2021/05/20 14:26:36] [debug] [task] destroy task=0x7f9823c371e0 (task_id=0)
[2021/05/20 14:26:37] [debug] [task] created task=0x7f9823c371e0 id=0 OK
[2021/05/20 14:26:37] [debug] [output:null:null.0] discarding 26 bytes
[2021/05/20 14:26:37] [debug] [out coro] cb_destroy coro_id=6

To Reproduce

Expected behavior http_server should remain available while fluent-bit is running

Your Environment

  • Version used: fluent-bit 1.7.5
  • Configuration: helm chart defaults, but using a dummy input and a null output (a rough command-line equivalent is sketched right after this list)
  • Environment name and version (e.g. Kubernetes? What version?): AWS EKS 1.20
  • Server type and version: m5.xlarge
  • Operating System and version: Latest Bottlerocket OS
  • Filters and plugins: none
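
A minimal command-line equivalent of this setup, for anyone trying to reproduce it outside the chart, might look like the sketch below. It is not the exact deployment (the real pods run from the helm chart's generated config file), and the -H/-P monitoring-server flags are assumed from fluent-bit's CLI help:

/fluent-bit/bin/fluent-bit -i dummy -o null -H -P 2020 -vv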

Additional context As far as I can tell, the issue with http_server happens only when running on Bottlerocket; nodes with Amazon Linux 2 run fluent-bit just fine.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 4
  • Comments: 33 (7 by maintainers)

Most upvoted comments

Is anyone looking into this issue?

We are also able to reproduce the issue and are also on Linux kernel 5.10.

But it is possible to work around it by getting the fluent-bit pods into the Guaranteed QoS class, i.e. by setting the resource requests for CPU and memory equal to the limits.
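
For example (a sketch only, assuming the fluent-bit helm chart exposes the usual resources block in its values; the actual request/limit values are arbitrary):

helm upgrade --install fluent-bit fluent/fluent-bit -n logging \
  --set resources.requests.cpu=100m \
  --set resources.requests.memory=128Mi \
  --set resources.limits.cpu=100m \
  --set resources.limits.memory=128Mi

Whether the pods actually landed in the Guaranteed class can then be checked with:

kubectl -n logging get pod fluent-bit-8gr46 -o jsonpath='{.status.qosClass}'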

Fixed with https://github.com/bottlerocket-os/bottlerocket/releases/tag/v1.1.3

But this is kind of a workaround, since it just makes the kubelet's cpuManagerPolicy: none the default. With cpuManagerPolicy: static, the issue still persists.
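
To see which CPU manager policy a node's kubelet is actually running with, the kubelet configz endpoint can be queried through the API server (a sketch; <node-name> is a placeholder):

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | grep -o '"cpuManagerPolicy":"[^"]*"'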

Thanks @nokute78, when will v1.9.0 be released?

Can anyone test the branch below? It is v1.8.7 plus retrying epoll_wait when EINTR occurs. https://github.com/nokute78/fluent-bit/tree/epoll_debug

git clone git@github.com:nokute78/fluent-bit.git
cd fluent-bit
git switch epoll_debug
sudo docker build -f 'dockerfiles/Dockerfile.x86_64-master' .
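
To actually try that build on a cluster, the image can be tagged and pushed to a registry the nodes can pull from (a sketch; <your-registry> is a placeholder):

sudo docker build -t <your-registry>/fluent-bit:epoll-debug -f 'dockerfiles/Dockerfile.x86_64-master' .
sudo docker push <your-registry>/fluent-bit:epoll-debug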

This may not be related to fluent-bit itself; a related issue may be https://github.com/kubernetes/kubernetes/issues/104280.

Without https://github.com/kubernetes/kubernetes/pull/103746 we are currently not able to reproduce it anymore 😃

It looks to me like it is not related to the kernel, but to Kubernetes and the runc dependency instead.

So I assume this issue got introduced for the master Kubernetes branch in https://github.com/kubernetes/kubernetes/pull/103743

Looks like the Bottlerocket folks have done some investigation at https://github.com/bottlerocket-os/bottlerocket/issues/1628