fluent-bit: Fluent-bit gets stuck after a few minutes on Kubernetes 1.22

Bug Report

Describe the bug

With the new Kubernetes 1.22, when we run fluent-bit as a Pod it gets stuck after a few minutes and stops working.

If we restart the Pod manually, it works again for another few minutes.

To Reproduce

I reproduced it with a really simple fluent-bit Pod and config.

I deployed a stock Kubernetes 1.22.1 cluster using kubeadm with Docker (and added Calico as the CNI).

Then I created a Pod and a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit
data:
  fluent-bit.conf: |-
    [SERVICE]
        HTTP_Server    On
        HTTP_Listen    0.0.0.0
        HTTP_PORT      2020
        Flush          1
        Daemon         Off
        Log_Level      debug
        Health_Check   On
    [INPUT]
        Name dummy
        Dummy {"top": {".dotted": "value"}}
    [OUTPUT]
        Name stdout

---

apiVersion: v1
kind: Pod
metadata:
  name: fluent-bit
spec:
  containers:
  - image: docker.io/fluent/fluent-bit:1.8.6
    imagePullPolicy: IfNotPresent
    name: fluent-bit-new
    ports:
    - containerPort: 2020
      name: http-metrics
    volumeMounts:
    - mountPath: /fluent-bit/etc
      name: config
  volumes:
  - configMap:
      name: fluent-bit
    name: config 

The fluent-bit configuration is really simple and is just here for testing.

Everything works well and I get the expected output from the Pod every second:

{"log":"[0] dummy.0: [1630923559.270831593, {\"top\"=\u003e{\".dotted\"=\u003e\"value\"}}]\n","stream":"stdout","time":"2021-09-06T10:19:20.270984285Z"}
{"log":"[0] dummy.0: [1630923560.270835685, {\"top\"=\u003e{\".dotted\"=\u003e\"value\"}}]\n","stream":"stdout","time":"2021-09-06T10:19:21.271003948Z"}

After a few minutes (~3-4 minutes in my tests) the Pod gets stuck and I do not get any more output.

Expected behavior

Fluent-bit should not get stuck after a few minutes.

Your Environment

  • Version used: v1.8.6 (but also tested with v1.8.4 and it’s the same)
  • Configuration:
    [SERVICE]
        HTTP_Server    On
        HTTP_Listen    0.0.0.0
        HTTP_PORT      2020
        Flush          1
        Daemon         Off
        Log_Level      debug
        Health_Check   On
    [INPUT]
        Name dummy
        Dummy {"top": {".dotted": "value"}}
    [OUTPUT]
        Name stdout
    
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.22.1
  • Server type and version: VM
  • Operating System and version: CentOs 7.9
  • Filters and plugins:

Additional context

Note that if I downgrade kubelet to 1.21.4, for example, it works well and the fluent-bit Pod does not get stuck.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 3
  • Comments: 20 (5 by maintainers)

Most upvoted comments

Note: https://github.com/fluent/fluent-bit/issues/4063#issuecomment-914463155

[pid  4781] write(32, "\203\245input\201\247dummy.0\202\247records\314\324\245byte"..., 158 <unfinished ...>

This data is the metrics data that can be fetched from /api/v1/metrics:

$ curl localhost:2020/api/v1/metrics 
{"input":{"dummy.0":{"records":3,"bytes":78}},"filter":{},"output":{"stdout.0":{"proc_records":0,"proc_bytes":0,"errors":0,"retries":0,"retries_failed":0,"dropped_records":0,"retried_records":0}}}

Its format is MessagePack: \203 (0x83) is a fixmap of size 3, and \245 (0xa5) is a fixstr of length 5 ("input"). The call chain that produces this write is: collect_metrics (creates the metrics data in MessagePack) -> flb_hs_push_pipeline_metrics -> mk_mq_send -> mk_fifo_send -> msg_write -> write (the call seen in the strace above).

records\314\324 decodes as a uint8 marker (0xcc) followed by the value 212 (0xd4).

Below is the same value rendered as JSON (truncated):

{"input":{"dummy.0":{"records":212, "byte

Every one of the repeated writes reports "records":212, i.e. the dummy plugin is no longer ingesting records.

[pid  4781] write(32, "\203\245input\201\247dummy.0\202\247records\314\324\245byte"..., 158 <unfinished ...>
[pid  4781] write(32, "\203\245input\201\247dummy.0\202\247records\314\324\245byte"..., 158 <unfinished ...>
[pid  4781] write(32, "\203\245input\201\247dummy.0\202\247records\314\324\245byte"..., 158 <unfinished ...>
[pid  4781] write(32, "\203\245input\201\247dummy.0\202\247records\314\324\245byte"..., 158 <unfinished ...>
[pid  4781] write(32, "\203\245input\201\247dummy.0\202\247records\314\324\245byte"..., 158 <unfinished ...>
[pid  4781] write(32, "\203\245input\201\247dummy.0\202\247records\314\324\245byte"..., 158 <unfinished ...>
[pid  4781] write(32, "\203\245input\201\247dummy.0\202\247records\314\324\245byte"..., 158 <unfinished ...>
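
To double-check the decoding above, here is a small stand-alone Python sketch (not fluent-bit code; the byte string is just the octal-escaped prefix copied from the strace line) that walks the payload and prints each MessagePack marker:

# Stand-alone sketch: decode the octal-escaped prefix of the strace payload.
# Handles only the three MessagePack markers that appear in it: fixmap, fixstr, uint8.
payload = b"\203\245input\201\247dummy.0\202\247records\314\324"

i = 0
while i < len(payload):
    marker = payload[i]
    if 0x80 <= marker <= 0x8f:          # fixmap: low nibble = number of key/value pairs
        print(f"fixmap({marker & 0x0f})")
        i += 1
    elif 0xa0 <= marker <= 0xbf:        # fixstr: low 5 bits = string length
        length = marker & 0x1f
        print(f"fixstr({length}) = {payload[i + 1:i + 1 + length].decode()!r}")
        i += 1 + length
    elif marker == 0xcc:                # uint8: next byte is the value
        print(f"uint8 = {payload[i + 1]}")
        i += 2
    else:
        raise ValueError(f"marker 0x{marker:02x} not handled in this sketch")

# prints:
#   fixmap(3)
#   fixstr(5) = 'input'
#   fixmap(1)
#   fixstr(7) = 'dummy.0'
#   fixmap(2)
#   fixstr(7) = 'records'
#   uint8 = 212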

After a bit more investigation it seems linked to HTTP_Server: when I disabled it, the Pod did not get stuck.
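
In case it helps anyone as a temporary workaround, this is the same test config with the embedded HTTP server turned off (which also removes the /api/v1 endpoints and the Health_Check):

[SERVICE]
    HTTP_Server    Off
    Flush          1
    Daemon         Off
    Log_Level      debug
[INPUT]
    Name dummy
    Dummy {"top": {".dotted": "value"}}
[OUTPUT]
    Name stdout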