aws-for-fluent-bit: OOMs/crashes when using fluent-bit 1.8.11

Describe the question/issue

We are seeing constant OOMKills after upgrading to fluent-bit@1.8.7 or later, most recently with fluent-bit@1.8.11.

These OOMs happen within minutes of the fluent-bit pod starting, and we see them even without significant log traffic: our app is only logging health checks and status messages.

It does not appear to be a slow memory leak: we do not see memory climb slowly until it reaches our pod limit. Rather, memory usage stays low and then the pod suddenly receives an OOMKill.

When we downgrade to fluent-bit@1.7.9 we do not see these OOMKills and fluent-bit runs fine.

If I remove the cloudwatch_logs output, I do not see the OOMKills.
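For reference, the output being toggled is the standard cloudwatch_logs plugin. A minimal sketch of such a stanza is below; the match pattern, region, log group, and stream prefix are placeholder values for illustration, not our actual settings (those are in the attached config under Configuration):

[OUTPUT]
    # placeholder values for illustration; our real settings are in cm.yaml.txt
    Name              cloudwatch_logs
    Match             kube.*
    region            us-east-1
    log_group_name    /eks/example/fluent-bit
    log_stream_prefix from-fluent-bit-
    auto_create_group On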

However, fluent-bit CPU usage is significantly higher on v1.8.11 than on v1.7.9, both with and without the cloudwatch_logs output:

1.7.9

~$ kubectl top pods -n logging --containers --use-protocol-buffers
POD                NAME         CPU(cores)   MEMORY(bytes)
fluent-bit-49gpm   fluent-bit   6m           17Mi
fluent-bit-59kcv   fluent-bit   7m           15Mi
fluent-bit-62wzg   fluent-bit   7m           12Mi
fluent-bit-djnbn   fluent-bit   6m           11Mi
fluent-bit-frfnx   fluent-bit   6m           15Mi
fluent-bit-tg7ng   fluent-bit   6m           12Mi
fluent-bit-zngds   fluent-bit   5m           16Mi

1.8.11

~$ kubectl top pods -n logging --containers --use-protocol-buffers
POD                NAME         CPU(cores)   MEMORY(bytes)
fluent-bit-czkhw   fluent-bit   944m         23Mi
fluent-bit-frfl7   fluent-bit   678m         16Mi
fluent-bit-nzz7m   fluent-bit   1075m        16Mi
fluent-bit-t26j4   fluent-bit   717m         26Mi

See https://github.com/fluent/fluent-bit/issues/4192 for more discussion

Logs sent by our system during a few-minute stretch can be found here: logs.log

Configuration

Config map: cm.yaml.txt
Pod: pod.yaml.txt
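The OOMKills themselves are triggered by the container memory limit in the DaemonSet pod spec. As a rough sketch (placeholder values, not a copy of the attached pod.yaml.txt), the relevant resources block looks like:

resources:
  limits:
    memory: 100Mi   # placeholder; the kernel OOM-kills the container once usage exceeds this limit
  requests:
    cpu: 100m       # placeholder
    memory: 50Mi    # placeholder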

Fluent Bit Log Output

fluent-bit debug logs are here: fluent-bit.log

Fluent Bit Version Info

v1.8.11

We do not see the issue on v1.7.9

Cluster Details

  • Do you use App Mesh or a service mesh? No

  • Do you use VPC endpoints in a network-restricted VPC? No

  • Is throttling from the destination part of the problem? Please note that occasional transient network connection errors are often caused by exceeding limits. For example, the CloudWatch API can block/drop Fluent Bit connections when throttling is triggered. Not sure, but it seems unlikely given how consistent the issue is and how low the log volume is.

  • ECS or EKS? EKS

  • Fargate or EC2? EC2

  • Daemon or Sidecar deployment for Fluent Bit? Daemon

Application Details

Steps to reproduce issue

  1. Run fluent-bit@1.8.11 with the provided config and wait
  2. Observe OOMKills on the fluent-bit pod (the kill reason can be confirmed as sketched below)
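One way to confirm the kill reason (a sketch, assuming the DaemonSet runs in the logging namespace; substitute an actual pod name):

kubectl get pod <fluent-bit-pod> -n logging \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# prints OOMKilled when the container was killed for exceeding its memory limit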

Related Issues

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 27 (15 by maintainers)

Most upvoted comments

@dylanlingelbach Yeah, we think this problem began somewhere in the 1.8.x series.