aws-for-fluent-bit: OOMs/crashes when using fluent-bit 1.8.11

Describe the question/issue

We are seeing constant OOMKills after upgrading to fluent-bit@1.8.7 or later, most recently with fluent-bit@1.8.11.

These OOMs happen within minutes of the fluent-bit pod starting, and we see them even without significant log traffic: our app is only logging health checks and status messages.

It does not appear to be a slow memory leak: we do not see memory climb slowly until it reaches our pod limit. Rather, memory usage stays low and then the pod suddenly receives an OOMKill.

When we downgrade to fluent-bit@1.7.9 we do not see these OOMKills and fluent-bit runs fine.

If I remove the cloudwatch_logs output, I do not see the OOMKills.
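For reference, the output being toggled is the standard cloudwatch_logs plugin. A minimal sketch of such a stanza is below; the match pattern, region, log group, and stream prefix are placeholder values for illustration, not our actual settings (those are in the attached config under Configuration):

[OUTPUT]
    # placeholder values for illustration; our real settings are in cm.yaml.txt
    Name              cloudwatch_logs
    Match             kube.*
    region            us-east-1
    log_group_name    /eks/example/fluent-bit
    log_stream_prefix from-fluent-bit-
    auto_create_group On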

However, fluent-bit CPU usage is significantly higher on v1.8.11 than on v1.7.9, both with and without the cloudwatch_logs output:

1.7.9

~$ kubectl top pods -n logging --containers --use-protocol-buffers
POD                NAME         CPU(cores)   MEMORY(bytes)
fluent-bit-49gpm   fluent-bit   6m           17Mi
fluent-bit-59kcv   fluent-bit   7m           15Mi
fluent-bit-62wzg   fluent-bit   7m           12Mi
fluent-bit-djnbn   fluent-bit   6m           11Mi
fluent-bit-frfnx   fluent-bit   6m           15Mi
fluent-bit-tg7ng   fluent-bit   6m           12Mi
fluent-bit-zngds   fluent-bit   5m           16Mi

1.8.11

~$ kubectl top pods -n logging --containers --use-protocol-buffers
POD                NAME         CPU(cores)   MEMORY(bytes)
fluent-bit-czkhw   fluent-bit   944m         23Mi
fluent-bit-frfl7   fluent-bit   678m         16Mi
fluent-bit-nzz7m   fluent-bit   1075m        16Mi
fluent-bit-t26j4   fluent-bit   717m         26Mi

See https://github.com/fluent/fluent-bit/issues/4192 for more discussion

Logs sent by our system during a few-minute stretch can be found here: logs.log

Configuration

Config map: cm.yaml.txt
Pod: pod.yaml.txt
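The OOMKills themselves are triggered by the container memory limit in the DaemonSet pod spec. As a rough sketch (placeholder values, not a copy of the attached pod.yaml.txt), the relevant resources block looks like:

resources:
  limits:
    memory: 100Mi   # placeholder; the kernel OOM-kills the container once usage exceeds this limit
  requests:
    cpu: 100m       # placeholder
    memory: 50Mi    # placeholder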

Fluent Bit Log Output

fluent-bit debug logs are here: fluent-bit.log

Fluent Bit Version Info

v1.8.11

We do not see the issue on v1.7.9

Cluster Details

  • Do you use App Mesh or a service mesh? No

  • Do you use VPC endpoints in a network-restricted VPC? No

  • Is throttling from the destination part of the problem? Please note that occasional transient network connection errors are often caused by exceeding limits. For example, the CloudWatch API can block/drop Fluent Bit connections when throttling is triggered. Not sure, but it seems unlikely given how consistent the issue is and how low the log volume is.

  • ECS or EKS? EKS

  • Fargate or EC2? EC2

  • Daemon or Sidecar deployment for Fluent Bit? Daemon

Application Details

Steps to reproduce issue

  1. Run fluent-bit@1.8.11 with the provided config and wait
  2. Observe OOMKills on the fluent-bit pod (the kill reason can be confirmed as sketched below)
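One way to confirm the kill reason (a sketch, assuming the DaemonSet runs in the logging namespace; substitute an actual pod name):

kubectl get pod <fluent-bit-pod> -n logging \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# prints OOMKilled when the container was killed for exceeding its memory limit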

Related Issues

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 27 (15 by maintainers)

Most upvoted comments

@dylanlingelbach Yeah, we think this problem began somewhere in the 1.8.x series.