aws-for-fluent-bit: OOMs/crashes when using fluent-bit 1.8.11
Describe the question/issue
We are seeing constant OOMKills after upgrading to fluent-bit@1.8.7 or later - most recently tried on fluent-bit@1.8.11.
These OOMs happen within minutes of the fluent-bit pod starting and we see them without significant log traffic - our app is just logging health checks and status messages.
It does not appear to be a slow memory leak - we do not see memory climb slowly and then reach our pod limit. Rather, memory is low when the pod receives an OOMKill.
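A quick way to confirm the kills are memory-limit OOM kills (rather than fluent-bit crashing on its own) is to check the last terminated state of one of the restarted pods - for example, using one of the pod names from the tables below:

~$ kubectl describe pod fluent-bit-czkhw -n logging | grep -A 6 "Last State"
~$ kubectl get pod fluent-bit-czkhw -n logging \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

The lastState.terminated.reason shows up as OOMKilled for these pods.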
When downgrading to fluent-bit@1.7.9 we do not see these OOMKills and fluent-bit runs fine.
If I remove the cloudwatch_logs output, I do not see the OOMKills.
However, fluent-bit CPU usage is significantly higher on v1.8.11 than on v1.7.9, both with and without the cloudwatch_logs output:
1.7.9
~$ kubectl top pods -n logging --containers --use-protocol-buffers
POD                NAME         CPU(cores)   MEMORY(bytes)
fluent-bit-49gpm   fluent-bit   6m           17Mi
fluent-bit-59kcv   fluent-bit   7m           15Mi
fluent-bit-62wzg   fluent-bit   7m           12Mi
fluent-bit-djnbn   fluent-bit   6m           11Mi
fluent-bit-frfnx   fluent-bit   6m           15Mi
fluent-bit-tg7ng   fluent-bit   6m           12Mi
fluent-bit-zngds   fluent-bit   5m           16Mi
1.8.11
~$ kubectl top pods -n logging --containers --use-protocol-buffers
POD                NAME         CPU(cores)   MEMORY(bytes)
fluent-bit-czkhw   fluent-bit   944m         23Mi
fluent-bit-frfl7   fluent-bit   678m         16Mi
fluent-bit-nzz7m   fluent-bit   1075m        16Mi
fluent-bit-t26j4   fluent-bit   717m         26Mi
See https://github.com/fluent/fluent-bit/issues/4192 for more discussion
Logs sent by our system during a few-minute stretch can be found here: logs.log
Configuration
Config map: cm.yaml.txt
Pod: pod.yaml.txt
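For anyone reading without downloading the attachments: the relevant output section of the config map is shaped roughly like the sketch below. The region, log group, and stream prefix values here are placeholders, not our real settings - the exact configuration is in cm.yaml.txt.

[OUTPUT]
    Name              cloudwatch_logs
    Match             kube.*
    region            us-east-1
    log_group_name    /eks/example-cluster/fluent-bit
    log_stream_prefix fluent-bit-
    auto_create_group On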
Fluent Bit Log Output
fluent-bit debug logs are here: fluent-bit.log
Fluent Bit Version Info
v1.8.11
We do not see the issue on v1.7.9
Cluster Details
- Do you use App Mesh or a service mesh? No
- Do you use VPC endpoints in a network-restricted VPC? No
- Is throttling from the destination part of the problem? Please note that occasional transient network connection errors are often caused by exceeding limits. For example, the CW API can block/drop Fluent Bit connections when throttling is triggered. Not sure, but it seems unlikely given how consistent the OOMKills are and how low the log volume is (a quick way to check the logs for throttling errors is sketched after this list).
- ECS or EKS? EKS
- Fargate or EC2? EC2
- Daemon or Sidecar deployment for Fluent Bit? Daemon
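The throttling check mentioned above is just grepping the fluent-bit container logs for CloudWatch throttling errors; the exact error strings depend on the API responses, so the pattern below is only a starting point:

~$ kubectl logs fluent-bit-czkhw -n logging | grep -iE "throttl|rate exceeded"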
Application Details
Steps to reproduce issue
- Run fluent-bit@1.8.11 with the provided config and wait - OOMKills occur on the fluent-bit pod (commands to watch for them are sketched below)
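To observe the failures while reproducing, watching the pods and their memory at the same time is enough - restarts show up within minutes while reported memory stays low:

~$ kubectl get pods -n logging -w
~$ watch kubectl top pods -n logging --containers --use-protocol-buffers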
Related Issues
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 27 (15 by maintainers)
@dylanlingelbach Yea we think that somewhere in the 1.8.x series this problem began