aws-for-fluent-bit: [all versions] exec input SIGSEGV/crash due to uninitialized memory [fix in 2.31.12]

Describe the question/issue

Hi folks,

We’ve based our EKS logging infrastructure on aws-for-fluent-bit. Recently, however, we’ve noticed that some Fluent Bit pods crash with a SIGSEGV on startup and go into a CrashLoopBackOff on deployment. Redeploying reproduces the problem on the very same physical hosts, while pods on other hosts in the same cluster run fine. The EKS workers are configured identically, so it’s a head scratcher why this happens persistently on a handful of seemingly random nodes while the majority of the pods run fine.

If we let the pod retries run long enough on a host, it will eventually succeed, but that can take anywhere from an hour to a day, which is unacceptable for a production environment.

Configuration

Please find the ConfigMap for aws-for-fluent-bit attached; it contains the Fluent Bit config file as well. Note that the DaemonSet runs in its own namespace (“logging”), which is attached to a dedicated service account, “fb-service-account”.

aws-for-fluent-bit-conf.txt
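
For orientation only (the attachment above has our real settings), a generic exec-input configuration has roughly the following shape; the command, tag, and output here are placeholders rather than anything from our actual config:

# hypothetical minimal config for illustration; not our production settings
cat > ./fluent-bit-example.conf <<'EOF'
[SERVICE]
    Log_Level    debug

[INPUT]
    Name         exec
    Tag          node.metrics
    Command      echo hello
    Interval_Sec 60

[OUTPUT]
    Name         stdout
    Match        *
EOF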

Fluent Bit Log Output

DebugLog.txt

Fluent Bit Version Info

AWS For Fluent Bit Image: 2.31.11, though we’ve also seen the same behavior with 2.31.10 and 2.31.6. For debugging, we used debug-2.31.11.

Cluster Details

EKS cluster: K8s version 1.25 (v1.25.9-eks-0a21954), though we’ve also noticed this on earlier versions.
Instance types: mostly r6gd.16xlarge, with the occasional r5d.16xlarge.
We base our internal AMIs on the following images:

  • arm64 - ami-0aa7aa4c87fe47ff6
  • amd64 - ami-071432800334eb200

Application Details

This occurred on an essentially idle cluster with no applications running. Fluent Bit crashes immediately on startup, so load wasn’t a factor.

Steps to reproduce issue

We reviewed the suggestions provided in:

We generated the debug log as follows:

  • on a machine where FB was repeatedly crashing, we copied the Fluent Bit config files over to /home/myname/etc
  • docker run -it --entrypoint=/bin/bash --ulimit core=-1 -v /var/log:/var/log -v /home/sacharya:/cores -v $(pwd):/fluent-bit/etc public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11
  • Note that while the image name is the same,
  • once in the container shell, we had to install some debuginfo packages to get rid of gdb warnings:
yum -y install yum-utils
debuginfo-install bzip2-libs-1.0.6-13.amzn2.0.3.aarch64 cyrus-sasl-lib-2.1.26-24.amzn2.aarch64 elfutils-libelf-0.176-2.amzn2.aarch64 elfutils-libs-0.176-2.amzn2.aarch64 keyutils-libs-1.5.8-3.amzn2.0.2.aarch64 krb5-libs-1.15.1-55.amzn2.2.5.aarch64 libcap-2.54-1.amzn2.0.1.aarch64 libcom_err-1.42.9-19.amzn2.0.1.aarch64 libcrypt-2.26-63.amzn2.aarch64 libgcc-7.3.1-15.amzn2.aarch64 libgcrypt-1.5.3-14.amzn2.0.3.aarch64 libgpg-error-1.12-3.amzn2.0.3.aarch64 libselinux-2.5-12.amzn2.0.2.aarch64 libyaml-0.1.4-11.amzn2.0.2.aarch64 lz4-1.7.5-2.amzn2.0.1.aarch64 pcre-8.32-17.amzn2.0.2.aarch64 systemd-libs-219-78.amzn2.0.22.aarch64 xz-libs-5.2.2-1.amzn2.0.3.aarch64 zlib-1.2.7-19.amzn2.0.2.aarch64

yum remove openssl-debuginfo-1:1.0.2k-24.amzn2.0.6.aarch64
debuginfo-install openssl11-libs-1.1.1g-12.amzn2.0.13.aarch64
  • and then, finally:
export FLB_LOG_LEVEL=debug
gdb /fluent-bit/bin/fluent-bit
r -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -c /fluent-bit/etc/fluent-bit.conf
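
For anyone else reproducing this, here is a sketch of pulling a stacktrace back out of the resulting core once the crash happens. It assumes the node’s kernel.core_pattern writes a plain core file into the process’s working directory (some hosts pipe cores to systemd-coredump instead), and core.<pid> is a placeholder for whatever filename the kernel actually produced:

# confirm core files may be written (the docker run above already passes --ulimit core=-1)
ulimit -c

# read the core back with the exact binary that produced it
gdb -batch -ex 'bt full' -ex 'thread apply all bt' /fluent-bit/bin/fluent-bit core.<pid>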

Note that we were able to run FB manually via Docker without issue on the other hosts in the cluster, where FB pods had been launched successfully by our usual (Terraform/Helm-based) deployment process.

Related Issues


Most upvoted comments

@SoamA Sorry we’ve been having issues with our release automation… I’m working on getting it out ASAP.

@SoamA thanks!! I’m taking a look…

I just uploaded a tarball, fluent-bit-exec-and-core.tar, to the usual place. It contains a core file and the Fluent Bit binary used to generate that core. Let me know if you have more luck reading this one!

@SoamA here is my update on this investigation for today. I have a number of issues I am working on concurrently, so thank you for continuing to work with me on this.

Still unable to read new core files - need export of binary

Output is similar to here: https://github.com/aws/aws-for-fluent-bit/issues/661#issuecomment-1563473815

I need the exact binary used to produce the core in order to read it.

If you could export the image or binary that you are using and upload it in the ticket, that’d help.

One option would be to export the image:
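
For instance, a plain docker save of the debug image would look roughly like this (sketch; the output filename is arbitrary):

docker save -o aws-for-fluent-bit-debug-2.31.11.tar public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11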

Alternatively, you can use the steps I ran here to grab the binary from the running image: https://github.com/aws/aws-for-fluent-bit/issues/661#issuecomment-1563473815
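
One way to do that is to create a container from the image and copy the binary out with docker cp; the container name and output path below are arbitrary placeholders:

docker create --name flb-debug public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11
docker cp flb-debug:/fluent-bit/bin/fluent-bit ./fluent-bit
docker rm flb-debug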

Then please upload it to your ticket if you can. Thanks!

Commentary on Debugging with GDB - we need a core from the crash

BTW, not sure if this will be useful for you or not given how this messes with entry points, but I opened a shell into a debug-2.31.11 image and set up and ran Fluent Bit via gdb exactly as in https://github.com/aws/aws-for-fluent-bit/issues/661#issue-1720938084. This time, however, I used backtrace in gdb after it crashed to see if it would generate a stack trace:

Can you please give me the exact steps and commands you followed? I am confused: was the core stacktrace obtained with a shell into the image (because then it couldn’t have crashed yet)?

With cores and gdb, there are two main ways of debugging, which we cover here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/tutorials/remote-core-dump/README.md

  1. Live Fluent Bit: You can generate a core or a stacktrace for the current state of the program. This is useful when we suspect some sort of hang or deadlock. However, if Fluent Bit later crashes, a core generated before the crash is unlikely to provide value.
  2. Crashed Fluent Bit: If it crashes with SIGSEGV, SIGBUS, or SIGABRT and dumps a core, then we can get a stacktrace that may explain why it crashed. However, AFAIK GDB can only analyze the core if it has the exact binary that produced it (see the sketch below).
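
A minimal sketch of both modes (paths, PIDs, and core filenames are placeholders; the tutorial linked above is the authoritative walkthrough):

# 1. live process: attach and snapshot its current state into a core file
gdb -p "$(pidof fluent-bit)" -batch -ex 'generate-core-file /cores/fluent-bit-live.core'

# 2. crashed process: load the dumped core together with the exact binary that crashed,
#    then run "bt full" at the (gdb) prompt
gdb /fluent-bit/bin/fluent-bit /cores/core.<pid>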

Hope that helps!