aws-for-fluent-bit: [Datadog output] Version 2.29.0 Causing Task to stop
When updating to version 2.29.0 (previously 2.28.4) of aws-observability/aws-for-fluent-bit we are observing one of our task definitions entering a cycle of provisioning and de-provisioning.
We are running ECS with Fargate and aws-observability/aws-for-fluent-bit plus datadog/agent version 7.40.1 as sidecars.
We have not had an opportunity to look into the cause of this. Hopefully you can provide some insight into how we can debug this further. Our next step will likely be to set the FLB_LOG_LEVEL=debug environment variable and report back.
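For reference, a minimal sketch (not our exact task definition; the container name, tag, and other fields are illustrative) of where that environment variable goes on the FireLens log-router container:

```json
{
  "name": "log_router",
  "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:2.29.0",
  "essential": true,
  "firelensConfiguration": {
    "type": "fluentbit"
  },
  "environment": [
    { "name": "FLB_LOG_LEVEL", "value": "debug" }
  ]
}
```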
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 99
- Comments: 85 (38 by maintainers)
Oh, this made my day. Reverting to 2.28.4 helps.
Same error since 2.29.0:
[error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate…
[error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
Platform: ARM (Graviton)
Quick fix: switch to the stable tag.
Same here, also on ECS/Fargate with datadog-agent 7.40.1.
Same for us. We had an hour-long outage trying to diagnose this. Using stable fixed it for us. Same setup as everyone else with Fargate and Datadog.
Just wanted to add my own results… We had the issue of Fargate Tasks recycling over and over because the ‘aws-for-fluent-bit:2.29.0’ container was marked as essential and would crash with the ‘Cannot allocate memory’ line. We are configured to send logs to Datadog.
Just updated to 2.30.0 and all is well… container starts, forwards the logs, and Tasks have been healthy for over an hour.
Thank you for the information. It’s very helpful in isolating the issue to the following commit:
And also shows that the issue is not resolvable with:
The stack traces from the logs show a segfault within a network call; however, the memory could have been corrupted elsewhere and only triggered by network activity. After reading the code thoroughly and testing more, I’m still not sure what could be causing this corruption.
@matthewfala sorry, I don’t have any good logs, but the log container was exiting with code 139, which I believe means there was a segfault (139 = 128 + signal 11, SIGSEGV), and hence no related logs showed up in Datadog.
Apologies, it’s the weekend here in Australia; I’ll talk to our DevOps team about sharing the task definition on Monday.
We are experiencing the same problem. We are not using datadog-agent, but the Fluent Bit task seems to stop randomly after 15-60 minutes. Switching back to 2.28.4 did resolve the issue.
Hey @matthewfala - thanks for the arm64 build. I’ve been OoO today so I haven’t had an opportunity to test it yet, but I do plan to try it next week when I’m back in.
Is anyone available to help test out the following image with the datadog fix?
If that image still has problems, we made a set of images that progressively apply, on top of 2.28.4, the Fluent Bit commits made between 2.28.4 and 2.29.0. If we know which of the following images work or fail, it would greatly help us isolate the fault.
Here are some custom test images:
Thank you for everyone’s help in identifying and resolving this problem. We look forward to merging in the solution once we validate it.
Please test with this image in your pre-prod/test stage: public.ecr.aws/clay-cheng/amazon/aws-for-fluent-bit:2.29.0-datadog-revert. Thank you!
This image reverts the datadog fix in 1.9.10, which 2.29.0 is based on: https://github.com/fluent/fluent-bit/releases/tag/v1.9.10
That change is the only difference in the datadog code between 2.28.4 and 2.29.0.
Please note: this is not a fix; it is just a hypothesis that AWS engineers are testing. You can help us by testing it in your pre-prod/test stage. For prod, we recommend 2.28.4, since users in this issue have reported that it fixed their problems. However, at this time AWS does not have a root cause for this issue.
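For anyone pinning, a minimal sketch of the log-router container definition with the image tag pinned to 2.28.4 (container name and other fields are illustrative, not taken from any task definition shared in this issue):

```json
{
  "name": "log_router",
  "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:2.28.4",
  "essential": true,
  "firelensConfiguration": {
    "type": "fluentbit"
  }
}
```

Pinning an explicit version tag rather than stable or latest also means the task is not moved automatically if those tags are later re-pointed to a newer release.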
Is there any possibility you will promote 2.29.0 to stable while this is under investigation? We moved to the stable tag and are wondering if we should pin 2.28.4.
Are you all using the Fluent Bit datadog output? Can you please share your Fluent Bit configuration files?
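For context on what such a setup usually looks like (a generic sketch, not a configuration shared by anyone in this issue; the host, service, source, tags, and API-key value are placeholders): with FireLens, the datadog output is typically configured through the application container’s log configuration in the task definition, for example:

```json
{
  "name": "app",
  "image": "my-app:latest",
  "essential": true,
  "logConfiguration": {
    "logDriver": "awsfirelens",
    "options": {
      "Name": "datadog",
      "Host": "http-intake.logs.datadoghq.com",
      "TLS": "on",
      "apikey": "<DATADOG_API_KEY>",
      "dd_service": "my-service",
      "dd_source": "ecs",
      "dd_tags": "env:example",
      "provider": "ecs"
    }
  }
}
```

FireLens translates these options into an [OUTPUT] section for the out_datadog plugin in the generated Fluent Bit configuration.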
Same problem, also on Fargate with datadog-agent.
Same for us. We found this in our logs: