aws-for-fluent-bit: [Datadog output] Version 2.29.0 Causing Task to stop

When updating to version 2.29.0 (previously 2.28.4) of aws-observability/aws-for-fluent-bit, we are observing tasks from one of our task definitions entering a cycle of provisioning and de-provisioning.

We are running ECS with Fargate and aws-observability/aws-for-fluent-bit plus datadog/agent version 7.40.1 as sidecars.

We have not had an opportunity to look into the cause of this. Hopefully, you can provide some insights into how we can debug this further. Our next steps will likely be to try the FLB_LOG_LEVEL=debug environment variable and report back.
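For anyone else who wants to try the same debugging step, the log level is controlled by an environment variable on the log-router container in the ECS task definition. A minimal sketch (container name, image tag, and FireLens settings here are illustrative, not our exact definition):

```json
{
  "name": "log-router",
  "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:2.28.4",
  "essential": true,
  "environment": [
    { "name": "FLB_LOG_LEVEL", "value": "debug" }
  ],
  "firelensConfiguration": {
    "type": "fluentbit"
  }
}
```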

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 99
  • Comments: 85 (38 by maintainers)

Most upvoted comments

Oh, this made my day. Reverting to 2.28.4 helps.

Same error since 2.29.0:

[error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate…
[error] [src/flb_sds.c:109 errno=12] Cannot allocate memory

Platform: arm64 (Graviton)

Quick fix: switch to the stable tag.

Same here, also on ECS/Fargate with datadog-agent 7.40.1.

Same for us. We had an hour-long outage trying to diagnose this. Using the stable tag fixed it for us. Same setup as everyone else: Fargate and Datadog.

Just wanted to add my own results… We had the issue of Fargate Tasks recycling over and over because the ‘aws-for-fluent-bit:2.29.0’ container was marked as essential and would crash with the ‘Cannot allocate memory’ line. We are configured to send logs to DataDog.

Just updated to 2.30.0 and all is well… container starts, forwards the logs, and Tasks have been healthy for over an hour.

1/27/2023, 11:19:02 AM	AWS for Fluent Bit Container Image Version 2.30.0	log-router
1/27/2023, 11:19:02 AM	Fluent Bit v1.9.10	log-router
1/27/2023, 11:19:02 AM	* Copyright (C) 2015-2022 The Fluent Bit Authors	log-router
1/27/2023, 11:19:02 AM	* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd	log-router
1/27/2023, 11:19:02 AM	* https://fluentbit.io	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [fluent bit] version=1.9.10, commit=6345dd7422, pid=1	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [cmetrics] version=0.3.7	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [output:null:null.0] worker #0 started	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [sp] stream processor started	log-router

Thank you for the information. It’s very helpful in isolating the issue to the following commit:

  • out_datadog: fix/add error handling for all flb_sds calls

It also shows that the issue is not resolved by:

  • datadog: resolve tag buffer resize bug

The stack traces from the logs show a segfault within a network call; however, the memory could have been corrupted elsewhere and only triggered a crash on network activity. After reading the code thoroughly and testing more, I’m still not sure what could be causing this corruption.

@matthewfala sorry I don’t have any good logs, but the log container was throwing exit code 139, which I believe means there was a segfault (and hence no related logs showed up in datadog)
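(For anyone decoding the exit status: by the common shell convention, an exit code above 128 means the process was killed by signal `code - 128`, and SIGSEGV is signal 11 on Linux, so 139 does correspond to a segfault. A quick Python check of the arithmetic:)

```python
import signal

# Exit codes > 128 conventionally mean "terminated by signal (code - 128)".
exit_code = 139
sig = signal.Signals(exit_code - 128)
print(sig.name)  # SIGSEGV
```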

Apologies it’s the weekend here in Australia, I’ll talk to our DevOps about sharing the task definition on Monday.

We are experiencing the same problem. We are not using the datadog-agent, but the Fluent Bit task seems to stop randomly after 15-60 minutes. Switching back to 2.28.4 resolved the issue.

Hey @matthewfala - thanks for the arm64 build. I’ve been OoO today so I haven’t had an opportunity to test it yet, but I do plan to try it next week when I’m back in.

Is anyone available to help test out the following image with the datadog fix?

2.29.0-datadogfix : 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:2.29.0-datadogfix

If that image still has problems, we made a set of images that progressively apply, on top of 2.28.4, the Fluent Bit commits made between 2.28.4 and 2.29.0. If we know which of the following images work or fail, it would greatly help us isolate the fault.

Here are some custom test images:

2.28.4-and-cwfix : 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:2.28.4-and-cwfix
2.28.4-addall-noaws-addcwfix : 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:2.28.4-addall-noaws-addcwfix
2.28.4-addall-nodatadog-noforward : 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:2.28.4-addall-nodatadog-noforward
2.29.0-datadogfix : 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:2.29.0-datadogfix

Thank you for everyone’s help in identifying and resolving this problem. We look forward to merging in the solution once we validate it.

Please test with this image public.ecr.aws/clay-cheng/amazon/aws-for-fluent-bit:2.29.0-datadog-revert in your pre-prod/test stage. Thank you!

This image reverts the datadog fix in 1.9.10 which 2.29.0 is based on: https://github.com/fluent/fluent-bit/releases/tag/v1.9.10

That change is the only difference in the datadog code between 2.28.4 and 2.29.0.

Please note: This is not a fix, it is just a hypothesis that AWS engineers are testing. You can help us by testing in your pre-prod/test stage. For prod, we recommend 2.28.4 since users in this issue have reported it fixed their problems. However, at this time AWS does not have a root cause for this issue.

Is there any possibility you will promote 2.29.0 to stable while this is under investigation? We moved to the stable tag and are wondering if we should pin 2.28.4.

Are you all using the Fluent Bit datadog output? Can you please share your Fluent Bit configuration files?
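(For anyone unsure what is being asked for, a minimal Fluent Bit datadog output section looks roughly like the following. This is a generic sketch from the Fluent Bit docs, not any reporter's actual config; the `Host`/`TLS` values are the usual defaults, and the `apikey` and `dd_*` values are placeholders:)

```ini
[OUTPUT]
    Name        datadog
    Match       *
    Host        http-intake.logs.datadoghq.com
    TLS         on
    compress    gzip
    apikey      <DATADOG_API_KEY>
    dd_service  my-service
    dd_source   fluentbit
    dd_tags     env:prod
```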

Same problem, also on Fargate with datadog-agent.

Same for us. We retrieved this in our logs:

AWS for Fluent Bit Container Image Version 2.29.0
--
Fluent Bit v1.9.10
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/12/07 10:10:47] [ info] [fluent bit] version=1.9.10, commit=760956f50c, pid=1
[2022/12/07 10:10:47] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/12/07 10:10:47] [ info] [cmetrics] version=0.3.7
[2022/12/07 10:10:47] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877
[2022/12/07 10:10:47] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock
[2022/12/07 10:10:47] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224
[2022/12/07 10:10:47] [ info] [output:null:null.0] worker #0 started
[2022/12/07 10:10:47] [ info] [sp] stream processor started
[2022/12/07 10:11:00] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
[2022/12/07 10:11:00] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
[2022/12/07 10:11:01] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
[2022/12/07 10:11:01] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
[2022/12/07 10:11:02] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
[2022/12/07 10:11:02] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
[2022/12/07 10:11:32] [engine] caught signal (SIGSEGV)