aws-for-fluent-bit: Seeing `Timeout while contacting DNS servers` with latest v2.19.1

šŸ‘‹

I'm seeing a stream of errors like:

[2021/09/01 21:46:21] [ warn] [net] getaddrinfo(host='http-intake.logs.datadoghq.com', err=12): Timeout while contacting DNS servers

when my ECS tasks have picked up the latest SHA.

I can confirm, by SSHing to the EC2 ECS Docker host and running docker exec into the log router container, that the container can resolve http-intake.logs.datadoghq.com via DNS very quickly and can get data back from that server.
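Roughly the check I ran, as a sketch (the container name filter and the tools available inside the image are assumptions, so adjust to your setup):

    # On the EC2 container instance: find the log router container
    docker ps --filter "name=log_router" --format "{{.ID}}  {{.Names}}"

    # Resolve the Datadog intake host from inside that container
    # (getent is usually present even when nslookup/dig are not)
    docker exec -it <container-id> getent hosts http-intake.logs.datadoghq.com

    # Confirm data comes back from the endpoint
    docker exec -it <container-id> curl -sv https://http-intake.logs.datadoghq.com -o /dev/null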

~~Are there any published tags in public.ecr.aws/aws-observability/aws-for-fluent-bit for previous releases I could pin to, to mitigate the issue in the short term?~~ Ah, I see the https://github.com/aws/aws-for-fluent-bit#versioning-faq section; I'll pin to 2.19.0 for now.

Thank you!

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 67
  • Comments: 31 (14 by maintainers)

Most upvoted comments

I can confirm that 2.19.0 works fine, the issue is in 2.19.1.

I also hit this issue with the latest fluent bit; 2.19.0 works well.

Just want to share, for folks who might just be joining here and finding this later: this repo actually maintains a :stable Docker tag that gets some solid vetting. The official AWS docs all tell us to use :latest, but if you're making a change in your infra to "pin" so you can roll back, I'd actually recommend using the :stable tag instead of pinning to 2.19.0. (See https://github.com/aws/aws-for-fluent-bit/blob/mainline/README.md#using-the-stable-tag)

We found that it was a fair bit of work to roll out to all our infra, and we don't love having to repeat that, so we ended up switching from :latest to :stable as a one-stop shop.
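For anyone wondering where that change lands, here is a minimal sketch of the FireLens log router entry in an ECS task definition (the container name is the conventional one and the surrounding fields are illustrative; to pin instead, swap :stable for an explicit tag such as 2.19.0):

    {
      "name": "log_router",
      "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
      "essential": true,
      "firelensConfiguration": {
        "type": "fluentbit"
      }
    }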

Fluent Bit 1.8.7 has fixed this issue: https://fluentbit.io/announcements/v1.8.7/

We need to release it in AWS for Fluent Bit.

We discussed this with upstream yesterday. Upstream has identified the root cause: if a plugin does not implement the config_map interface, the plugin's net_setup is not initialized. A fix has been provided: output: initialize network defaults for output instances - fluent/fluent-bit#4050 (comment).

Once upstream has released the patch, AWS will provide a new release.

Hi @jagnk @tai-acall,

Sorry for the wait. We were busy with another release, which upgraded the Go version in our image. That was done yesterday, and I will work on a new release today to include Fluent Bit 1.8.7.

Thanks for your patience. I will let you all know when a new image is available.

InvalidParameterException: Log event too large: 262146 bytes exceeds limit of 262144\n"

@Funkerman1992 Please open another issue for this. This looks like a bug; somehow our calculation of the event size in the payload is wrong.

If you can, please provide full details in that issue on your config, and how to repro the error.

@Funkerman1992

[ warn] [input] tail.0 paused (mem buf overlimit)

This means you're producing logs faster than fluent bit can read them and store them in its buffer. So you need to increase Mem_buf_limit: https://docs.fluentbit.io/manual/administration/buffering-and-storage
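As an illustration (the path and the limit value here are made up; the right limit depends on your log volume), a tail input with a raised memory buffer limit looks roughly like this. The linked page also covers filesystem buffering if you'd rather spill to disk than raise the in-memory limit:

    [INPUT]
        Name             tail
        Path             /var/log/containers/*.log
        Mem_Buf_Limit    50MB
        Skip_Long_Lines  On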

chunk '1-1633129507.213824356.flb' cannot be retried

This is a failed retry. If it fails to retry a chunk, then some of your logs will be lost. There should be other errors in your logs, which explain why the retry failed.

Hi all,

aws-for-fluent-bit 2.20.0 is out: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.20.0. It includes Fluent Bit 1.8.7 and should fix this issue. Please try the latest image to see if your problem has been resolved. Thanks for your patience!
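If you want to double-check what a given image ships before rolling it out, something like this should work (a sketch, assuming the standard Fluent Bit binary path inside the image):

    # Pull the new release and print the bundled Fluent Bit version
    docker pull public.ecr.aws/aws-observability/aws-for-fluent-bit:2.20.0
    docker run --rm --entrypoint /fluent-bit/bin/fluent-bit \
        public.ecr.aws/aws-observability/aws-for-fluent-bit:2.20.0 --version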

@PettitWesley @tai-acall you could easily fix it by using the stable docker image instead of latest.

@PettitWesley Same issue here @jagnk; it breaks our ECS infrastructure.

A release ASAP would be a huge help.

Hi @PettitWesley, any idea when the new version is going to be bumped? This is affecting our ability to have logs in production.

We hit this issue as well. I can confirm that pinning to 2.19.0 does fix it.

Thanks for spending time with us @chrisgray-vertex.

It seems like a DNS issue in our upstream. @magichair, could you please open an issue in our upstream, https://github.com/fluent/fluent-bit, so the upstream maintainers notice this? We will also talk to them about it. Thanks!