aws-for-fluent-bit: Seeing `Timeout while contacting DNS servers` with latest v2.19.1

šŸ‘‹

I'm seeing a stream of errors like:

[2021/09/01 21:46:21] [ warn] [net] getaddrinfo(host='http-intake.logs.datadoghq.com', err=12): Timeout while contacting DNS servers

when my ECS tasks have picked up the latest SHA.

I can confirm, by SSHing to the EC2 ECS Docker host and running docker exec into the log router container, that the container can resolve http-intake.logs.datadoghq.com via DNS very quickly and can get data back from that server.
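Roughly the check I ran, as a sketch (the container name filter and the tools available inside the image are assumptions, so adjust to your setup):

    # On the EC2 container instance: find the log router container
    docker ps --filter "name=log_router" --format "{{.ID}}  {{.Names}}"

    # Resolve the Datadog intake host from inside that container
    # (getent is usually present even when nslookup/dig are not)
    docker exec -it <container-id> getent hosts http-intake.logs.datadoghq.com

    # Confirm data comes back from the endpoint
    docker exec -it <container-id> curl -sv https://http-intake.logs.datadoghq.com -o /dev/null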

~~Are there any published tags in public.ecr.aws/aws-observability/aws-for-fluent-bit for previous releases I could pin to, to mitigate the issue in the short term?~~ Ah, I see the https://github.com/aws/aws-for-fluent-bit#versioning-faq section; I'll pin to 2.19.0 for now.

Thank you!

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 67
  • Comments: 31 (14 by maintainers)

Most upvoted comments

I can confirm that 2.19.0 works fine, the issue is in 2.19.1.

I also hit this issue with the latest fluent bit; 2.19.0 works well.

Just want to share, for folks who might just be joining here and finding this later: this repo actually maintains a :stable Docker tag that gets some solid vetting. The official AWS docs all tell us to use :latest, but if you're making a change in your infra to "pin" so you can roll back, I'd actually recommend using the :stable tag instead of pinning to 2.19.0. (See https://github.com/aws/aws-for-fluent-bit/blob/mainline/README.md#using-the-stable-tag)

We found that it was a fair bit of work to roll out to all our infra, and we don't love having to repeat that, so we ended up switching from :latest to :stable as a one-stop shop.
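For anyone wondering where that change lands, here is a minimal sketch of the FireLens log router entry in an ECS task definition (the container name is the conventional one and the surrounding fields are illustrative; to pin instead, swap :stable for an explicit tag such as 2.19.0):

    {
      "name": "log_router",
      "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
      "essential": true,
      "firelensConfiguration": {
        "type": "fluentbit"
      }
    }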

Fluent Bit 1.8.7 has fixed this issue: https://fluentbit.io/announcements/v1.8.7/

We need to release it in AWS for Fluent Bit.

We discussed this with upstream yesterday. Upstream has identified the root cause: if a plugin does not implement the config_map interface, the plugin's net_setup is not initialized. A fix has been provided: output: initialize network defaults for output instances - fluent/fluent-bit#4050 (comment).

Once upstream has released the patch, AWS will provide a new release.

Hi @jagnk @tai-acall,

Sorry for the wait. We were busy with another release, which upgraded the Go version in our image. That was done yesterday, and I will work on a new release today to include Fluent Bit 1.8.7.

Thanks for your patience. I will let you all know when a new image is available.

InvalidParameterException: Log event too large: 262146 bytes exceeds limit of 262144\n"

@Funkerman1992 Please open another issue for this. This looks like a bug; somehow our calculation of the event size in the payload is wrong.

If you can, please provide full details in that issue on your config, and how to repro the error.

@Funkerman1992

[ warn] [input] tail.0 paused (mem buf overlimit)

This means you're producing logs faster than fluent bit can read them and store them in its buffer. So you need to increase Mem_buf_limit: https://docs.fluentbit.io/manual/administration/buffering-and-storage
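As an illustration (the path and the limit value here are made up; the right limit depends on your log volume), a tail input with a raised memory buffer limit looks roughly like this. The linked page also covers filesystem buffering if you'd rather spill to disk than raise the in-memory limit:

    [INPUT]
        Name             tail
        Path             /var/log/containers/*.log
        Mem_Buf_Limit    50MB
        Skip_Long_Lines  On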

chunk '1-1633129507.213824356.flb' cannot be retried

This is a failed retry. If it fails to retry a chunk, then some of your logs will be lost. There should be other errors in your logs, which explain why the retry failed.

Hi all,

aws-for-fluent-bit 2.20.0 is out: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.20.0. It includes Fluent Bit 1.8.7 and should fix this issue. Please try the latest image to see if your problem has been resolved. Thanks for your patience!
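If you want to double-check what a given image ships before rolling it out, something like this should work (a sketch, assuming the standard Fluent Bit binary path inside the image):

    # Pull the new release and print the bundled Fluent Bit version
    docker pull public.ecr.aws/aws-observability/aws-for-fluent-bit:2.20.0
    docker run --rm --entrypoint /fluent-bit/bin/fluent-bit \
        public.ecr.aws/aws-observability/aws-for-fluent-bit:2.20.0 --version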

@PettitWesley @tai-acall you could easily fix it by using the stable docker image instead of latest.

@PettitWesley Same issue here @jagnk; it breaks our ECS infrastructure.

A release ASAP would be a huge help.

Hi @PettitWesley, any idea when the new version is going to be bumped? This is affecting our ability to have logs in production.

We hit this issue as well. I can confirm that pinning to 2.19.0 does fix it.

Thanks for spending time with us @chrisgray-vertex.

It seems like a DNS issue in our upstream. @magichair, could you please open an issue in our upstream, https://github.com/fluent/fluent-bit, so the upstream maintainers notice this? We will also talk to them about it. Thanks!