fluent-bit: DNS resolution timeout/failure in >= 1.8.5

Bug Report

Describe the bug Hi, I am facing a DNS resolution timeout/failure since upgrading to >= 1.8.5 with the forward output plugin sending to a fluentd instance. It works fine with 1.8.4. I am running on Ubuntu 20.04, and the local resolver accepts both UDP and TCP requests. I tried setting net.dns.mode UDP (see the sketch below), but it changed nothing. I am guessing there might be an issue with 1.8.5 and the changes to the DNS resolution library. I still get the same error when setting the upstream to www.google.com.
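
For reference, a minimal sketch of that attempted workaround, assuming the per-output net.dns.mode networking option (the host and match values are placeholders matching the redacted config below):

[OUTPUT]
    Name  forward
    Match das.scanner.*
    Host  fluentd.example.org
    Port  24224
    tls   on
    # Force DNS lookups over UDP (TCP is the other accepted value);
    # neither mode avoided the err=12 timeouts on >= 1.8.5 here
    net.dns.mode UDP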

To Reproduce I have replaced the real fluentd hostname with fluentd.example.org in this log:

[2021/09/03 11:27:49] [ info] [engine] started (pid=2320367)
[2021/09/03 11:27:49] [ info] [storage] version=1.1.1, initializing...
[2021/09/03 11:27:49] [ info] [storage] root path '/var/td-agent-bit/storage'
[2021/09/03 11:27:49] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/09/03 11:27:49] [ info] [storage] backlog input plugin: storage_backlog.2
[2021/09/03 11:27:49] [ info] [cmetrics] version=0.2.1
[2021/09/03 11:27:49] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 95.4M
[2021/09/03 11:27:55] [ info] [http_server] listen iface=127.0.0.1 tcp_port=2020
[2021/09/03 11:27:55] [ info] [sp] stream processor started
[2021/09/03 11:27:55] [ info] [input:tail:tail_proftpd_log] inotify_fs_add(): inode=516227 watch_fd=1 name=/var/log/proftpd/commandsAsJson.log
[2021/09/03 11:27:55] [ info] [input:tail:tail_history_log] inotify_fs_add(): inode=260310 watch_fd=1 name=/var/td-agent-bit/input/commandsAsJson.history.log
[2021/09/03 11:30:34] [ warn] [net] getaddrinfo(host='fluentd.example.org', err=12): Timeout while contacting DNS servers
[2021/09/03 11:30:34] [error] [output:forward:forward_to_fluentd] no upstream connections available
[2021/09/03 11:30:34] [ warn] [engine] failed to flush chunk '2320367-1630661431.54171690.flb', retry in 10 seconds: task_id=0, input=tail_proftpd_log > output=forward_to_fluentd (out_id=0)
  • Steps to reproduce the problem:
# Output all logs to fluentd instances
[OUTPUT]
    Name forward
    Alias forward_to_fluentd
    Match das.scanner.*
    Upstream upstream.conf
    Retry_Limit False
    tls on
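
# upstream.conf (separate file, referenced by the Upstream key above)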
[UPSTREAM]
    name    forward-balancing
[NODE]
    name    fluentd
    host    fluentd.example.org
    port    24224
    tls     on

Expected behavior Messages should be sent to the upstream fluentd service.

Your Environment

  • Version used: Failed in 1.8.5 and 1.8.6. Works in 1.8.4
  • Server type and version: Public cloud VM
  • Operating System and version: Ubuntu 20.04
  • Filters and plugins: grep and nest

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 20
  • Comments: 31 (9 by maintainers)

Most upvoted comments

Hey folks, here are the results of my repro attempts. I was able to confirm this issue report, at least for the datadog output.

Base Config

I tested 4 different outputs:

  • cloudwatch_logs
  • es (sending to Amazon OpenSearch, and I also sourced credentials from the STS API, so two AWS service endpoints are called)
  • http (sending to google.com, because I’m just trying to test DNS resolution)
  • datadog

Starting with this configuration, I made some simple modifications in each test (results below; the only modification was adding the net.dns.mode setting in versions that support it):
[INPUT]
    Name dummy
    Tag dummy

[OUTPUT]
    Name datadog
    Match *
    Host http-intake.logs.datadoghq.com
    TLS On
    apikey  REDACTED
    dd_service my-test-service-dns-issue
    dd_source fluent-bit
    dd_tags project:example
    provider ecs


[OUTPUT]
    Name  http
    Match *
    Host  google.com
    Port  80
    URI   /

[OUTPUT]
    Name  es
    Match *
    Host  REDACTED
    Port  443
    Index my_index
    Type  my_type
    AWS_Auth On
    AWS_Region us-west-2
    tls     On
    AWS_Role_Arn REDACTED

[OUTPUT]
    Name cloudwatch_logs
    Match   *
    region us-east-1
    log_group_name fluent-bit-cloudwatch
    log_stream_prefix from-fluent-bit-
    auto_create_group On
    net.dns.mode TCP
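
As mentioned above, the "with TCP DNS" / "with UDP DNS" test variants only added the net.dns.mode line to the relevant output; for example, a sketch of the datadog output with DNS forced over TCP (same placeholder values as the base config):

[OUTPUT]
    Name datadog
    Match *
    Host http-intake.logs.datadoghq.com
    TLS On
    apikey REDACTED
    # Only change versus the base config: resolve DNS over TCP instead of UDP
    net.dns.mode TCP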

Testing Env

Just my Mac on my local home network, running fluent bit in Docker.

Versions Tested

This confirms the issue is in 1.8.5+, at least for the datadog output.

For AWS for Fluent Bit customers, see our release page to map fluent bit versions to our versions: https://github.com/aws/aws-for-fluent-bit/releases

✅ == DNS resolution works. ❌ == DNS resolution failed. I specifically saw this message:

[2021/09/14 01:23:09] [ warn] [net] getaddrinfo(host='http-intake.logs.datadoghq.com', err=12): Timeout while contacting DNS servers
Results matrix (the ✅/❌ cell values were lost in extraction, so only the structure is reproduced here):

  • Columns (outputs): cloudwatch_logs, datadog, http (google.com), es (sending to Amazon OpenSearch)
  • Rows (versions): 1.8.0, 1.8.1, 1.8.2, 1.8.3, 1.8.4, 1.8.5, 1.8.5 with TCP DNS, 1.8.5 with UDP DNS, 1.8.6, 1.8.6 with TCP DNS, 1.8.6 with UDP DNS

Per the summary above, the datadog output resolved DNS successfully on 1.8.0 through 1.8.4 and failed starting with 1.8.5.

It's not a startup error; the errors keep popping up continuously.

With the 1.8.15 version of the fluent bit image, we are seeing DNS errors with the es output plugin. Below is a sample log from the fluent bit pod.

[2022/04/27 17:47:00] [ warn] [net] getaddrinfo(host='xxxx', err=12): Timeout while contacting DNS servers
[2022/04/27 17:47:01] [ warn] [http_client] cannot increase buffer: current=512000 requested=544768 max=512000
[2022/04/27 17:47:07] [ info] [input:tail:tail.1] inode=188774993 handle rotation(): /var/log/containers/fluent-bit-5lddk_istio-system_istio-proxy-d9a45bb8902ccc4688952ec360a4debbf22952c9b3541ff8e5e935acac99920b.log => /var/lib/docker/containers/d9a45bb8902ccc4688952ec360a4debbf22952c9b3541ff8e5e935acac99920b/d9a45bb8902ccc4688952ec360a4debbf22952c9b3541ff8e5e935acac99920b-json.log.4

When I tried with the 1.8.4 fluent bit image, I didn't see any errors related to DNS.

I am also seeing this and previously reported it against https://github.com/aws/aws-for-fluent-bit/issues/233 when they bumped their version from fluent-bit 1.8.3 to 1.8.6. I'll add to this report that it is definitely not a host networking issue, as nslookup is able to resolve the hostnames with no issues when SSH'd directly into these containers. Thanks.

Close, apt-get -y install td-agent-bit=1.8.4 will do the trick!

We are seeing this in our environment as well. As others have mentioned, downgrading to 1.8.4 fixes the problem.