fluent-bit: DNS resolution timeout/failure in >= 1.8.5
Bug Report
Describe the bug
Hi, I have been facing DNS resolution timeouts/failures since upgrading to >= 1.8.5, using the forward output plugin to send to a fluentd instance. It works fine with 1.8.4. I am running Ubuntu 20.04, and the local resolver accepts both UDP and TCP requests. I tried setting net.dns.mode to UDP, but it changes nothing. I am guessing there might be an issue with 1.8.5 and the changes to the DNS resolution library. I still get the same error when setting the upstream host to www.google.com.
To Reproduce
I have replaced the real fluentd hostname with fluentd.example.org in this log:
[2021/09/03 11:27:49] [ info] [engine] started (pid=2320367)
[2021/09/03 11:27:49] [ info] [storage] version=1.1.1, initializing...
[2021/09/03 11:27:49] [ info] [storage] root path '/var/td-agent-bit/storage'
[2021/09/03 11:27:49] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/09/03 11:27:49] [ info] [storage] backlog input plugin: storage_backlog.2
[2021/09/03 11:27:49] [ info] [cmetrics] version=0.2.1
[2021/09/03 11:27:49] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 95.4M
[2021/09/03 11:27:55] [ info] [http_server] listen iface=127.0.0.1 tcp_port=2020
[2021/09/03 11:27:55] [ info] [sp] stream processor started
[2021/09/03 11:27:55] [ info] [input:tail:tail_proftpd_log] inotify_fs_add(): inode=516227 watch_fd=1 name=/var/log/proftpd/commandsAsJson.log
[2021/09/03 11:27:55] [ info] [input:tail:tail_history_log] inotify_fs_add(): inode=260310 watch_fd=1 name=/var/td-agent-bit/input/commandsAsJson.history.log
[2021/09/03 11:30:34] [ warn] [net] getaddrinfo(host='fluentd.example.org', err=12): Timeout while contacting DNS servers
[2021/09/03 11:30:34] [error] [output:forward:forward_to_fluentd] no upstream connections available
[2021/09/03 11:30:34] [ warn] [engine] failed to flush chunk '2320367-1630661431.54171690.flb', retry in 10 seconds: task_id=0, input=tail_proftpd_log > output=forward_to_fluentd (out_id=0)
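To see how often the resolver failure recurs, and for which hosts, a log like the one above can be summarized with standard tools. This is a minimal sketch, not part of the original report: the function names are hypothetical, and it assumes the log was saved to a file and uses the exact warning text shown above.

```shell
#!/bin/sh
# Hypothetical helpers for summarizing DNS failures in a saved Fluent Bit log.

# Count the lines that contain the DNS timeout warning.
count_dns_timeouts() {
    # -c prints the number of matching lines; -F matches a fixed string.
    grep -c -F 'Timeout while contacting DNS servers' "$1"
}

# List each distinct host named in a getaddrinfo(...) warning.
dns_failure_hosts() {
    sed -n "s/.*getaddrinfo(host='\([^']*\)'.*/\1/p" "$1" | sort -u
}
```

For example, `count_dns_timeouts fluent-bit.log` followed by `dns_failure_hosts fluent-bit.log` (the file name is a placeholder) shows whether the errors are a one-off at startup or keep recurring.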
- Steps to reproduce the problem:
# Output all logs to fluentd instances
[OUTPUT]
    Name forward
    Alias forward_to_fluentd
    Match das.scanner.*
    Upstream upstream.conf
    Retry_Limit False
    tls on
[UPSTREAM]
    name forward-balancing
[NODE]
    name fluentd
    host fluentd.example.org
    port 24224
    tls on
Expected behavior
Messages should be sent to the upstream fluentd service.
Your Environment
- Version used: Failed in 1.8.5 and 1.8.6. Works in 1.8.4
- Server type and version: Public cloud VM
- Operating System and version: Ubuntu 20.04
- Filters and plugins: grep and nest
About this issue
- State: closed
- Created 3 years ago
- Reactions: 20
- Comments: 31 (9 by maintainers)
Commits related to this issue
- output: initialize network defaults for output instances (#4050) For plugins that do not implement a config map interface, the networking setup was missing, leading to connect_timeout=0, no keep aliv... — committed to fluent/fluent-bit by edsiper 3 years ago
- output: initialize network defaults for output instances (#4050) (#4088) For plugins that do not implement a config map interface, the networking setup was missing, leading to connect_timeout=0, no ... — committed to fluent/fluent-bit by edsiper 3 years ago
- output: initialize network defaults for output instances (#4050) (#4088) For plugins that do not implement a config map interface, the networking setup was missing, leading to connect_timeout=0, no ... — committed to pwhelan/fluent-bit by edsiper 3 years ago
- fix: Downgrade to fluenbit 1.8.4 Due to https://github.com/fluent/fluent-bit/issues/4050 — committed to Geocodio/docker-fluentbit-docker-client by MiniCodeMonkey 2 years ago
Hey folks, here are the results of my repro attempts. I was able to confirm this issue report, at least for the datadog output.
Base Config
I tested 4 different outputs:
net.dns.mode setting in versions that support it).
Testing Env
Just my Mac on my local home network, running fluent bit in Docker.
Versions Tested
This confirms the issue is in 1.8.5+, at least for the datadog output.
For AWS for Fluent Bit customers, see our release page to map fluent bit versions to our versions: https://github.com/aws/aws-for-fluent-bit/releases
✅ == DNS resolution works. ❌ == DNS resolution failed. I specifically saw this message:
It’s not a startup error; the errors keep popping up continuously.
With the 1.8.15 version of the fluent bit image, we are seeing the DNS errors with the es output plugin. Below is a sample log from the fluent bit pod.
When I tried the 1.8.4 fluent bit image, I didn’t see any errors related to DNS.
I am also seeing this and previously reported against https://github.com/aws/aws-for-fluent-bit/issues/233 when they bumped their version from fluent-bit 1.8.3 -> 1.8.6. I’ll add to this report that it is definitely not a host networking issue as nslookup is able to resolve the hostnames with no issues when SSH’d directly to these containers. Thanks.
Close, apt-get -y install td-agent-bit=1.8.4 will do the trick!
We are seeing this in our environment as well. As others have mentioned, downgrading to 1.8.4 fixes the problem.
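The downgrade workaround mentioned above can be sketched as a small script for Debian/Ubuntu hosts. This is only an illustration under stated assumptions: the `run` and `pin_td_agent_bit` helpers and the `DRY_RUN` variable are hypothetical names introduced here, the commands need root, and holding the package is an extra step not taken from the thread, added so a routine upgrade does not reinstall an affected version.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Install a specific td-agent-bit version and keep apt from upgrading it.
pin_td_agent_bit() {
    version="$1"
    # --allow-downgrades permits installing a version older than the current one.
    run apt-get -y --allow-downgrades install "td-agent-bit=${version}"
    # Hold the package so later upgrades skip it until the hold is removed.
    run apt-mark hold td-agent-bit
}
```

Setting `DRY_RUN=1` before calling `pin_td_agent_bit 1.8.4` prints the two commands without running them; `apt-mark unhold td-agent-bit` releases the hold once a fixed release is available.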