telegraf: Telegraf 1.20.3 to 1.21.2 failing to start up - mqtt.output fails
Relevant telegraf.conf
[[outputs.mqtt]]
servers = ["tcp://localhost:1883"] # required.
topic_prefix = "telegraf"
qos = 2
client_id = "home_telegraf"
data_format = "json"
System info
Telegraf 1.20.3 and 1.20.4, Debian GNU/Linux 10 (buster) 4.19.0-18-amd64, mosquitto version 2.0.12
Docker
No response
Steps to reproduce
systemd attempts to start the Telegraf service
Expected behavior
Telegraf should start up. Telegraf used to start up before recent updates.
Actual behavior
Telegraf fails to start with the following errors in the log:
Nov 25 15:24:15 home systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Nov 25 15:24:15 home systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! Starting Telegraf 1.20.4
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! Loaded inputs: cpu disk diskio dns_query ethtool internal kernel mem net netstat nstat openweathermap ping processes procstat (7x) sensors snmp swap system systemd_units wireless x509_cert
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! Loaded aggregators:
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! Loaded processors:
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! Loaded outputs: influxdb mqtt
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! Tags enabled: host=home
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! [agent] Config: Interval:15s, Quiet:false, Hostname:"home", Flush Interval:10s
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z E! [agent] Failed to connect to [outputs.mqtt], retrying in 15s, error was 'network Error : dial tcp: lookup tcp on 192.168.1.1:53: no such host'
Nov 25 15:24:30 home telegraf[2417]: 2021-11-25T15:24:30Z E! [telegraf] Error running agent: connecting output outputs.mqtt: Error connecting to output "outputs.mqtt": network Error : dial tcp: lookup tcp on 192.168.1.1:53: no such host
Nov 25 15:24:30 home systemd[1]: telegraf.service: Main process exited, code=exited, status=1/FAILURE
Nov 25 15:24:30 home systemd[1]: telegraf.service: Failed with result 'exit-code'.
Nov 25 15:24:30 home systemd[1]: telegraf.service: Service RestartSec=100ms expired, scheduling restart.
Nov 25 15:24:30 home systemd[1]: telegraf.service: Scheduled restart job, restart counter is at 12.
Note that Telegraf is attempting (and failing) to do a DNS lookup to 192.168.1.1 - this is the correct DNS server for this network and works for everything else. However, it should not be doing a DNS lookup anyway since the Mosquitto server is specified using an IP address.
What’s more, failure to connect to the broker should NOT crash Telegraf.
Additional info
Note that I also tried changing the broker server to ["tcp://127.0.0.1:1883"] and the error was the same.
About this issue
- State: closed
- Created 3 years ago
- Comments: 15 (3 by maintainers)
@TotallyInformation: Thank you for the hint. With keep_alive it is working for unencrypted connections now.
Yeah, I was just able to reproduce after updating to mosquitto 2.0.14!
For 1 - We should at least update the docs to specify that it is not required. I can follow up with that.
For 2 - It looks like this was identified as part of https://github.com/influxdata/telegraf/pull/9803, and the docs specifically call out that the value must be set for v2.0.12. I will also update the docs to make this more visible and to specify that it applies not just to v2.0.12 but to later versions as well, since I saw it with 2.0.14.
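For anyone landing here with the same error, a minimal sketch of the keep_alive workaround applied to the config from the report. The option name comes from this thread; the value 30 and the assumption that it is an integer number of seconds are illustrative only, so check the outputs.mqtt README for your Telegraf version before relying on it.

```toml
[[outputs.mqtt]]
  servers = ["tcp://localhost:1883"]
  topic_prefix = "telegraf"
  qos = 2
  client_id = "home_telegraf"
  data_format = "json"
  # Assumed value: a non-zero keep-alive (taken here to be in seconds).
  # Per this thread, mosquitto 2.0.12 and later need this to be set.
  keep_alive = 30
```

After changing the config, restart the Telegraf service and check the log to confirm the output connects.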
For your edit about Telegraf not falling over: I would not classify this as a config error. In this situation, a connection could not be established to an output. As a part of the initialization process, Telegraf will attempt to connect to all outputs, and if there are any failures during that step, Telegraf will stop. This is absolutely the intended behavior. I do think we could do better with retries in some situations, but Telegraf should not start running if it cannot connect to its outputs in the first place.