telegraf: Telegraf stops publishing metrics to InfluxDB; All plugins take too long to collect

Bug report

After a seemingly random amount of time, Telegraf stops publishing metrics to InfluxDB over UDP. I have been experiencing this issue since Nov 2016 on both Telegraf 1.3.x and 1.4.4 on FreeBSD, on three separate servers. In the Telegraf log, all collectors start to fail with:

Error in plugin [inputs.$name]: took longer to collect than collection interval (10s)

I can’t see anything unusual or interesting published from the Telegraf internal metrics.

This same issue has been reported in #3318, #2183, #2919, #2780 and #2870, but all of those issues were either abandoned by the requester or mixed up with several separate issues. I am opening a new issue for my specific problem, but if it’s a duplicate (#3318 seems to be the closest) then please feel free to close it.

Relevant telegraf.conf:

telegraf.conf

System info:

Telegraf v1.4.4 (git: unknown unknown) running on FreeBSD 10.3-RELEASE-p24

Steps to reproduce:

service telegraf start
wait $random time period

Expected behavior:

Telegraf publishes metrics to InfluxDB server over UDP

Actual behavior:

Telegraf stops publishing metrics at seemingly random times, and all input plugins start to fail with:

2018-01-02T17:16:50Z E! Error in plugin [inputs.processes]: took longer to collect than collection interval (10s)
2018-01-02T17:16:51Z E! Error in plugin [inputs.apache]: took longer to collect than collection interval (10s)
2018-01-02T17:16:51Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-02T17:16:51Z E! Error in plugin [inputs.internal]: took longer to collect than collection interval (10s)
2018-01-02T17:16:51Z E! Error in plugin [inputs.system]: took longer to collect than collection interval (10s)
2018-01-02T17:16:51Z E! Error in plugin [inputs.swap]: took longer to collect than collection interval (10s)
2018-01-02T17:16:52Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-02T17:16:58Z E! Error: statsd message queue full. We have dropped 1 messages so far. You may want to increase allowed_pending_messages in the config
2018-01-02T17:17:00Z E! Error in plugin [inputs.phpfpm]: took longer to collect than collection interval (10s)
2018-01-02T17:17:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (10s)

Additional info:

Full logs and stack trace

Earlier occurrence

Grafana snapshot of Internal Telegraf metrics

About this issue

State: closed
Created 6 years ago
Comments: 31 (6 by maintainers)

Most upvoted comments

I wouldn’t expect socket_listener to complain since it is a “service input”. This means that instead of being called each interval it is event driven: it adds new metrics when it receives one on its socket.

I suspect that, even though there is no log message, you cannot send items to the socket_listener and have them delivered to an output.

danielnelson on Jan 9, 2018
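
The distinction drawn in the comment above can be sketched roughly as follows. An event-driven input pushes each message into a shared channel as it arrives, so it has no collection interval to overrun. If nothing downstream is draining that channel, delivery can stall without any "took longer to collect" error. This is a hedged illustration, not the socket_listener implementation: the Metric type, the deliver function, and the choice to drop when the buffer is full (rather than block) are all assumptions.

```go
package main

import "fmt"

// Metric is a hypothetical stand-in for a collected data point.
type Metric string

// deliver models an event-driven ("service") input. Each incoming message is
// forwarded to out immediately, with no collection interval involved. If the
// channel buffer is full because nothing is reading from it, the message is
// dropped and counted. Dropping instead of blocking is an assumption here.
func deliver(messages []string, out chan<- Metric) (dropped int) {
	for _, m := range messages {
		select {
		case out <- Metric(m):
			// Accepted into the buffer.
		default:
			dropped++ // buffer full, no reader is draining the channel
		}
	}
	return dropped
}

func main() {
	// Small buffer and no consumer, simulating a stalled downstream.
	out := make(chan Metric, 2)

	dropped := deliver([]string{"a", "b", "c", "d"}, out)
	fmt.Println("buffered:", len(out), "dropped:", dropped)
	// prints "buffered: 2 dropped: 2"
}
```

The statsd line in the log above ("message queue full. We have dropped 1 messages so far") would be consistent with a service input hitting a full queue in this way, though the sketch does not claim to reproduce that plugin's logic.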