telegraf: New snmp plugin a bit slow

I have a few problems with performance of the new SNMP plugin.

When doing an snmpwalk of EtherLike-MIB::dot3StatsTable, IF-MIB::ifXTable and IF-MIB::ifTable on a Cisco router, the walks complete in ~2, ~3 and ~3.3 seconds respectively (~8.3 seconds combined, +/- 10%).

When polling with the snmp plugin it takes 17-19 seconds for a single run.

I’m unsure whether the snmp plugin polls every host in parallel or in sequence. I only have one host to test against, and even when I put each of the three tables in a separate [[inputs.snmp]] section (roughly the split sketched below), they are polled sequentially rather than in parallel.
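
For reference, the per-table split looks roughly like this; the agent address and community string are placeholders:

```toml
# Placeholder agent address and community string; two of the three tables shown.
[[inputs.snmp]]
  agents = ["192.168.1.1:161"]
  version = 2
  community = "public"

  [[inputs.snmp.table]]
    oid = "IF-MIB::ifTable"

[[inputs.snmp]]
  agents = ["192.168.1.1:161"]
  version = 2
  community = "public"

  [[inputs.snmp.table]]
    oid = "IF-MIB::ifXTable"
```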

Our need is to poll hundreds of devices with hundreds of interfaces each every 5 or 10 seconds (which collectd and libsnmp handle easily).

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Reactions: 3
  • Comments: 50 (27 by maintainers)

Most upvoted comments

No offense, but why does every single ticket that mentions the snmp plugin get an advertisement for snmpcollector?

Thanks @danielnelson. Just tried the latest nightly build and it is a huge improvement.

Time taken to poll 10 devices on the latest nightly:

real 0m4.696s user 0m0.363s sys 0m0.098s

Time taken to poll the same 10 devices with the stable release:

real 0m16.728s user 0m0.377s sys 0m0.116s

I will add more devices and report times

@toni-moreno the gosnmp library is built in such a way that each remote server is hardcoded into the base object. Looking at your parallel scripts, you’re only attempting to poll one device (the loopback address, at that) and mainly parallelizing the OID walks you’re doing. What this plugin is attempting to leverage is polling hundreds of different devices with potentially different OIDs. (My original use case had 500+ devices, each pulling hundreds of similar OIDs.)

This leaves us with two choices:

  1. Instantiate hundreds of gosnmp instances and poll them in parallel (see the sketch after this list)
  2. Serially reset the underlying gosnmp instance to point at the next device and move on
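
For illustration, option 1 amounts to something like the following minimal sketch: one goroutine and one gosnmp instance per device, since each instance is bound to a single target. The targets, community string and OID are placeholders, error handling is trimmed, and the import path shown is the current one (older versions lived under github.com/soniah/gosnmp).

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/gosnmp/gosnmp"
)

// pollDevice walks one table OID on a single agent, using its own
// gosnmp instance, since the library binds each instance to one target.
func pollDevice(target, oid string) error {
	g := &gosnmp.GoSNMP{
		Target:    target,
		Port:      161,
		Community: "public", // placeholder community string
		Version:   gosnmp.Version2c,
		Timeout:   2 * time.Second,
		Retries:   1,
	}
	if err := g.Connect(); err != nil {
		return err
	}
	defer g.Conn.Close()

	return g.BulkWalk(oid, func(pdu gosnmp.SnmpPDU) error {
		fmt.Println(pdu.Name, pdu.Value)
		return nil
	})
}

func main() {
	// Placeholder targets; the real use case is hundreds of devices.
	devices := []string{"10.0.0.1", "10.0.0.2"}

	var wg sync.WaitGroup
	for _, d := range devices {
		wg.Add(1)
		go func(target string) {
			defer wg.Done()
			// .1.3.6.1.2.1.31.1.1 is IF-MIB::ifXTable
			if err := pollDevice(target, ".1.3.6.1.2.1.31.1.1"); err != nil {
				fmt.Println(target, "error:", err)
			}
		}(d)
	}
	wg.Wait()
}
```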

As to @phemmer’s concerns, I don’t have a great understanding of the underlying gosnmp library and would prefer he address that. Reading through it, I do see some areas where parallel use will hit slowdowns and wait times when sending requests, such as the sendOneRequest() and send() functions in marshal.go: there is a full pause while waiting between retries, and the retry timer is only checked for expiry at the beginning of the loop.
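
To make the shape of that concrete, the pattern being described is roughly the following (an illustration only, not the actual gosnmp source):

```go
package sketch

import "time"

// Illustration of a blocking retry loop, not the actual gosnmp code:
// the retry deadline is checked only at the top of the loop, and each
// failed attempt blocks the caller for the full timeout before the
// next request goes out.
func sendWithRetries(send func() error, retries int, timeout time.Duration) error {
	var err error
	for attempt := 0; attempt <= retries; attempt++ { // deadline checked only here
		if err = send(); err == nil {
			return nil
		}
		time.Sleep(timeout) // full pause before the retry; the goroutine just waits
	}
	return err
}
```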

Honestly, I don’t know the best way to solve this use case. Appreciate input from both of you.

@phemmer My work situation has changed and I’m considering contributing to the project in my free time. I’m still seeing a notification every few weeks on this so it’s apparently still an issue.

I’m able to open a PR and contribute my earlier code (once I’ve updated it) that “fixed” some of the parallelization issues we saw. Are you okay with proceeding with it as a workaround until the goSNMP project can be fixed?

I started deep-diving into the goSNMP project and it’s a bit of a mess; it almost needs to be rebuilt from the RFCs up. I’m interested in how you’d recommend tackling it.

@willemdh I would open up a new issue. Your problem is not what this ticket is about. I would also suspect your config is a lot more complex than what you show, as the config you provided cannot account for that much CPU usage.

Hi @StianOvrevage, we are working on an SNMP collector tool for InfluxDB that handles lots of metrics well.

It’s different from Telegraf because it focuses only on SNMP devices, and it also has a web UI that makes configuration easy.

Perhaps you would like to test its performance.

https://github.com/toni-moreno/snmpcollector

Thank you and sorry for the spam