node_exporter: Consider implementing timeout for collectors
I’ve just noticed Prometheus failing to get metrics from node_exporter for the last 2 days (my workstation’s uptime is 3 days, so node_exporter actually worked for only about a day after the last reboot). The node_exporter log complains about “too many open files”, and it has indeed hit the 1024 open file limit:
# ls /proc/$(pidof node_exporter)/fd | wc -l
1023
But most of them are sockets:
# ls -l /proc/$(pidof node_exporter)/fd | grep socket | wc -l
1018
And I believe all of them except one are leaked, because there are no related connections:
# ss -anp | grep node_exporter
tcp LISTEN 0 128 127.0.0.1:9100 *:* users:(("node_exporter",pid=9801,fd=4))
I also noticed that node_exporter uses too much memory. I’m not sure whether this is related to this issue, but it shouldn’t be the second process in the memory usage top:
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
1638 powerman 20 0 2514M 1476M 77384 S 0.5 18.6 1h44:02 /usr/bin/firefox
9801 root 20 0 12.5G 432M 2764 S 0.0 5.4 10:53.37 /home/powerman/gocode/bin/node_exporter
I’m running node_exporter as a service under runit using this run file:
#!/bin/sh
exec 2>&1
exec /home/powerman/gocode/bin/node_exporter -web.listen-address 127.0.0.1:9100 \
-collectors.enabled "conntrack,diskstats,entropy,filefd,filesystem,loadavg,mdadm,meminfo,netdev,netstat,sockstat,stat,textfile,time,uname,version,vmstat,runit,tcpstat"
I’m now using node_exporter d890b63fb5ebf6144766cd43bf29f0a0e6192491 (about a month old), and at a glance the subsequent commits don’t mention anything related to this issue, so I suppose it’s still relevant. I will try to avoid restarting it for the next couple of days in case you need more details from the live process, but if it continues to eat RAM I may have to restart it.
I’m strongly in favour of setting a timeout for collectors.
We’re using the up metric for the Node Exporter as a canary for checking that hosts are up (replacing Nagios’ active checks that use ICMP ping), and we’re seeing false positives because the Node Exporter occasionally stops responding to HTTP requests (even on /, which is odd). When the Node Exporter starts responding to requests again (without us taking any remedial action), there’s a log line showing that a connection to dbus (used by the systemd collector) timed out after 400+ seconds.
There’s an issue with dbus/systemd to be investigated, but I think a timeout is necessary to better isolate failures in an individual collector, especially when using the Node Exporter as a canary.
Would a PR be accepted that sets a global timeout (applicable to all collectors), configured using a commandline flag? What would the default be, 60 seconds (being generous to start with)?
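For illustration only, here is a minimal sketch of what such a flag-driven timeout could look like; the collector.timeout flag name, the Collector interface shape, and the runWithTimeout helper are assumptions made up for this sketch, not node_exporter’s actual code. Note that this only stops the scrape from waiting on a stuck collector - the underlying goroutine cannot be killed and keeps running:

package collector

import (
	"errors"
	"flag"
	"log"
	"time"
)

// Collector is a stand-in for a node_exporter collector; the real
// interface differs, this type exists only for illustration.
type Collector interface {
	Name() string
	Update() error
}

// "collector.timeout" is a hypothetical flag name, not an actual node_exporter flag.
var collectorTimeout = flag.Duration("collector.timeout", 60*time.Second,
	"Maximum time a single collector may run before the scrape gives up on it.")

var errCollectorTimeout = errors.New("collector timed out")

// runWithTimeout runs c.Update in its own goroutine and waits at most
// *collectorTimeout for it to finish. On timeout the goroutine is merely
// abandoned; the scrape can then report the collector as failed instead
// of hanging (and holding the client socket open) indefinitely.
func runWithTimeout(c Collector) error {
	done := make(chan error, 1) // buffered so the goroutine can always send and exit
	go func() { done <- c.Update() }()

	select {
	case err := <-done:
		return err
	case <-time.After(*collectorTimeout):
		log.Printf("collector %s timed out after %s, abandoning its goroutine",
			c.Name(), *collectorTimeout)
		return errCollectorTimeout
	}
}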
@pmb311 Yes, we’re currently working on some other issues. We don’t have a strict release schedule, but I may put together a bugfix release sometime soon.
@byxorna The node exporter already supports concurrent scrapes, and every collector is run in its own goroutine. So a single stuck scrape wouldn’t block other scrapes, but would just leak goroutines/sockets, as reported here. And there’s no way to abort e.g. a read from the filesystem after some timeout. There’s also no general way of stopping collector goroutines other than threading Contexts through everything and inserting checkpoints into each collector where the collector can terminate itself prematurely if the context has been cancelled.
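As a rough sketch of that context-threading idea (the real collector interface is different and, as noted above, does not currently take a context): a collector’s Update method would accept a context.Context from the scrape handler, check it at safe points, and pass it to any call that supports cancellation. The systemdCollector type, the Update(ctx) signature, and the systemctl call below are assumptions for illustration only:

package collector

import (
	"context"
	"fmt"
	"os/exec"
)

// systemdCollector is a made-up example; the point is only to show where a
// context checkpoint and a context-aware call would go inside a collector.
type systemdCollector struct{}

// Update accepts a context so the caller (the scrape handler) can wrap the
// scrape in context.WithTimeout and the collector can bail out at
// well-defined points instead of blocking forever.
func (c *systemdCollector) Update(ctx context.Context) error {
	// Checkpoint: give up immediately if the scrape deadline has already passed.
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("scrape cancelled before collection: %v", err)
	}

	// Context-aware external call: exec.CommandContext kills the child
	// process when ctx is cancelled. A dbus library call would need its own
	// context support, and a plain blocking read from /proc or /sys cannot
	// be interrupted this way - which is the limitation described above.
	out, err := exec.CommandContext(ctx, "systemctl", "list-units", "--type=service").Output()
	if err != nil {
		return fmt.Errorf("listing units: %v", err)
	}
	_ = out // parse the output and export metrics here
	return nil
}

The scrape handler would then create the context per request, e.g. with context.WithTimeout(r.Context(), timeout), and pass it to every collector.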
It looks like when I run strace it somehow “unblocks” node_exporter. Probably when I ran it for the first time after your first request, this resulted in closing those 1000 sockets. For example, here is the strace -ff -p $(pidof node_exporter) output for one such fd (the same happens for all of them):
I don’t know how to find out “where are the sockets connecting to?” using strace. It looks like they weren’t connected to anything - the ss output is authoritative here: their connections were closed some time ago, but node_exporter didn’t close these fds. The strace output above probably means node_exporter (or Go itself) hangs just before or in the middle of writing the response to the socket, and when strace attached to the process it somehow unblocked these write() calls, which then got EPIPE because the connections had already been closed by the Linux kernel long ago.
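To illustrate that last hypothesis with a self-contained sketch (unrelated to node_exporter’s own code): a write on a TCP connection that the peer has already reset only fails once the write is actually attempted, so a goroutine that is stuck before its write never notices that the client is long gone. The addresses and timings below are arbitrary:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// Client: connect, wait briefly, then close with SO_LINGER=0 so the
	// kernel sends a RST, mimicking a scraper that has long since gone away.
	go func() {
		conn, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			panic(err)
		}
		time.Sleep(200 * time.Millisecond) // let the server accept first
		conn.(*net.TCPConn).SetLinger(0)   // make Close() send a RST
		conn.Close()
	}()

	srvConn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	defer srvConn.Close()

	// Simulate a response that is produced long after the peer disappeared.
	time.Sleep(2 * time.Second)

	// The first write may still be accepted into the kernel buffer; keep
	// writing until the connection reset / broken pipe error surfaces.
	for i := 0; ; i++ {
		if _, err := srvConn.Write([]byte("late response\n")); err != nil {
			fmt.Printf("write %d failed: %v\n", i, err)
			return
		}
	}
}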