node_exporter: Consider implementing timeout for collectors

I’ve just noticed that Prometheus has been failing to get metrics from node_exporter for the last 2 days (my workstation’s uptime is 3 days, so node_exporter actually worked for only about a day after the last reboot). The node_exporter log complains about “too many open files”, and it really has hit the 1024 open file limit:

# ls /proc/$(pidof node_exporter)/fd | wc -l
   1023

But most of them are sockets:

# ls -l /proc/$(pidof node_exporter)/fd | grep socket | wc -l
   1018

And I believe all of them except one are leaked, because there are no related connections:

# ss -anp | grep node_exporter
tcp    LISTEN     0      128    127.0.0.1:9100                  *:*                   users:(("node_exporter",pid=9801,fd=4))

I also noticed that node_exporter uses too much memory. I’m not sure whether this is related to this issue, but it shouldn’t be the second process in the memory usage top:

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 1638 powerman   20   0 2514M 1476M 77384 S  0.5 18.6  1h44:02 /usr/bin/firefox
 9801 root       20   0 12.5G  432M  2764 S  0.0  5.4 10:53.37 /home/powerman/gocode/bin/node_exporter

I’m running node_exporter as a service under runit using this run file:

#!/bin/sh
exec 2>&1
exec /home/powerman/gocode/bin/node_exporter -web.listen-address 127.0.0.1:9100 \
    -collectors.enabled "conntrack,diskstats,entropy,filefd,filesystem,loadavg,mdadm,meminfo,netdev,netstat,sockstat,stat,textfile,time,uname,version,vmstat,runit,tcpstat"

I’m currently running node_exporter at d890b63fb5ebf6144766cd43bf29f0a0e6192491 (about a month old), and at a glance the subsequent commits don’t mention anything related to this issue, so I suppose it’s still relevant. I’ll try to avoid restarting it for the next couple of days in case you need more details from the live process, but if it continues to eat RAM I may have to restart it.

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 2
  • Comments: 52 (31 by maintainers)

Most upvoted comments

I’m strongly in favour of setting a timeout for collectors.

We’re using the up metric for the Node Exporter as a canary for checking that hosts are up (replacing the Nagios active checks that use ICMP ping), and we’re seeing false positives because the Node Exporter occasionally stops responding to HTTP requests (even on /, which is odd).

When the Node Exporter starts responding to requests again (without us taking any remedial action), there’s a log line showing that a connection to dbus (used by the systemd collector) timed out after 400+ seconds.

There’s an issue with dbus/systemd to be investigated, though I think a timeout is necessary to better isolate failures in an individual collector, especially when using the Node Exporter as a canary.

Would a PR be accepted that sets a global timeout (applicable to all collectors), configured via a command-line flag? What would the default be, 60 seconds (being generous to start with)?
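
For illustration, here’s a rough sketch of what I have in mind; the flag name -collector.timeout and the runWithTimeout wrapper are hypothetical, not the exporter’s actual API:

package main

import (
	"flag"
	"fmt"
	"time"
)

// Hypothetical global flag; the exporter's real flag handling may differ.
var collectorTimeout = flag.Duration("collector.timeout", 60*time.Second,
	"Maximum time a single collector may run before its result is discarded.")

// runWithTimeout runs collect in a goroutine and waits at most
// *collectorTimeout for it to finish. A stuck collect goroutine keeps
// running in the background; Go has no way to kill it from the outside.
func runWithTimeout(name string, collect func() error) error {
	done := make(chan error, 1)
	go func() { done <- collect() }()
	select {
	case err := <-done:
		return err
	case <-time.After(*collectorTimeout):
		return fmt.Errorf("collector %q timed out after %s", name, *collectorTimeout)
	}
}

func main() {
	flag.Parse()
	err := runWithTimeout("example", func() error {
		time.Sleep(100 * time.Millisecond) // stand-in for real collection work
		return nil
	})
	fmt.Println("collector result:", err)
}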

@pmb311 Yes, we’re currently working on some other issues. We don’t have a strict release schedule, but I may put together a bugfix release sometime soon.

@byxorna The node exporter already supports concurrent scrapes, and every collector is run in its own goroutine. So a single stuck scrape wouldn’t block other scrapes; it would just leak goroutines and sockets, as reported here. And there’s no way to abort, e.g., a read from the filesystem after some timeout. There’s also no general way of stopping collector goroutines other than threading Contexts through everything and inserting code points into each collector where the collector can terminate itself prematurely if the context has been cancelled. A rough sketch of what that would mean for a single collector is below.
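
This is only illustrative; collectFilesystem and statMount are made-up names, not the actual collector interface:

package main

import (
	"context"
	"fmt"
	"time"
)

// collectFilesystem is a hypothetical collector body showing the cancellation
// "code points" mentioned above: between steps it checks whether the context
// has been cancelled and returns early. A syscall that is already blocking in
// the middle of a step still cannot be interrupted this way.
func collectFilesystem(ctx context.Context, mounts []string) error {
	for _, m := range mounts {
		select {
		case <-ctx.Done():
			return fmt.Errorf("filesystem collector cancelled: %w", ctx.Err())
		default:
		}
		if err := statMount(m); err != nil { // may still block with no way to abort it
			return err
		}
	}
	return nil
}

// statMount stands in for the real per-mountpoint stat call.
func statMount(m string) error { return nil }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	fmt.Println(collectFilesystem(ctx, []string{"/", "/home"}))
}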

It looks like running strace somehow “unblocks” node_exporter. Probably when I ran it for the first time after your first request, this resulted in closing those 1000 sockets. For example, here is strace -ff -p $(pidof node_exporter) output for one such fd (the same happens for all of them):

[pid  8972] write(15, "HTTP/1.1 200 OK\r\nContent-Encodin"..., 4096) = -1 EPIPE (Broken pipe)
[pid  8972] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=9801, si_uid=0} ---
[pid  8972] rt_sigreturn()              = -1 EPIPE (Broken pipe)
[pid 30753] futex(0xc830d0e508, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  8972] epoll_ctl(5, EPOLL_CTL_DEL, 15, c840c894b4) = 0
[pid  8972] close(15)                   = 0
[pid  8972] futex(0xc8385ca908, FUTEX_WAIT, 0, NULL <unfinished ...>

I don’t know how to find out “where are the sockets connected to?” using strace. It looks like they aren’t connected to anything; the ss output is authoritative here: their connections were closed some time ago, but node_exporter didn’t close these fds. The strace output above probably means node_exporter (or Go itself) hangs just before or in the middle of writing the response to the socket. When strace attached to the process, it somehow unblocked these write() calls, and they got EPIPE because these connections had already been closed by the Linux kernel long ago.
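
If the hang really is in the response write, one possible mitigation (independent of per-collector timeouts) would be setting read/write deadlines on the HTTP server, so a stalled write eventually fails and the fd is closed. This is only a sketch assuming a plain net/http server around the /metrics handler; the handler and timeout values here are placeholders:

package main

import (
	"net/http"
	"time"
)

func main() {
	// Register whatever serves /metrics; a trivial handler stands in here.
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("# metrics would go here\n"))
	})
	srv := &http.Server{
		Addr: "127.0.0.1:9100",
		// Bound how long reading the request and writing the response may take,
		// so a peer that disappeared cannot keep the fd and goroutine alive forever.
		ReadTimeout:  30 * time.Second,
		WriteTimeout: 5 * time.Minute,
	}
	if err := srv.ListenAndServe(); err != nil {
		panic(err)
	}
}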