node_exporter: Metrics stuck / deadlock on some kernel versions

Host operating system: output of uname -a

Linux host 5.4.211-1.el8.elrepo.x86_64 #1 SMP Wed Aug 24 08:50:28 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.4.0 (branch: HEAD, revision: 7da1321761b3b8dfc9e496e1a60e6a476fec6018) build user: root@83d90983e89c build date: 20220926-12:32:56 go version: go1.19.1 platform: linux/amd64

node_exporter command line flags

--no-collector.cpufreq --no-collector.systemd --collector.sockstat --no-collector.logind --collector.textfile.directory=/var/lib/prometheus

Are you running node_exporter in Docker?

No

What did you do that produced an error?

The exporter gets stuck after 2-24 hours, at random.

What did you expect to see?

node_exporter not getting stuck

What did you see instead?

node_exporter gets stuck after a random amount of time. After running strace -f -p $(pgrep node_exporter), it becomes unstuck. pprof output: debug1.txt debug2.txt

Response time > 1 minute

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 8
  • Comments: 69 (12 by maintainers)

Most upvoted comments

For anyone needing to temporarily work around this issue: make a change to your systemd unit file (usually something like /etc/systemd/system/node_exporter.service or similar) under the [Service] section, like this:

[Service]
...
Environment=GOMAXPROCS=1
...

Afterwards, run systemctl daemon-reload and restart the service with systemctl restart node_exporter, and you should be good to go until a permanent fix is ready 😃
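
For reference, a minimal sketch of applying and verifying the workaround (assuming the unit is named node_exporter; adjust to your setup):

# Reload unit files and restart the exporter
sudo systemctl daemon-reload
sudo systemctl restart node_exporter

# Verify the environment variable actually reached the process
sudo systemctl show node_exporter --property=Environment
sudo tr '\0' '\n' < /proc/$(pgrep -x node_exporter)/environ | grep GOMAXPROCS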

Reproducer script

main.go

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/procfs/sysfs"
)

func main() {
	fs, err := sysfs.NewFS("/sys")
	if err != nil {
		panic(err)
	}

	// Enumerate network interfaces once; each request below re-reads their
	// sysfs attributes, which is what triggers the hang on affected kernels.
	netDevices, err := fs.NetClassDevices()
	if err != nil {
		panic(err)
	}

	http.HandleFunc("/", func(writer http.ResponseWriter, request *http.Request) {
		for _, device := range netDevices {
			_, err := fs.NetClassByIface(device)
			if err != nil {
				panic(err)
			}
		}
		fmt.Printf(". ")
		writer.WriteHeader(200)
		writer.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":9100", nil))
}

go.mod

module app

go 1.19

require github.com/prometheus/procfs v0.8.0

require golang.org/x/sync v0.0.0-20220601150217-0de741cfad7f // indirect

Then just run:

go run main.go &
for i in {0..10000} ; do curl localhost:9100/metrics ; done

After ~1 minute the script gets stuck.

I am currently in the process of bisecting the Ubuntu kernel to isolate the relevant change so that an upstream bug can be opened. Thanks @paragor for your reproducer, that makes it a lot easier…

P.S.: cadvisor is having exactly the same problem

I have the same issue on Ubuntu after updating the kernel to 5.4.0-132-generic.

I have published node_exporter v1.5.0 which now defaults GOMAXPROCS to 1.

It also has a new collector feature, --collector.netclass.netlink, that enables gathering metrics via netlink rather than sysfs. This is much more efficient, but lacks a couple of metrics that are available in sysfs (node_network_address_assign_type, node_network_device_id, and node_network_name_assign_type).

This new feature is not enabled by default, so you will need to include it in your deployment.
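
As a rough illustration (the flag name comes from the release notes above; the binary path in the unit line is just a placeholder), enabling the netlink-based collector looks like:

# v1.5.0+: gather netclass metrics via netlink instead of sysfs
node_exporter --collector.netclass.netlink

# or, in a systemd unit, extend ExecStart accordingly:
# ExecStart=/usr/local/bin/node_exporter --collector.netclass.netlink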

Ubuntu just released a new kernel package linux: 5.4.0-135.152 that supposedly fixes this issue:

This bug was fixed in the package linux - 5.4.0-135.152

---------------
linux (5.4.0-135.152) focal; urgency=medium

  * focal/linux: 5.4.0-135.152 -proposed tracker (LP: [#1997412](https://bugs.launchpad.net/bugs/1997412))

  * containerd sporadic timeouts (LP: [#1996678](https://bugs.launchpad.net/bugs/1996678))
    - epoll: call final ep_events_available() check under the lock
    - epoll: check for events when removing a timed out thread from the wait queue
    - Revert "fs: check FMODE_LSEEK to control internal pipe splicing"
    
    [...]

See https://bugs.launchpad.net/ubuntu/+source/containerd/+bug/1996678/comments/28
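
If you want to confirm whether you are already on the fixed kernel, a rough sketch using the standard apt workflow (exact package names depend on your kernel flavour):

# Check the currently running kernel
uname -r

# See which kernel packages an upgrade would pull in
apt list --upgradable 2>/dev/null | grep linux-image

# Install the latest 5.4 kernel (>= 5.4.0-135.152) and reboot
sudo apt update && sudo apt full-upgrade
sudo reboot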

Thanks all, it seems like the upstream kernel issues have been identified and fixed. I’m going to call this one solved.

The workaround is to add a GOMAXPROCS=1 environment variable, so from the CLI you should run docker run -e GOMAXPROCS=1 -ti prom/node-exporter (with whatever other parameters you were using before).

If you use docker-compose, you have to add something like:

node-exporter:
    image: prom/node-exporter
    environment:
      GOMAXPROCS: '1'

I can confirm the same issue: even a curl running against localhost:9100 hangs until the exporter is restarted. After some undefined amount of time the exporter responds again.

I noticed this first after a recent kernel update, but I'm not able to pinpoint the exact kernel version where the issue first came up. I can also confirm that this happened both with a quite old node_exporter version and with the latest 1.4.0.

After trying to find out more about the issue using log.level=debug, I can say that the log didn't show any helpful information.

In reply to @b0n0r's advice:

For Ubuntu 20.04 the service file is located at /lib/systemd/system/prometheus-node-exporter.service and environment variables for prometheus-node-exporter are in /etc/default/prometheus-node-exporter.

After adding GOMAXPROCS=1 to the env file, restart the service with systemctl restart prometheus-node-exporter.

Check the logs with journalctl -u prometheus-node-exporter -e | tail -n 111
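
A minimal sketch of those steps (assuming the packaged unit loads /etc/default/prometheus-node-exporter as an EnvironmentFile, as the Ubuntu/Debian packaging does):

# Append the workaround to the packaged env file
echo 'GOMAXPROCS=1' | sudo tee -a /etc/default/prometheus-node-exporter

# Restart and check the journal
sudo systemctl restart prometheus-node-exporter
journalctl -u prometheus-node-exporter -e | tail -n 20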

The problem has not gone away 100%, but it has become noticeably less frequent.

FWIW I got a hang on the second try of the for loop from @paragor above with kernel 5.4.214-120 on Amazon Linux 2, but after upgrading to 5.4.219-126.411.amzn2.x86_64 it’s survived over 20 runs without issues.

Hmm, Ubuntu with 5.4.219-0504219-generic still reproduces the issue.

AL2 has removed the "epoll: autoremove wakers even more aggressively" patch from its 5.4 branch, first released as 5.4.219-126.411, which is why you've been unable to reproduce the issue on the AL2 5.4.219-126.411 kernel.

- epoll: autoremove wakers even more aggressively

As my second bisect series also turned out without a meaningful result, I have taken Ubuntu-5.4.0-132.148 with that commit reverted and will run @paragor’s script against it. Let’s see if we got the culprit…

Can’t reproduce the bug with the commit reverted (Vagrant box w/ 10 vCPUs on a 20 vCPU host). Documented this finding in the Ubuntu bugtracker at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1990190

Edit: I also posted the issue to the linux-fsdevel mailing list.

If you’re on 20.04, you can use linux-generic-hwe-20.04-edge which is based off 5.15 at present. This doesn’t display the behavior observed in 5.4 (yet, at least…).
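
A hedged sketch of switching to that kernel series (standard Ubuntu archive metapackage; a reboot is required to boot into it):

sudo apt update
sudo apt install linux-generic-hwe-20.04-edge
sudo reboot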

Given that GOMAXPROCS=1 seems to help, that strace-ing the process and its threads seems to unblock it, and that when I do so I see a bunch of futex- and epoll-related activity, the following change looked interesting:

- epoll: autoremove wakers even more aggressively

I haven’t had time to investigate more because this reminded me that I need to reinstall and upgrade all of my 18.04 hosts.

@paragor Thanks for the test case. I made a slightly modified version.

One experiment I tried was setting GOMAXPROCS=1. This seemed to eliminate the issue.

Although, doing this in the node_exporter itself would make it slower, as it will have to serialize a lot of the sysfs file access.

I’m going to experiment with https://github.com/prometheus/node_exporter/pull/2492 today to see if switching from sysfs to netlink for this data would be a better solution.

Same here. Does anyone have a fix that doesn't require downgrading the kernel?

In my case, the problem started after an upgrade from 5.4.0-1088-aws to 5.4.0-1089-aws

FYI, I'm using node_exporter within a Docker container on EC2s running Ubuntu 18.04.1. The same erratic problems also started in cadvisor after this upgrade.

Reproduced locally on Ubuntu (Vagrant box https://app.vagrantup.com/bento/boxes/ubuntu-20.04/versions/202206.03.0 with the required kernel from https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.4.214/amd64/):

# uname -a
Linux vagrant 5.4.214-0504214-generic #202209200739 SMP Tue Sep 20 11:46:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux 
# uptime
13:15:18 up  5:47,  5 users,  load average: 0.18, 0.35, 0.52

command:

node_exporter --collector.disable-defaults --web.max-requests=0 --collector.netclass

Then start a load test via vegeta:

echo 'GET http://localhost:9100/metrics' |  vegeta attack -rate 50 | vegeta report -every 1s

The bug shows up after ~1-2 minutes.

I can't reproduce it without the netclass collector (node_exporter --web.max-requests=0 --no-collector.netclass) in a 20-minute test,

so I think the bug is in or around collector.netclass.

Also stable with 5.4.0-1092-aws (ubuntu 18 EC2). 10+ instances running for 24 hours without issues.

Sharing our poor man's docker-compose solution for automatic restarts until this issue is resolved:

    healthcheck:
      test: ["CMD-SHELL", "wget -O /dev/null --no-check-certificate -T 2 https://127.0.0.1:9100 2>&1 | grep -q 'timed out' && killall -TERM node_exporter && exit 1 || true"]
      interval: 10s
      timeout: 5s
      retries: 2

Note: change to http if you don't use TLS.

Yes, I’m pretty sure the observed behavior is a kernel regression. Something in reading sysfs is causing deadlocks.

I can also confirm: after updating my Ubuntu 18 LTS to 5.4.0-132 and rebooting, node_exporter takes more than a minute to export metrics (~100+ s).

A restart of the container fixes it temporarily.

This suddenly started to happen on some of the systems I administer as well.

The only relevant update we performed was a kernel upgrade today on plain Ubuntu 20.04: 5.4.0-128-generic -> 5.4.0-132-generic (5.4.0-132-generic #148-Ubuntu SMP Mon Oct 17 16:02:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux) across a range of systems. So far it seems the systems that were updated earlier to 5.4.0-131-generic are not affected.

The diff for this kernel (here) lists these interesting changes related to the netclass collector:

    - net: Fix data-races around weight_p and dev_weight_[rt]x_bias.
    - net: Fix data-races around netdev_tstamp_prequeue.
    - net: Fix a data-race around sysctl_tstamp_allow_data.
    - net: Fix a data-race around sysctl_net_busy_poll.
    - net: Fix a data-race around sysctl_net_busy_read.
    - net: Fix a data-race around netdev_budget.
    - net: Fix a data-race around netdev_budget_usecs.
    - net: Fix a data-race around sysctl_somaxconn.

I hope this helps to pinpoint the issue.

When reproducing, it’s definitely worth making sure your test node has several CPUs in order to make Go run more work concurrently across threads.

We've hit this on multiple CentOS 7 VMs running 5.4.217-1.el7.elrepo.x86_64, 5.4.219-1.el7.elrepo.x86_64 and 5.4.224-1.el7.elrepo.x86_64.

We were able to repro using @paragor’s script and found that sending a SIGSTOP and SIGCONT unfreezes the program. This reminded us of https://github.com/Mirantis/cri-dockerd/issues/124 which also got better after sending SIGSTOP/SIGCONT.
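
For anyone wanting to try the same trick, a minimal sketch (process name assumed to be node_exporter; adjust as needed):

# Freeze and immediately resume the hung process; this kicks it out of the stuck wait
sudo kill -STOP $(pgrep -x node_exporter)
sudo kill -CONT $(pgrep -x node_exporter)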

We also found that GOMAXPROCS=1 stopped the problem.

Some brief initial investigation using strace on each thread (with GOMAXPROCS=2) showed that the program seems to lock up when one thread gets stuck in an epoll_pwait() while the other threads are in a futex(); the epoll didn't look like it had anything to do with sysfs, just networking. We didn't get much further with this line of enquiry and plan to use the GOMAXPROCS=1 workaround until this is resolved.

We encountered the same problem on Ubuntu kernel 5.4.0-132-generic. A workaround is to disable the netclass collector in node_exporter.service by adding:

--no-collector.netclass

This did not resolve the issue for me.

Node-exporter version 1.3.1

Kernel 5.4.0-132: it also reproduces there.

Also seeing this on Ubuntu Bionic’s hwe edge kernel 5.4.0-132.

Changelog is here: https://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_5.4.0-132.148/changelog

I’ve noticed that when it blocks, as soon as I strace -fp the node exporter process, everything seems to unblock.

Even on slow scrapes, I didn’t spot any collectors reporting a high scrape time, but I’ll keep digging.

Same here, the 5.4.0-135 kernel has fixed this issue for my nodes with Ubuntu 18.04.

Same for 5.4.0-1082-kvm kernel across about 40 systems currently.

The 5.4.0-135 kernel seems to have fixed this issue for me. More than 24 hours now without issue.

Same issue on Ubuntu 20.04.5, Raspberry Pi 4. Kernel: 5.4.0-1074-raspi, systemd: 245.4-4ubuntu3.18, node-exporter: 0.18.1+ds, Go version: go1.13.8.

Possibly seeing this on Oracle Linux as well; not sure whether its kernel version relates to the ones mentioned above.

I can confirm, we see the same problem with the 5.4.17-2136.312.3.4.el8uek.x86_64 kernel on every new VM. And this is not only about node-exporter: some of our services also stopped processing requests on the 5.4.17-2136.312.3.4.el8uek.x86_64 kernel. So we have to downgrade the kernel; for example, there is no such problem with 5.4.17-2136.310.7.1.el8uek.x86_64 and 4.18.0-372.26.1.0.1.el8_6.x86_64.

It's interesting: after the node-exporter process gets stuck, I can connect to it with strace -f -p $PID and it starts working again (I suppose because of the restart_syscall syscall).

PS: The new OL kernel 5.4.17-2136.313.6.el8uek.x86_64 has been released; the problem is still there.

It seems that there are plenty of issues with kernel 5.4.214-120 and higher. For those of you who use EKS and Amazon AMIs: they decided to pin to 5.4.209-116.367 because of some instability until they figure out what exactly the problem is.

https://github.com/awslabs/amazon-eks-ami/pull/1072

We have exactly the same issue. After upgrading to 5.4.214-120 we suffer many random node-exporter restarts in Kubernetes (Amazon Linux 2). Nothing suspicious in logs/metrics; we just see the following messages in the pod description:

Readiness probe failed: Get "http://XXXX:9100/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Liveness probe failed: Get "http://XXXX:9100/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

It worked without any problems on kernel 5.4.209-116.
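
For Kubernetes deployments, a hedged sketch of applying the GOMAXPROCS=1 workaround discussed above (the DaemonSet name and namespace here are placeholders for your own):

# Inject the workaround into the node-exporter DaemonSet and wait for the rollout
kubectl -n monitoring set env daemonset/node-exporter GOMAXPROCS=1
kubectl -n monitoring rollout status daemonset/node-exporter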