node_exporter: Metrics stuck / deadlock on some kernel versions
Host operating system: output of uname -a
Linux host 5.4.211-1.el8.elrepo.x86_64 #1 SMP Wed Aug 24 08:50:28 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
node_exporter version: output of node_exporter --version
node_exporter, version 1.4.0 (branch: HEAD, revision: 7da1321761b3b8dfc9e496e1a60e6a476fec6018) build user: root@83d90983e89c build date: 20220926-12:32:56 go version: go1.19.1 platform: linux/amd64
node_exporter command line flags
--no-collector.cpufreq --no-collector.systemd --collector.sockstat --no-collector.logind --collector.textfile.directory=/var/lib/prometheus
Are you running node_exporter in Docker?
No
What did you do that produced an error?
The exporter gets stuck after a random interval, anywhere from 2 to 24 hours after start.
What did you expect to see?
node_exporter continuing to serve metrics without getting stuck.
What did you see instead?
node_exporter gets stuck after a random amount of time.
After running strace -f -p $(pgrep node_exporter), node_exporter becomes unstuck again.
pprof output:
debug1.txt
debug2.txt
Response time > 1 minute.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 8
- Comments: 69 (12 by maintainers)
Commits related to this issue
- Default GOMAXPROCS to 1 Avoid running on all CPUs by limiting the Go runtime to one CPU by default. Avoids having Go routines schedule on every CPU, driving up the visible run queue length on high CP... — committed to prometheus/node_exporter by SuperQ 3 years ago
- Default GOMAXPROCS to 1 (#2530) Avoid running on all CPUs by limiting the Go runtime to one CPU by default. Avoids having Go routines schedule on every CPU, driving up the visible run queue length ... — committed to prometheus/node_exporter by SuperQ 2 years ago
- Release v1.5.0 NOTE: This changes the Go runtime "GOMAXPROCS" to 1. This is done to limit the concurrency of the exporter to 1 CPU thread at a time in order to avoid a race condition problem in t... — committed to prometheus/node_exporter by SuperQ 2 years ago
For anyone needing to temporarily work around this issue: make a change to your systemd unit file (usually something like /etc/systemd/system/node_exporter.service or similar) under the [Service] section, like the sketch below.
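The snippet itself did not survive this export; given the workaround discussed throughout the thread, it is presumably the GOMAXPROCS=1 environment variable, roughly:

```ini
[Service]
# Presumed workaround: limit the Go runtime to a single CPU.
Environment="GOMAXPROCS=1"
```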
Afterwards run systemctl daemon-reload and restart the service (systemctl restart node_exporter), and you should be good to go until a permanent fix is ready 😃
Reproduce script:
main.go
go.mod
Just run the script; after ~1 minute it gets stuck.
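The attached main.go and go.mod are not reproduced in this export. Below is only a rough sketch, based on later comments in the thread, of the kind of reproducer involved: concurrent netclass-style sysfs reads combined with HTTP traffic that keeps the Go netpoller (epoll) busy. The port and request pattern are illustrative assumptions, not the original script.

```go
// Not the original reproducer: a rough sketch of the pattern described in this
// thread. On affected 5.4 kernels the process eventually wedges; GOMAXPROCS=1
// or attaching strace makes it move again.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// readNetclass walks /sys/class/net and reads every attribute file,
// similar to what the netclass collector does on each scrape.
func readNetclass() {
	devs, _ := os.ReadDir("/sys/class/net")
	for _, d := range devs {
		files, _ := os.ReadDir(filepath.Join("/sys/class/net", d.Name()))
		for _, f := range files {
			// Reading these attributes is where affected kernels can block.
			_, _ = os.ReadFile(filepath.Join("/sys/class/net", d.Name(), f.Name()))
		}
	}
}

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		readNetclass()
		fmt.Fprintln(w, "ok")
	})
	go func() {
		if err := http.ListenAndServe("127.0.0.1:9101", nil); err != nil {
			panic(err)
		}
	}()

	// Hammer the endpoint from several goroutines (stand-in for the vegeta load test).
	for i := 0; i < 8; i++ {
		go func() {
			for {
				resp, err := http.Get("http://127.0.0.1:9101/metrics")
				if err == nil {
					io.Copy(io.Discard, resp.Body)
					resp.Body.Close()
				}
			}
		}()
	}
	select {} // run until killed (or until it wedges)
}
```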
I am currently in the process of bisecting the Ubuntu kernel to isolate the relevant change so that an upstream bug can be opened. Thanks @paragor for your reproducer, that makes it a lot easier…
P.S.: cadvisor is having exactly the same problem
I have the same issue on Ubuntu after updating the kernel to 5.4.0-132-generic.
I have published node_exporter v1.5.0 which now defaults GOMAXPROCS to 1.
It also has a new collector feature, --collector.netclass.netlink, that enables gathering metrics via netlink rather than sysfs. This is much more efficient, but lacks a couple of metrics that are available in sysfs (node_network_address_assign_type, node_network_device_id, and node_network_name_assign_type). This new feature is not enabled by default, so you will need to include it in your deployment.
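Enabling it is opt-in via the flag named above, e.g.:

```sh
# Opt in to the netlink-based netclass collector (not enabled by default in v1.5.0):
node_exporter --collector.netclass.netlink
```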
Ubuntu just released a new kernel package, linux: 5.4.0-135.152, that supposedly fixes this issue. See https://bugs.launchpad.net/ubuntu/+source/containerd/+bug/1996678/comments/28
Thanks all, it seems like the upstream kernel issues have been identified and fixed. I’m going to call this one solved.
The workaround is to add a GOMAXPROCS=1 environment variable, so from the CLI you should run docker run -e GOMAXPROCS=1 -ti prom/node-exporter (with whatever other parameters you were using before). If you use docker-compose, you have to add something like the snippet below.
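The compose snippet itself is not preserved in this export; a minimal sketch of what is described (the service name and image tag are assumptions):

```yaml
services:
  node-exporter:
    image: prom/node-exporter
    environment:
      - GOMAXPROCS=1
```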
I can confirm the same issue; even a curl running against localhost:9100 hangs until the exporter is restarted. After some undefined time the exporter responds again.
I noticed this first after a recent kernel update, but I'm not able to say with which exact kernel version the issue first came up. I can also confirm that this happened both with a quite old node_exporter version and with the latest 1.4.0.
After trying to find out more about the issue using log.level=debug, I can say that the log didn't show any helpful information.
The problem has not gone away 100%, but it has become noticeably less frequent.
AL2 has removed the "epoll: autoremove wakers even more aggressively" patch from the 5.4 branch; the removal was first released as 5.4.219-126.411, which is why you've been unable to reproduce the issue on the AL2 5.4.219-126.411 kernel.
Can't reproduce the bug with the commit reverted (Vagrant box w/ 10 vCPUs on a 20 vCPU host). Documented this finding in the Ubuntu bug tracker at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1990190
Edit: I also posted the issue to the linux-fsdevel mailing list.
If you’re on 20.04, you can use linux-generic-hwe-20.04-edge which is based off 5.15 at present. This doesn’t display the behavior observed in 5.4 (yet, at least…).
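For example, assuming a stock Ubuntu 20.04 apt setup:

```sh
sudo apt install linux-generic-hwe-20.04-edge
```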
Given that GOMAXPROCS=1 seems to help, that strace-ing the process and its threads seems to unblock it, and that when I do so I see a bunch of futex- and epoll-related activity, the following change looked interesting.
I haven’t had time to investigate more because this reminded me that I need to reinstall and upgrade all of my 18.04 hosts.
@paragor Thanks for the test case. I made a slightly modified version.
One experiment I tried was setting GOMAXPROCS=1. This seemed to eliminate the issue. However, doing this in node_exporter itself would make it slower, as it would have to serialize a lot of the sysfs file access.
I’m going to experiment with https://github.com/prometheus/node_exporter/pull/2492 today to see if switching from sysfs to netlink for this data would be a better solution.
Same here. Does anyone have a solution that doesn't require downgrading the kernel?
In my case, the problem started after an upgrade from 5.4.0-1088-aws to 5.4.0-1089-aws
FYI, I'm using node_exporter within a Docker container on EC2s running Ubuntu 18.04.1. The same erratic problems also started in cadvisor after this upgrade.
Reproduced it locally on Ubuntu (Vagrant box https://app.vagrantup.com/bento/boxes/ubuntu-20.04/versions/202206.03.0 with the affected kernel from https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.4.214/amd64/).
Command:
You also need to start a load test via vegeta; the bug shows up after ~1-2 minutes (a rough reconstruction of the setup is sketched below).
I can't reproduce it without the netclass collector (node_exporter --web.max-requests=0 --no-collector.netclass ran clean in a 20-minute test), so I think the bug is near the netclass collector.
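Hypothetical reconstruction of the setup described above; only --web.max-requests=0 and the use of vegeta are taken from the comment, the rate and duration are illustrative guesses.

```sh
node_exporter --web.max-requests=0 &   # netclass collector is enabled by default
echo "GET http://localhost:9100/metrics" | vegeta attack -rate=50 -duration=2m | vegeta report
```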
Also stable with 5.4.0-1092-aws (ubuntu 18 EC2). 10+ instances running for 24 hours without issues.
Sharing our poor man's docker-compose solution for restarts until this issue is resolved.
Note: change to http if you don't do TLS.
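The compose file itself was not preserved in this export. One way to approximate the health-check-driven restart described here is sketched below; the willfarrell/autoheal companion container and all names are assumptions, not something the commenter specified.

```yaml
services:
  node-exporter:
    image: prom/node-exporter
    restart: unless-stopped
    labels:
      - autoheal=true   # picked up by the autoheal container below (assumption)
    healthcheck:
      # change to http:// if you don't do tls; assumes wget is available in the image
      test: ["CMD", "wget", "--no-check-certificate", "-qO-", "https://localhost:9100/metrics"]
      interval: 30s
      timeout: 60s
      retries: 3

  autoheal:
    image: willfarrell/autoheal
    restart: unless-stopped
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```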
Yes, I’m pretty sure the observed behavior is a kernel regression. Something in reading sysfs is causing deadlocks.
Can also confirm: after updating my Ubuntu 18 LTS to 5.4.0-132 and rebooting, node_exporter takes more than a minute to export metrics (~100+ s).
A restart of the container fixes it temporarily.
This suddenly started to happen on some of the systems I administer as well.
The only relevant update we performed was a kernel upgrade on plain Ubuntu 20.04 today: 5.4.0-128-generic -> 5.4.0-132-generic (5.4.0-132-generic #148-Ubuntu SMP Mon Oct 17 16:02:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux) across a range of systems. So far it seems the systems updated to 5.4.0-131-generic earlier are not affected.
The diff for this kernel (here) lists these interesting changes related to the netclass collector:
I hope this helps to pinpoint the issue.
When reproducing, it’s definitely worth making sure your test node has several CPUs in order to make Go run more work concurrently across threads.
As my second bisect series also turned out without a meaningful result, I have taken Ubuntu-5.4.0-132.148 with that commit reverted and will run @paragor's script against it. Let's see if we got the culprit…
We've hit this on multiple CentOS 7 VMs running 5.4.217-1.el7.elrepo.x86_64, 5.4.219-1.el7.elrepo.x86_64 and 5.4.224-1.el7.elrepo.x86_64.
We were able to repro using @paragor’s script and found that sending a SIGSTOP and SIGCONT unfreezes the program. This reminded us of https://github.com/Mirantis/cri-dockerd/issues/124 which also got better after sending SIGSTOP/SIGCONT.
We also found that GOMAXPROCS=1 stopped the problem.
Some brief initial investigation using strace on each thread (with GOMAXPROCS=2) showed that the program seems to lock up when one thread gets stuck in an epoll_pwait() while the other threads are in a futex(); the epoll didn't look like it had anything to do with sysfs, just networking stuff. We didn't get much further with this line of enquiry and plan to use the GOMAXPROCS=1 workaround until this is resolved.
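A rough sketch of how the per-thread strace above might be run (the exact commands aren't preserved here); note that, as reported elsewhere in this thread, attaching strace tends to unstick the exporter, so this is as much a workaround as a diagnostic.

```sh
pid=$(pgrep -o node_exporter)
for task in /proc/"$pid"/task/*; do
  tid=$(basename "$task")
  strace -tt -p "$tid" -o "strace.$tid.log" &   # one strace per thread
done
wait
```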
This did not resolve the issue for me.
Node-exporter version 1.3.1, kernel 5.4.0-132: it still reproduces.
Also seeing this on Ubuntu Bionic’s hwe edge kernel 5.4.0-132.
Changelog is here: https://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_5.4.0-132.148/changelog
I've noticed that when it blocks, as soon as I strace -fp the node_exporter process, everything seems to unblock.
Even on slow scrapes, I didn't spot any collectors reporting a high scrape time, but I'll keep digging.
Same here, the 5.4.0-135 kernel has fixed this issue for my nodes with Ubuntu 18.04.
Same for 5.4.0-1082-kvm kernel across about 40 systems currently.
The 5.4.0-135 kernel seems to have fixed this issue for me. More than 24 hours now without issue.
Same issue on Ubuntu 20.04.5, Raspberry Pi 4. Kernel: 5.4.0-1074-raspi, systemd: 245.4-4ubuntu3.18, node-exporter: 0.18.1+ds, Go version: go1.13.8.
I can confirm it; we see the same problem with the 5.4.17-2136.312.3.4.el8uek.x86_64 kernel on every new VM. And this is not only about node-exporter: some of our services also stopped processing requests with the 5.4.17-2136.312.3.4.el8uek.x86_64 kernel. So we have to downgrade the kernel; for example, there is no such problem with 5.4.17-2136.310.7.1.el8uek.x86_64 and 4.18.0-372.26.1.0.1.el8_6.x86_64.
It's interesting: after the node-exporter process gets stuck, I can connect to it with strace -f -p $PID and it starts working again (I suppose because of the restart_syscall syscall).
PS. A new OL kernel, 5.4.17-2136.313.6.el8uek.x86_64, has been released; the problem is still there.
It seems that there are plenty of issues with kernel 5.4.214-120 and higher. For those of you who use EKS and Amazon AMIs: they decided to pin to 5.4.209-116.367 because of some instability until they discover what exactly the problem is. https://github.com/awslabs/amazon-eks-ami/pull/1072
We have exactly the same issue. After upgrading to 5.4.214-120 we suffer many random node-exporter restarts in Kubernetes (Amazon Linux 2). Nothing suspicious in logs/metrics; we just see the following message in the pod description:
It worked without any problems on kernel-5.4.209-116.