node_exporter: Various crashes/segfaults on one host
Host operating system: output of uname -a
Linux raider 4.13.7-rt-rt1 #1 SMP PREEMPT RT Mon Nov 6 00:37:13 JST 2017 x86_64 Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz GenuineIntel GNU/Linux
node_exporter version: output of node_exporter --version
Tried both the official binary release:
node_exporter, version 0.15.0 (branch: HEAD, revision: 6e2053c557f96efb63aef3691f15335a70baaffd)
build user: root@168089f37ad9
build date: 20171006-11:33:58
go version: go1.9.1
And the same version, built from source via Gentoo package (from logs):
Starting node_exporter (version=0.15.0, branch=non-git, revision=6e2053c)
Build context (go=go1.9.1, user=portage@raider, date=20171105-15:39:31)
node_exporter command line flags
/usr/bin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/
node_exporter has been crashing on one host (my laptop) after running for hours while being scraped by Prometheus on another host. The failure messages vary, but they all seem to suggest some kind of memory corruption.
Crash 1 (self-built): https://mrcn.st/p/tMtz7sQF
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0xc41ffc7fff pc=0x41439e]
Crash 2 (self-built): https://mrcn.st/p/qmZw6trr
panic: runtime error: slice bounds out of range
Crash 3 (self-built): https://mrcn.st/p/qLYEaOg1
runtime: pointer 0xc4203e2fb0 to unallocated span idx=0x1f1 span.base()=0xc4203dc000 span.limit=0xc4203e6000 span.state=3
runtime: found in object at *(0xc420382a80+0x80)
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)
Crash 4 (official release binary): https://mrcn.st/p/x4NGGxF7
unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x76b998]
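As a quick sanity check on the Crash 3 message, the reported pointer can be compared against the span bounds the runtime printed. The sketch below is illustration only: the helper name `locate_in_span` is hypothetical, and treating the span as the half-open range [span.base(), span.limit) is an assumption about how to read those values.

```python
def locate_in_span(pointer, span_base, span_limit):
    """Return (inside, offset) for a pointer relative to a span.

    Assumption: the span covers the half-open range [span_base, span_limit).
    """
    inside = span_base <= pointer < span_limit
    return inside, pointer - span_base


if __name__ == "__main__":
    # Values copied from the Crash 3 output above.
    inside, offset = locate_in_span(0xC4203E2FB0, 0xC4203DC000, 0xC4203E6000)
    print(f"inside span: {inside}, offset from base: {offset:#x}")
```

With the Crash 3 values, the pointer lies inside the printed span bounds, at offset 0x6fb0 from the base. So the address itself is plausible, and the runtime's complaint is about the state of the span rather than an obviously wild address. That is consistent with, but does not prove, some form of memory corruption.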
I realize this sounds like bad hardware, but this is my daily workstation and it’s otherwise reasonably stable (as stable as one can expect a Gentoo ~arch box with a lot of desktop apps, graphics drivers, etc. to be, anyway). I don’t have reason to suspect the hardware, and this machine gets plenty of stress testing (it’s Gentoo, so lots of compiling). My initial guess is that a wild pointer somewhere is causing the breakage, which manifests itself in various ways. Any idea how to track this down?
About this issue
- State: closed
- Created 7 years ago
- Reactions: 1
- Comments: 36 (11 by maintainers)
I know, it smells like a hardware problem, but node_exporter is the only software with this kind of issue in an otherwise rather active workstation with quite a heterogeneous workload, which suggests otherwise.
Also, I actually own a Geiger counter, and I’m getting a reading of 0.12 µSv/h, which is slightly elevated for my location but entirely within the normal background range. So that’s out too 😃
What a fascinating read!
Eesh, a crash in the golang GC? Time to figure out which joker hid a gamma radiation source in your data center.
OK, I brought up a bunch of parallel instances with a single collector each, plus one control, plus the straced system instance, all being scraped by prom. I’ll leave them running overnight and see which ones die, then test an older release.
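The single-collector setup described above could be generated with something like the following sketch. It is not the exact commands used in this thread. It assumes the per-collector `--collector.<name>` / `--no-collector.<name>` flags and `--web.listen-address` from node_exporter 0.15.0, and the collector list, base port, and function name `single_collector_commands` are hypothetical choices for illustration.

```python
import shlex


def single_collector_commands(collectors, binary="/usr/bin/node_exporter",
                              base_port=9101):
    """Build one argv per collector, each enabling only that collector.

    Assumptions: `collectors` lists every collector that would otherwise be
    enabled, and each instance gets its own port starting at base_port.
    """
    commands = []
    for index, enabled in enumerate(collectors):
        argv = [binary, f"--web.listen-address=:{base_port + index}"]
        for name in collectors:
            # Enable the collector under test, explicitly disable the rest.
            if name == enabled:
                argv.append(f"--collector.{name}")
            else:
                argv.append(f"--no-collector.{name}")
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Hypothetical subset of collectors; a real run would list all defaults.
    for argv in single_collector_commands(["cpu", "meminfo", "textfile"]):
        print(shlex.join(argv))
```

Each printed line is one instance to start. Prometheus then needs one scrape target per port, and whichever instances die point at the collectors worth a closer look.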