node_exporter: Kernel bug skewing node_cpu{mode="steal"} metrics
Host operating system: output of uname -a
Debian 9 Linux ip-10-11-1-110 4.9.0-4-amd64 #1 SMP Debian 4.9.51-1 (2017-09-28) x86_64 GNU/Linux
node_exporter version: output of node_exporter --version
prom/node-exporter:v0.15.0
node_exporter command line flags
/usr/bin/docker run \
-v /proc:/host/proc:ro \
-v /sys:/host/sys:ro \
-v /:/rootfs:ro \
-e HOST_HOSTNAME="/rootfs/etc/hostname" \
-p 19997:9100 \
--pid="host" \
--name %i \
basi/node-exporter:v1.14.0 \
-collector.procfs "/host/proc" \
-collector.sysfs /host/sys \
-collector.textfile.directory /etc/node-exporter/ \
-collectors.enabled 'conntrack,diskstats,entropy,filefd,filesystem,loadavg,mdadm,meminfo,netdev,netstat,stat,textfile,time,vmstat,ipvs,systemd' \
-collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($$|/)"
Are you running node_exporter in Docker?
Yes
What did you do that produced an error?
sum(rate(node_cpu{instance="$instance"}[1m])) by (mode) * 100 / count_scalar(node_cpu{mode="user", instance="$instance"})
What did you expect to see?
Around 100% usage if we aggregate all mode values
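For reference, the arithmetic behind the query can be sketched in Python. This is only an illustration under stated assumptions: the sample values are invented, the helper name `mode_percentages` is hypothetical, and it takes a plain difference between two samples rather than reproducing the extrapolation and counter-reset handling of the real `rate()` function.

```python
from collections import defaultdict

def mode_percentages(before, after, interval_s):
    """Approximate sum(rate(node_cpu[...])) by (mode) * 100 / <number of CPUs>.

    before/after map (cpu, mode) -> cumulative CPU seconds (hypothetical samples).
    """
    cpus = {cpu for cpu, _ in after}
    per_mode = defaultdict(float)
    for (cpu, mode), value in after.items():
        # Per-second increase of this counter, summed over CPUs for each mode.
        per_mode[mode] += (value - before[(cpu, mode)]) / interval_s
    return {mode: rate * 100 / len(cpus) for mode, rate in per_mode.items()}

# Invented samples: two CPUs observed 60 seconds apart.
before = {
    ("cpu0", "idle"): 1000.0, ("cpu0", "user"): 50.0, ("cpu0", "steal"): 7.0,
    ("cpu1", "idle"): 1000.0, ("cpu1", "user"): 40.0, ("cpu1", "steal"): 10.0,
}
after = {
    ("cpu0", "idle"): 1057.0, ("cpu0", "user"): 52.4, ("cpu0", "steal"): 7.6,
    ("cpu1", "idle"): 1058.2, ("cpu1", "user"): 41.2, ("cpu1", "steal"): 10.6,
}

healthy = mode_percentages(before, after, 60)
print(healthy, sum(healthy.values()))

# Hypothetical runaway steal counter on one CPU between the two samples.
broken_after = dict(after)
broken_after[("cpu1", "steal")] = after[("cpu1", "steal")] + 1e9
broken = mode_percentages(before, broken_after, 60)
print(broken["steal"])
```

With well-behaved counters, each CPU accounts for one second of CPU time per wall-clock second, so the per-mode percentages add up to roughly 100. A single counter that jumps by a large amount is enough to push one mode far beyond that.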
What did you see instead?
Some level of steal is expected as this is running on AWS, but I believe the value is out of proportion.
| Element | Value |
|---|---|
| {mode="guest"} | 0 |
| {mode="user"} | 0 |
| {mode="iowait"} | 0.033333333333366175 |
| {mode="system"} | 0 |
| {mode="irq"} | 0 |
| {mode="idle"} | 94.95555555551417 |
| {mode="steal"} | 67456900655.031654 |
| {mode="nice"} | 0 |
| {mode="softirq"} | 0 |
Stats
# cat /proc/stat
cpu 10054 454 10504 1435808 6220 0 64 1802 0 0
cpu0 4970 268 5291 717279 3584 0 8 752 0 0
cpu1 5084 186 5212 718529 2636 0 56 1049 0 0
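To read the dump above, a minimal Python sketch can map the columns to mode names. The column order follows the proc(5) man page; the helper name `parse_cpu_line` is hypothetical. The snapshot may not have been taken at the moment the query result was recorded, so this only shows what the raw counters looked like at one point in time.

```python
# Column order of a "cpu" line in /proc/stat, per proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")

def parse_cpu_line(line):
    """Map one 'cpu...' line from /proc/stat to (name, {mode: ticks})."""
    name, *values = line.split()
    return name, dict(zip(FIELDS, map(int, values)))

# The aggregate line from the dump above.
name, total = parse_cpu_line("cpu 10054 454 10504 1435808 6220 0 64 1802 0 0")

# guest and guest_nice are also counted inside user and nice; both are 0 here,
# so a plain sum does not double count anything in this particular snapshot.
all_ticks = sum(total.values())
steal_share = 100 * total["steal"] / all_ticks
print(f"{name}: steal={total['steal']} ticks, {steal_share:.3f}% of all ticks since boot")
```

In this snapshot, steal is a small fraction of the total ticks since boot, which is in line with the expectation that only a modest amount of steal should appear on this host.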
About this issue
- State: closed
- Created 7 years ago
- Comments: 22 (14 by maintainers)
According to https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest/ this affects guest kernel versions 4.8, 4.9 and 4.10.
It also appears that similar bugs existed earlier in 3.x kernels, but were fixed in 4.x releases before 4.8.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785557;msg=61
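A quick way to check whether a host falls into the kernel series named above is sketched below in Python. The set of affected series is taken from the linked write-up and has not been verified independently; the function names are hypothetical, and distribution kernels may carry backported fixes, so a match only means the host is worth a closer look.

```python
import re

# Guest kernel series reported as affected in the linked write-up
# (an assumption carried over from that article, not verified here).
AFFECTED_SERIES = {(4, 8), (4, 9), (4, 10)}

def kernel_series(release):
    """Extract (major, minor) from a `uname -r` style string such as '4.9.0-4-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))

def possibly_affected(release):
    """True if the release belongs to one of the reported series."""
    return kernel_series(release) in AFFECTED_SERIES

# Release string from the uname output in this report.
print(possibly_affected("4.9.0-4-amd64"))
```

On a live system, the release string could come from `platform.release()` in the standard library instead of a hard-coded value.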
Hi @SuperQ, I don’t think node_exporter should do anything to work around this issue. Since the issue indicates an abnormal situation, catching and reporting it is exactly the expected behavior for node_exporter.
One thing that came up in conversation was the option to split out the steal mode from the normal metric, like we do with guest and guest_nice.
What do people think about this?