windows_exporter: CPU collector blocks every ~17 minutes on call to wmi.Query
We’ve deployed the wmi_exporter on a number of our Windows systems running Server 2012 R2 Standard. This was a version built several months ago with Go 1.8.
These systems typically push 20-30 Gbps of traffic and have 24 hyperthreads. There is generally plenty of room on half of these threads.
Having deployed the exporter, we observed that every 17-18 minutes (independent of when the exporter was started), the traffic being served by them would drop significantly for roughly 120 seconds and CPU consumption on one or two cores would spike. The “WMI Provider Host” would also increase in CPU utilisation during this time.
Killing the exporter did not cause the load to drop off immediately.
We also noticed that calls to the exporter would block during this time.
We set GOMAXPROCS to 2, to see if there were any concurrency issues that might be overloading WMI, but that didn’t improve matters.
We built a new version from master on Friday June 30th with Go 1.8.1. The issue persisted.
We started to disable collectors to isolate where the problem was occurring and found that it only happened when the CPU collector was enabled. We put some printfs around the calls inside the collect function and found that it was blocking on the call to wmi.Query.
Interestingly, while watching the same counters in perfmon, we did not see any pauses or breaks in the data while the query was running.
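For anyone who wants to reproduce this outside the exporter, here is a minimal standalone sketch that times the same kind of call the collector makes, using github.com/StackExchange/wmi (the WMI library the exporter is built on). This is illustrative only, not the exporter’s actual collector code, and the struct below only declares a couple of counters rather than the full set the real collector reads:

```go
package main

import (
	"log"
	"time"

	"github.com/StackExchange/wmi"
)

// Only a couple of counters from the perf class the CPU collector queries;
// the real collector declares many more fields.
type Win32_PerfRawData_PerfOS_Processor struct {
	Name                 string
	PercentProcessorTime uint64
}

func main() {
	for {
		var dst []Win32_PerfRawData_PerfOS_Processor
		// CreateQuery builds "SELECT <fields> FROM Win32_PerfRawData_PerfOS_Processor"
		// from the struct definition above.
		q := wmi.CreateQuery(&dst, "")

		start := time.Now()
		if err := wmi.Query(q, &dst); err != nil {
			log.Printf("query failed: %v", err)
		} else {
			log.Printf("got %d instances in %v", len(dst), time.Since(start))
		}
		time.Sleep(60 * time.Second)
	}
}
```

Normally each iteration completes in tens of milliseconds; during the bad periods it is the wmi.Query call itself that stalls, matching what we saw with the printfs.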
We dialled testing back to a 2 CPU VM running Windows 10 not doing anything particularly difficult or interesting and polling using “curl $blah ; sleep 60” in a loop.
The results suggest that something is going wrong every 17-18 minutes (the lines with *** were added by me):
time="2017-07-03T10:31:38-07:00" level=debug msg="OK: cpu collector succeeded after 0.032996s." source="exporter.go:90"
2017/07/03 10:32:38 *** Starting CPU Collector run...
2017/07/03 10:32:38 *** Created WMI Query for CPU...
2017/07/03 10:32:40 *** Ran WMI Query for CPU...
2017/07/03 10:32:40 *** Sending data for 0 to exporter...
2017/07/03 10:32:40 *** Sending data for 1 to exporter...
time="2017-07-03T10:32:40-07:00" level=debug msg="OK: cpu collector succeeded after 2.324834s." source="exporter.go:90"
2017/07/03 10:33:40 *** Starting CPU Collector run...
2017/07/03 10:33:40 *** Created WMI Query for CPU...
2017/07/03 10:33:40 *** Ran WMI Query for CPU...
2017/07/03 10:33:40 *** Sending data for 0 to exporter...
2017/07/03 10:33:40 *** Sending data for 1 to exporter...
time="2017-07-03T10:33:40-07:00" level=debug msg="OK: cpu collector succeeded after 0.046001s." source="exporter.go:90"
2017/07/03 10:34:40 *** Starting CPU Collector run...
2017/07/03 10:34:40 *** Created WMI Query for CPU...
2017/07/03 10:34:40 *** Ran WMI Query for CPU...
2017/07/03 10:34:40 *** Sending data for 0 to exporter...
2017/07/03 10:34:40 *** Sending data for 1 to exporter...
< snip a few more ~0.04 second runs >
time="2017-07-03T10:46:41-07:00" level=debug msg="OK: cpu collector succeeded after 0.044845s." source="exporter.go:90"
2017/07/03 10:47:41 *** Starting CPU Collector run...
2017/07/03 10:47:41 *** Created WMI Query for CPU...
2017/07/03 10:47:41 *** Ran WMI Query for CPU...
2017/07/03 10:47:41 *** Sending data for 0 to exporter...
2017/07/03 10:47:41 *** Sending data for 1 to exporter...
time="2017-07-03T10:47:41-07:00" level=debug msg="OK: cpu collector succeeded after 0.038003s." source="exporter.go:90"
2017/07/03 10:48:41 *** Starting CPU Collector run...
2017/07/03 10:50:05 *** Created WMI Query for CPU...
2017/07/03 10:50:08 *** Ran WMI Query for CPU...
2017/07/03 10:50:08 *** Sending data for 0 to exporter...
2017/07/03 10:50:08 *** Sending data for 1 to exporter...
time="2017-07-03T10:50:08-07:00" level=debug msg="OK: cpu collector succeeded after 86.787446s." source="exporter.go:90"
2017/07/03 10:51:08 *** Starting CPU Collector run...
2017/07/03 10:51:08 *** Created WMI Query for CPU...
2017/07/03 10:51:08 *** Ran WMI Query for CPU...
2017/07/03 10:51:08 *** Sending data for 0 to exporter...
2017/07/03 10:51:08 *** Sending data for 1 to exporter...
time="2017-07-03T10:51:08-07:00" level=debug msg="OK: cpu collector succeeded after 0.034084s." source="exporter.go:90"
I tested again using official version v0.2.5 and again observed an unexpected increase in gathering time ~17 minutes after the last spike.
time="2017-07-03T11:01:03-07:00" level=info msg="Starting server on :9182" source="exporter.go:206"
time="2017-07-03T11:01:04-07:00" level=debug msg="OK: cpu collector succeeded after 0.186394s." source="exporter.go:90"
time="2017-07-03T11:02:04-07:00" level=debug msg="OK: cpu collector succeeded after 0.022045s." source="exporter.go:90"
time="2017-07-03T11:03:04-07:00" level=debug msg="OK: cpu collector succeeded after 0.025063s." source="exporter.go:90"
time="2017-07-03T11:04:05-07:00" level=debug msg="OK: cpu collector succeeded after 0.031924s." source="exporter.go:90"
time="2017-07-03T11:05:05-07:00" level=debug msg="OK: cpu collector succeeded after 0.027066s." source="exporter.go:90"
time="2017-07-03T11:06:07-07:00" level=debug msg="OK: cpu collector succeeded after 0.036162s." source="exporter.go:90"
time="2017-07-03T11:07:10-07:00" level=debug msg="OK: cpu collector succeeded after 3.061992s." source="exporter.go:90"
time="2017-07-03T11:08:10-07:00" level=debug msg="OK: cpu collector succeeded after 0.036285s." source="exporter.go:90"
time="2017-07-03T11:09:10-07:00" level=debug msg="OK: cpu collector succeeded after 0.036368s." source="exporter.go:90"
time="2017-07-03T11:10:11-07:00" level=debug msg="OK: cpu collector succeeded after 0.187498s." source="exporter.go:90"
time="2017-07-03T11:11:07-07:00" level=debug msg="OK: cpu collector succeeded after 0.031065s." source="exporter.go:90"
time="2017-07-03T11:12:07-07:00" level=debug msg="OK: cpu collector succeeded after 0.030048s." source="exporter.go:90"
time="2017-07-03T11:13:07-07:00" level=debug msg="OK: cpu collector succeeded after 0.030922s." source="exporter.go:90"
time="2017-07-03T11:14:08-07:00" level=debug msg="OK: cpu collector succeeded after 0.030062s." source="exporter.go:90"
time="2017-07-03T11:15:08-07:00" level=debug msg="OK: cpu collector succeeded after 0.024992s." source="exporter.go:90"
time="2017-07-03T11:16:08-07:00" level=debug msg="OK: cpu collector succeeded after 0.030092s." source="exporter.go:90"
time="2017-07-03T11:17:08-07:00" level=debug msg="OK: cpu collector succeeded after 0.030072s." source="exporter.go:90"
time="2017-07-03T11:18:08-07:00" level=debug msg="OK: cpu collector succeeded after 0.029085s." source="exporter.go:90"
time="2017-07-03T11:19:08-07:00" level=debug msg="OK: cpu collector succeeded after 0.031077s." source="exporter.go:90"
time="2017-07-03T11:20:08-07:00" level=debug msg="OK: cpu collector succeeded after 0.031975s." source="exporter.go:90"
time="2017-07-03T11:21:08-07:00" level=debug msg="OK: cpu collector succeeded after 0.034084s." source="exporter.go:90"
time="2017-07-03T11:22:08-07:00" level=debug msg="OK: cpu collector succeeded after 0.031888s." source="exporter.go:90"
time="2017-07-03T11:23:11-07:00" level=debug msg="OK: cpu collector succeeded after 2.759751s." source="exporter.go:90"
time="2017-07-03T11:24:11-07:00" level=debug msg="OK: cpu collector succeeded after 0.036913s." source="exporter.go:90"
I don’t understand WMI well enough to know where to start digging, but it would be useful to know whether anyone else observes this.
If there’s other data that would be useful to gather, please let me know.
After some profiling, it looks like the WMI Provider Host regularly refreshes some sort of class name cache by calling EnumCounterObjects, which in turn reads a massive number of registry keys, in line with what @higels discovered. I haven’t yet been able to figure out what triggers it. It appears that EnumCounterObjects is not supposed to do this, and there’s a fix in KB4038774 (and presumably the following non-security rollup, even though there’s no further mention of it).
I see no mention of Windows Server 2008 being affected.
However, after installing the KB in question, I’m still getting timeouts.
I also did some experimenting with these flags, including wbemFlagReturnImmediately, and with your build as well. Sadly it does not make a difference for my setup 😦
Here’s what I’ve been working on: https://github.com/leoluk/perflib_exporter
It includes both a library for the perflib interface and a simple exporter. Godoc: https://godoc.org/github.com/leoluk/perflib_exporter/perflib
Performance has been great on my staging systems with most scrapes completing within a few dozen milliseconds, including full service and process list. Once the remaining tasks are completed and the interface is stable, it will be used in a large Windows production setup.
Feedback would be highly appreciated, especially about the perflib API - I designed it such that it could be used by projects like wmi_exporter.
Perfmon uses the PDH library, not WMI. I did not test with Perfmon, but PDH is not affected.
Here’s a quick summary of the relationship: https://msdn.microsoft.com/en-us/library/windows/desktop/aa371643(v=vs.85).aspx
I’ve started to work on a Go library for the HKEY_PERFORMANCE_DATA interface. You basically do one RegQueryValueEx call and get a 300k binary blob in return that you’ll have to parse yourself, but it has all the metrics in it and only takes a few ms to complete. Preliminary testing looks great so far.
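For a rough idea of what that call looks like in Go, here is a minimal sketch using golang.org/x/sys/windows/registry. This is illustrative, not the actual library code; the starting buffer size and the "Global" query string are just examples:

```go
package main

import (
	"fmt"
	"syscall"

	"golang.org/x/sys/windows/registry"
)

// queryPerfData reads the raw perflib blob from HKEY_PERFORMANCE_DATA,
// growing the buffer until RegQueryValueEx stops reporting ERROR_MORE_DATA.
// query is either "Global" or a space-separated list of object indices.
func queryPerfData(query string) ([]byte, error) {
	buf := make([]byte, 256*1024) // roughly the blob size mentioned above
	for {
		n, _, err := registry.PERFORMANCE_DATA.GetValue(query, buf)
		if err == syscall.ERROR_MORE_DATA {
			buf = make([]byte, 2*len(buf))
			continue
		}
		if err != nil {
			return nil, err
		}
		return buf[:n], nil
	}
}

func main() {
	data, err := queryPerfData("Global")
	if err != nil {
		panic(err)
	}
	// The blob starts with a PERF_DATA_BLOCK header followed by the object,
	// counter and instance structures; parsing those is the library's job.
	fmt.Printf("read %d bytes of performance data\n", len(data))
}
```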
We probably should re-open this, since it is still quite actively discussed.
My current assumption is that the bug is in Windows, not wmi_exporter. I can reproduce it by querying WMI through PowerShell and the hang is system-wide.
To me it appears that querying performance data via WMI at short intervals is fundamentally broken. About every 1,000 seconds, there’s a spike in query latency while WMI updates its list of perflib providers.
The KB lessens the impact by making it less expensive, but does not fully resolve the issue (it didn’t help in my case). The refresh would have to happen asynchronously in a way that does not affect queries.
The reason this affects some users but not others is that the duration depends on the hardware configuration. On a tiny VM it took less than 6 seconds, much longer on beefy bare-metal hosts. Depending on scrape interval and timeout, it makes sense that it would only lead to issues in some configurations. Apparently, it takes longer the more RAM a host has: stack profiling shows that the most expensive call is CollectSysProcessObjectData, which in turn enumerates memory.
Even if you do not get timeouts, you’ll probably see the spikes in the wmi_exporter_collector_duration_seconds metric. In my graphs (red = VM, green = bare metal), one scrape for the bare-metal host fails entirely, so its actual value is even higher.
To me, this implies that in the presence of this bug, WMI is unsuitable for high-resolution performance monitoring. Even if it does not lead to timeouts, it will slightly decrease accuracy.
This issue also explains a long-standing problem I’ve had with WMI blocking the Check_MK Windows agent, which reported false CPU load spikes on hosts with more than 500 GB of RAM, with three-minute check intervals, mind you.
OP here… I completely forgot to update this issue.
We randomly found out what made the scrapes jam up so badly in our particular use case. We had left a virtual media ISO mounted through the iDRAC (Dell’s BMC thing). As soon as we unmounted it, the impact of the 17-minute stall went from many tens of seconds down to a more normal time.
So, if you are running the exporter on Dell hardware with an iDRAC and have a virtual media ISO mounted, then just make sure you unmount the ISO when you aren’t using it.
Yes, it’s fixed in 2016 (and has been for a while) but not 2012.
In case you’re referring to perflib_exporter, there’s a test version on the releases page: https://github.com/leoluk/perflib_exporter/releases
I’m actively working on getting it production-ready.
It’s ready for testing. It has been running in a staging environment for the past four months without missing scrapes or misbehaving, so I consider the perflib library stable.
However, there are a few edge cases in the name mangling code left to fix which will change the output, so that part isn’t stable yet. You can expect a beta release within the next few weeks.
If anyone wants to contribute, send me an email!
@breed808 thanks a lot for looking into this issue so quickly.
If switching to PerfLib is not possible, maybe there is potential for optimizing the query? I suspect that not querying all fields/details of the Win32_LogonSession WMI class (https://github.com/prometheus-community/windows_exporter/blob/master/collector/logon.go#L54) might make a big difference, as sketched below.
Maybe there is even another metric that just gives the number of logins, without all those expensive details being fetched?
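For illustration, a narrowed query with github.com/StackExchange/wmi could look something like this; the struct and the aggregation here are hypothetical, not the collector’s actual code:

```go
package main

import (
	"fmt"

	"github.com/StackExchange/wmi"
)

// Declare only the property we actually need, instead of mapping every
// field of Win32_LogonSession.
type logonSession struct {
	LogonType uint32
}

func main() {
	var sessions []logonSession
	// Explicit SELECT list, so WMI does not have to return every property
	// of every session.
	q := "SELECT LogonType FROM Win32_LogonSession"
	if err := wmi.Query(q, &sessions); err != nil {
		panic(err)
	}

	// A simple count per logon type, which is roughly what the logon
	// collector exposes.
	counts := make(map[uint32]int)
	for _, s := range sessions {
		counts[s.LogonType]++
	}
	fmt.Println(counts)
}
```

Whether this actually helps depends on where the time is spent; if the cost is in enumerating the sessions themselves rather than marshalling their properties, trimming the field list may not change much.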
From a quick look there doesn’t appear to be an equivalent perflib counter for Win32_LogonSession 😞 At the minimum I think it’s worth documenting the potential performance impact of the collector.
@carlpett Yes, unfortunately.
I was just hunting down an issue like this on version 0.14, and it seems to be caused by the logon collector. Once I took that out of the list, the duration of a scrape dropped from ~4.5 seconds to < 0.5 seconds (likely because the machine in question was running Exchange with a lot of connections/sessions).
In addition, there seem to be no more spikes happening without the logon collector so far. I configured two identical machines, one with the logon collector and one without, to see if this theory holds up.
This issue has been solved for me 👍
I’m seeing this on Windows 2016. Does anyone know which Windows update solved this issue?
I can confirm it would be very useful not to have those blank phases. Hosts are wrongly reported as down in our case (after 2 minutes of failed scraping).
Well, you are at least the only one observant enough to notice it 😃 One thing that would be interesting to test is whether this happens with other WMI clients. Could you try running
while($true) { Measure-Command { Get-WMIObject Win32_PerfRawData_PerfOS_Processor }; sleep 30 }
in a PowerShell prompt and see if that also seems to hang every 17 minutes? If it does, does this also happen if you use perfmon?