telegraf: [inputs.nvidia_smi] Broken power monitoring due to XML schema change
Relevant telegraf.conf
[[inputs.nvidia_smi]]
Logs from Telegraf
No errors in logs (even with `--debug`)
System info
Windows, Telegraf 1.26.0, nvidia-smi from official nvidia drivers 536.40
Docker
No response
Steps to reproduce
1. Activate the nvidia_smi input plugin
2. Verify that metrics exist
3. Notice that power-related metrics are missing
Expected behavior
Power related metrics should be available
Actual behavior
Power metrics are not available
Additional info
This appears to have started 2-3 months ago with an update to the nvidia drivers and nvidia-smi. I don't know the exact driver version that changed this, even after looking at the nvidia docs.
It appears nvidia has changed the schema from v11 to v12. The current output of `nvidia-smi -x -q` shows:
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v12.dtd">
As can be seen from the latest example included in this repo (3 months ago!), the previous version v11 had the following block available:
<power_readings>
<power_state>P0</power_state>
<power_management>Supported</power_management>
<power_draw>26.78 W</power_draw>
<power_limit>70.00 W</power_limit>
<default_power_limit>70.00 W</default_power_limit>
<enforced_power_limit>70.00 W</enforced_power_limit>
<min_power_limit>60.00 W</min_power_limit>
<max_power_limit>70.00 W</max_power_limit>
</power_readings>
This is what the plugin code expects to find, and it has been this way for at least 4 years, in all previous versions of the schema that telegraf has supported, afaik.
The new schema has changed the way it reports this, with two different blocks:
<gpu_power_readings>
<power_state>P8</power_state>
<power_draw>22.78 W</power_draw>
<current_power_limit>336.00 W</current_power_limit>
<requested_power_limit>336.00 W</requested_power_limit>
<default_power_limit>320.00 W</default_power_limit>
<min_power_limit>100.00 W</min_power_limit>
<max_power_limit>336.00 W</max_power_limit>
</gpu_power_readings>
<module_power_readings>
<power_state>P8</power_state>
<power_draw>N/A</power_draw>
<current_power_limit>N/A</current_power_limit>
<requested_power_limit>N/A</requested_power_limit>
<default_power_limit>N/A</default_power_limit>
<min_power_limit>N/A</min_power_limit>
<max_power_limit>N/A</max_power_limit>
</module_power_readings>
It seems the block we want now is gpu_power_readings, and it appears to be mostly the same for the purposes of this plugin (<power_draw> is the same).
This was taken from nvidia-smi -x -q bundled with the nvidia drivers 536.40 on windows 10 for an RTX3080. I have also seen the same thing for a GTX1070 on a W11 machine, and also on linux.
To reiterate, this seems to have happened in the last 2-3 months, for all platforms, as far as I am aware.
I assume this line needs to be adjusted to gpu_power_readings, and everything should work?
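For illustration, here is a minimal Go sketch of reading the power draw from either layout with `encoding/xml`. It is not the plugin's actual code: the type names (`powerBlock`, `gpu`, `smiLog`) and the helpers `parseLog` and `powerDrawWatts` are hypothetical, and only the element names are taken from the XML quoted above.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strconv"
	"strings"
)

// powerBlock holds only fields that appear in both layouts quoted above.
type powerBlock struct {
	PowerState string `xml:"power_state"`
	PowerDraw  string `xml:"power_draw"`
}

// gpu is a trimmed-down, hypothetical view of one <gpu> element.
type gpu struct {
	Legacy powerBlock `xml:"power_readings"`     // schema v11 and earlier
	GPU    powerBlock `xml:"gpu_power_readings"` // schema v12
}

// smiLog maps the document root (<nvidia_smi_log>).
type smiLog struct {
	GPUs []gpu `xml:"gpu"`
}

// parseLog decodes the XML printed by `nvidia-smi -x -q`.
func parseLog(data []byte) (smiLog, error) {
	var log smiLog
	err := xml.Unmarshal(data, &log)
	return log, err
}

// powerDrawWatts prefers the v12 block and falls back to the v11 block.
// It reports false when neither block carries a numeric value (e.g. "N/A").
func powerDrawWatts(g gpu) (float64, bool) {
	for _, raw := range []string{g.GPU.PowerDraw, g.Legacy.PowerDraw} {
		v := strings.TrimSpace(strings.TrimSuffix(strings.TrimSpace(raw), "W"))
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			return f, true
		}
	}
	return 0, false
}

func main() {
	v12 := []byte(`<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v12.dtd">
<nvidia_smi_log><gpu><gpu_power_readings>
<power_state>P8</power_state><power_draw>22.78 W</power_draw>
</gpu_power_readings></gpu></nvidia_smi_log>`)

	log, err := parseLog(v12)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, g := range log.GPUs {
		if w, ok := powerDrawWatts(g); ok {
			fmt.Printf("power draw: %.2f W\n", w)
		}
	}
}
```

Keeping both struct fields means output from older drivers (v11) and newer drivers (v12) would decode with the same types, rather than renaming the tag and dropping the old layout.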
I have attached the full output of nvidia-smi -x -q (pruned the <processes> section for privacy)
smi.log
You can add it to the testdata samples if needed.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 17 (9 by maintainers)
@mbentley file a new issue
Hi,
If you take a look at the PR that fixed this issue, it has a milestone attached to it. That is when this fix will be released. In this case, that is our next minor release, v1.28.0, which is in a few weeks.
If you do that, you break everyone else who is still on older drivers or the older format.
What I would suggest is that we add a new struct for `GPUPowerReadings` and `ModulePowerReadings` so that if either of these is read, it is captured.
Then we can call `setIfUsed` on those fields.
Would you be willing to put up a PR?
Thanks!
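A rough sketch of that suggestion, under stated assumptions: `PowerReadings`, `setWattsIfUsed`, and `powerFields` are hypothetical names, `setWattsIfUsed` only stands in for the plugin's `setIfUsed` (whose real signature is not reproduced here), and the field keys are illustrative. The existing `power_readings` struct would stay in place alongside these so that older drivers keep working.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PowerReadings covers two of the fields shared by <gpu_power_readings>
// and <module_power_readings> in the v12 output quoted in this issue.
type PowerReadings struct {
	PowerDraw         string `xml:"power_draw"`
	CurrentPowerLimit string `xml:"current_power_limit"`
}

// setWattsIfUsed is a hypothetical stand-in for the plugin's setIfUsed.
// It stores a field only when the raw value parses as a number, so empty
// strings (block absent) and "N/A" (unsupported) are skipped.
func setWattsIfUsed(fields map[string]interface{}, key, raw string) {
	v := strings.TrimSpace(strings.TrimSuffix(strings.TrimSpace(raw), "W"))
	f, err := strconv.ParseFloat(v, 64)
	if err != nil {
		return
	}
	fields[key] = f
}

// powerFields collects metrics from the GPU and module blocks.
// The key names, including the "module_" prefix, are illustrative only.
func powerFields(gpu, module PowerReadings) map[string]interface{} {
	fields := make(map[string]interface{})
	setWattsIfUsed(fields, "power_draw", gpu.PowerDraw)
	setWattsIfUsed(fields, "power_limit", gpu.CurrentPowerLimit)
	setWattsIfUsed(fields, "module_power_draw", module.PowerDraw)
	setWattsIfUsed(fields, "module_power_limit", module.CurrentPowerLimit)
	return fields
}

func main() {
	// Values copied from the v12 sample in this issue.
	gpu := PowerReadings{PowerDraw: "22.78 W", CurrentPowerLimit: "336.00 W"}
	module := PowerReadings{PowerDraw: "N/A", CurrentPowerLimit: "N/A"}
	fmt.Println(powerFields(gpu, module))
}
```

With the sample values from this issue, only the GPU block contributes fields, because every module value is `N/A`.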