telegraf: [input.nvidia_smi] Broken power monitoring due to XML schema change

Relevant telegraf.conf

[[inputs.nvidia_smi]]

Logs from Telegraf

No errors in logs (even with `--debug`)

System info

Windows, Telegraf 1.26.0, nvidia-smi from official nvidia drivers 536.40

Docker

No response

Steps to reproduce

1. Activate the nvidia_smi input plugin
2. Verify that metrics exist
3. Notice that power-related metrics are missing

Expected behavior

Power related metrics should be available

Actual behavior

Power metrics are not available

Additional info

This appears to have started 2-3 months ago with an update to the NVIDIA drivers and nvidia-smi. I don't know the exact driver version that introduced the change, even after checking the NVIDIA docs. It appears NVIDIA has changed the schema from v11 to v12; the current output of `nvidia-smi -x -q` shows: `<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v12.dtd">`

As can be seen from the latest example included in this repo (3 months ago!), the previous schema, v11, had the following block available:

        <power_readings>
            <power_state>P0</power_state>
            <power_management>Supported</power_management>
            <power_draw>26.78 W</power_draw>
            <power_limit>70.00 W</power_limit>
            <default_power_limit>70.00 W</default_power_limit>
            <enforced_power_limit>70.00 W</enforced_power_limit>
            <min_power_limit>60.00 W</min_power_limit>
            <max_power_limit>70.00 W</max_power_limit>
        </power_readings>

This is the block the plugin code expects to find, and it has looked this way for at least four years, in all previous schema versions that Telegraf has supported, as far as I know.

The new schema has changed the way it reports this, with two different blocks:

     <gpu_power_readings>
             <power_state>P8</power_state>
             <power_draw>22.78 W</power_draw>
             <current_power_limit>336.00 W</current_power_limit>
             <requested_power_limit>336.00 W</requested_power_limit>
             <default_power_limit>320.00 W</default_power_limit>
             <min_power_limit>100.00 W</min_power_limit>
             <max_power_limit>336.00 W</max_power_limit>
     </gpu_power_readings>
     <module_power_readings>
             <power_state>P8</power_state>
             <power_draw>N/A</power_draw>
             <current_power_limit>N/A</current_power_limit>
             <requested_power_limit>N/A</requested_power_limit>
             <default_power_limit>N/A</default_power_limit>
             <min_power_limit>N/A</min_power_limit>
             <max_power_limit>N/A</max_power_limit>                                  
     </module_power_readings>

It seems the block we want now is `gpu_power_readings`, which appears to be mostly equivalent for the purposes of this plugin (`<power_draw>` is unchanged). This output was taken from `nvidia-smi -x -q` bundled with NVIDIA drivers 536.40 on Windows 10 with an RTX 3080. I have seen the same thing for a GTX 1070 on a Windows 11 machine, and also on Linux. To reiterate: this seems to have happened in the last 2-3 months, on all platforms, as far as I am aware.

I assume this line needs to be adjusted to `gpu_power_readings`, and everything should work?

I have attached the full output of `nvidia-smi -x -q` (with the `<processes>` section pruned for privacy): smi.log. You can add it to the testdata samples if needed.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

@mbentley file a new issue

Hi. Just stumbled onto this, as I’m having the same issue. It seems that this is not in a tagged / released version yet?

If you take a look at the PR that fixed this issue it has a milestone attached to it. That is when this fix will be released. In this case our next minor release, v1.28.0, which is in a few weeks.

I assume this line needs to be adjusted to gpu_power_readings, and everything should work?

If you do that, you break everyone else who is still on older drivers or the older format 😉

What I would suggest is that we add new structs for GPUPowerReadings and ModulePowerReadings so that whichever of these is present gets captured.

Then we can call setIfUsed on those fields.

Would you be willing to put up a PR?

Thanks!