dcgm-exporter: Profiling metrics not being collected

Hello,

dcgmi version: 2.2.9

I built dcgm-exporter from source and am running it on a single GPU (Tesla K80). I can’t seem to get profiling metrics to show up, though other metrics show up fine.

root@node-0:/etc/dcgm-exporter# dcgm-exporter -f etc/dcp-metrics-included.csv  -a :9402
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] No configmap data specified, falling back to metric file etc/dcp-metrics-included.csv
WARN[0000] Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled

Error: Unable to Get supported metric groups: This request is serviced by a module of DCGM that is not currently loaded.

It looks like the profiling module fails to load:

root@node-0:/etc/dcgm-exporter# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
+-----------+--------------------+--------------------------------------------------+
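For longer module lists, a small filter can pull out the rows whose state is anything other than "Loaded". This is only a sketch: the function name `unloaded_modules` is made up for illustration, and it assumes the table layout shown above (pipe-separated columns with a numeric module ID first).

```shell
# Hypothetical helper: print "<name>: <state>" for every DCGM module whose
# state is not "Loaded". Expects `dcgmi modules -l` style output on stdin.
unloaded_modules() {
  awk -F'|' '
    # Data rows are the ones with a purely numeric module ID in column 1.
    $2 ~ /^[ \t]*[0-9]+[ \t]*$/ {
      name = $3; state = $4
      gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim padding around the name
      gsub(/^[ \t]+|[ \t]+$/, "", state)   # trim padding around the state
      if (state != "Loaded") print name ": " state
    }'
}

# Usage (needs a running host engine):
#   dcgmi modules -l | unloaded_modules
```

On the output above this would list VGPU through Diag as "Not loaded" and Profiling as "Failed to load", which makes the one module that actually failed easy to pick out.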

Though I’m not sure whether this is attributable to dcgm-exporter or to DCGM itself, because I can’t get the metrics to load even when using dcgmi directly:

root@node-0:/home/user# dcgmi dmon -e 1010
# Entity                 PCIRX
      Id
Error setting watches. Result: This request is serviced by a module of DCGM that is not currently loaded

I followed the instructions for building dcgm-exporter from source, and the service runs inside a sidecar container that is responsible for collecting metrics.

How can I enable the collection of profiling metrics?

Most upvoted comments

@yh0413,

Running nv-hostengine inside a Docker container can be tricky when MIG is enabled. nv-hostengine uses the MIG management API to get MIG profile information, which is privileged functionality, so by default a container does not have the capability needed to access it. For example, this is how you could run a Docker container so that it can access the MIG API:

$ docker run --cap-add SYS_ADMIN --runtime=nvidia \
  --gpus all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_MIG_CONFIG_DEVICES=all \
  -e NVIDIA_MIG_MONITOR_DEVICES=all \
  ...

Usually, when MIG is enabled, we recommend running nv-hostengine on bare metal and letting dcgm-exporter connect to it instead of running an embedded hostengine.
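A rough sketch of that setup, for illustration only: the wrapper name `run_exporter_remote` is hypothetical, the address assumes nv-hostengine is listening on its default port 5555 on the same node, and the metrics file and listen address are simply copied from the command in the question. The `-r` option is dcgm-exporter's remote-hostengine setting; check `dcgm-exporter --help` for your build before relying on it.

```shell
# Step 1, on the bare-metal host (as root), start the standalone host engine:
#   nv-hostengine
#
# Step 2, in the exporter container, connect to it instead of embedding one.
# Hypothetical wrapper; the first argument overrides the assumed default
# address of localhost:5555.
run_exporter_remote() {
  hostengine_addr="${1:-localhost:5555}"
  dcgm-exporter -r "$hostengine_addr" \
    -f etc/dcp-metrics-included.csv \
    -a :9402
}

# Usage:
#   run_exporter_remote              # host engine on the same node
#   run_exporter_remote 10.0.0.5:5555
```

The container must be able to reach the host engine's port over the network, for example through host networking.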

I hope that helps.

WBR, Nik

@babinskiy,

There may be several reasons. Could you provide us with the debug logs from nv-hostengine? You can generate them with:

nv-hostengine -f host.log --log-level debug

WBR, Nik