dcgm-exporter: Profiling metrics not being collected
Hello,
dcgmi version: 2.2.9
I built dcgm-exporter from source and am running it on a single GPU (Tesla K80). I can’t seem to get profiling metrics to show up, though other metrics show up fine.
root@node-0:/etc/dcgm-exporter# dcgm-exporter -f etc/dcp-metrics-included.csv -a :9402
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] No configmap data specified, falling back to metric file etc/dcp-metrics-included.csv
WARN[0000] Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled
Error: Unable to Get supported metric groups: This request is serviced by a module of DCGM that is not currently loaded.
It looks like the profiling module fails to load:
root@node-0:/etc/dcgm-exporter# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
+-----------+--------------------+--------------------------------------------------+
Though I’m not sure whether this is attributable to dcgm-exporter or dcgm, because I can’t get the metrics to load even when using dcgmi directly:
root@node-0:/home/user# dcgmi dmon -e 1010
# Entity PCIRX
Id
Error setting watches. Result: This request is serviced by a module of DCGM that is not currently loaded
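The module check can also be scripted. A minimal sketch that reads `dcgmi modules -l` output on stdin and prints the state of one module; the helper name `module_state` is made up for illustration, and the parsing assumes the pipe-delimited table layout shown in this report.

```shell
#!/bin/sh
# module_state NAME: read `dcgmi modules -l` output on stdin and print the
# State column of the module called NAME (for example "Profiling").
# Hypothetical helper; assumes rows of the form "| ID | Name | State |".
module_state() {
    awk -F'|' -v name="$1" '
        NF >= 5 {
            n = $3; s = $4
            gsub(/^[ \t]+|[ \t]+$/, "", n)   # trim the Name cell
            gsub(/^[ \t]+|[ \t]+$/, "", s)   # trim the State cell
            if (n == name) { print s; exit }
        }'
}

# Usage: dcgmi modules -l | module_state Profiling
```

A state other than "Loaded" for Profiling would explain why both dcgm-exporter and `dcgmi dmon` reject the DCGM_FI_PROF_* fields.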
I followed the instructions for building dcgm-exporter from source exactly, and the service runs inside a sidecar container that is responsible for collecting metrics.
How can I enable the collection of profiling metrics?
About this issue
- State: open
- Created 3 years ago
- Comments: 19
@yh0413,
Running nv-hostengine inside a Docker container when MIG is enabled can be tricky. nv-hostengine uses the MIG management API to get MIG profile information, which is privileged functionality. By default, a container does not have the capability needed to access that information. For example, this is how you could run a Docker container to allow it to access the MIG API:
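A minimal sketch of such a command, assuming the missing capability is SYS_ADMIN (the one NVIDIA's DCGM container instructions add with `--cap-add`). The image name and the helper name `mig_container_cmd` are placeholders, not something from this thread, and the script only prints the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Print a `docker run` command that grants the container SYS_ADMIN so the
# hostengine inside it can query MIG profile information.
# Assumptions: SYS_ADMIN is the capability that is missing by default;
# DCGM_IMAGE is a placeholder for an image containing dcgm-exporter;
# port 9402 matches the `-a :9402` address used earlier in this thread.
DCGM_IMAGE="${DCGM_IMAGE:-example/dcgm-exporter:latest}"

mig_container_cmd() {
    printf 'docker run --rm --gpus all --cap-add SYS_ADMIN -p 9402:9402 %s\n' "$DCGM_IMAGE"
}

mig_container_cmd
```

Printing first rather than executing keeps the privileged flag visible; once it looks right, the same line can be pasted into a shell.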
Usually, when MIG is enabled, we recommend running nv-hostengine on bare metal and letting dcgm-exporter connect to it instead of running an embedded hostengine.
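As a rough sketch of that setup: start `nv-hostengine` on the host, then point dcgm-exporter at it with its `-r` / `--remote-hostengine-info` option. The address below assumes the hostengine listens on its default port 5555 on the same machine; `HOSTENGINE_ADDR` and `exporter_cmd` are hypothetical names, and the `-f` and `-a` values are copied from the invocation in the original report.

```shell
#!/bin/sh
# Build the dcgm-exporter command line for connecting to a standalone
# nv-hostengine instead of starting an embedded one.
# Assumption: nv-hostengine is already running on the host and listening
# on its default port 5555.
HOSTENGINE_ADDR="${HOSTENGINE_ADDR:-localhost:5555}"

exporter_cmd() {
    # -r: host:port of the running hostengine
    # -f: metrics file, -a: listen address (both from the original report)
    printf 'dcgm-exporter -r %s -f etc/dcp-metrics-included.csv -a :9402\n' "$1"
}

exporter_cmd "$HOSTENGINE_ADDR"
```

With this split, only the bare-metal hostengine needs privileged access to the MIG API, and the exporter container stays unprivileged.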
I hope this helps.
WBR, Nik
@babinskiy,
There may be several reasons. Could you provide us with the debug logs from nv-hostengine?
nv-hostengine -f host.log --log-level debug
WBR, Nik