go-nvml: Wrong output of `device.GetComputeRunningProcesses()` given multiple processes
Hi Kevin & Evan,
When multiple processes run on one GPU, I found that the output of device.GetComputeRunningProcesses() is wrong: the values are misplaced across the different processes' ProcessInfo structs. The bug appears on both Tesla V100-SXM2-16GB and GeForce GTX 1080 Ti, with CUDA version 10.2.
The test code snippet is as follows.
package main

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	nvml.Init()
	device, _ := nvml.DeviceGetHandleByIndex(0)
	processInfos, _ := device.GetComputeRunningProcesses()
	for i, processInfo := range processInfos {
		fmt.Printf("\t[%2d] ProcessInfo: %v\n", i, processInfo)
	}
}
On V100 machines, I got this.
$nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-?) # UUID manually removed
$nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   55C    P0   130W / 300W |  13244MiB / 16160MiB |     53%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     48959      C   python                                     1179MiB |
|    0     72754      C   python                                     9611MiB |
|    0     73422      C   python                                     2443MiB |
+-----------------------------------------------------------------------------+
$go run main.go
[ 0] ProcessInfo: {72754 10077863936 73422 0}
[ 1] ProcessInfo: {2561671168 48959 1236271104 0}
[ 2] ProcessInfo: {0 0 0 0}
# it is expected to be
# [ 0] ProcessInfo: {72754 10077863936 0 0} # {PID, 9611 MiB, 0, 0}
# [ 1] ProcessInfo: {73422 2561671168 0 0} # {PID, 2443 MiB, 0, 0}
# [ 2] ProcessInfo: {48959 1236271104 0 0} # {PID, 1179 MiB, 0, 0}
On GTX 1080 Ti machines, I got this.
$nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 2: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 3: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 4: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 5: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 6: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 7: GeForce GTX 1080 Ti (UUID: GPU-?) # UUID manually removed
$nvidia-smi -i 0
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:3D:00.0 Off |                  N/A |
| 38%   67C    P2   247W / 250W |   6630MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     25907      C   python                                     2581MiB |
|    0     27576      C   python                                     4039MiB |
+-----------------------------------------------------------------------------+
$go run main.go
[ 0] ProcessInfo: {25907 2706374656 27576 0}
[ 1] ProcessInfo: {4235198464 0 0 0}
# it is expected to be
# [ 0] ProcessInfo: {25907 2706374656 0 0} # {PID, 2581 MiB, 0, 0}
# [ 1] ProcessInfo: {27576 4235198464 0 0} # {PID, 4039 MiB, 0, 0}
As a quick fix, I wrote a wrapper function that corrects the faulty ProcessInfo values as they are returned (a sketch of the idea is below); I hope this bug can be fixed in the library in the near future.
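The idea of my wrapper is roughly the following sketch. It is only a sketch under my assumptions: a little-endian x86_64 machine, a CUDA 10.x driver that filled the buffer with the legacy 16-byte layout (4-byte pid, 4 bytes of padding, 8-byte usedGpuMemory), and the NVML call writing directly into the memory backing the slice that go-nvml returns (which is consistent with the garbled values above). The helper name fixProcessInfos is just for illustration, not part of go-nvml.

package main

import (
	"encoding/binary"
	"fmt"
	"unsafe"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// fixProcessInfos reinterprets the memory backing infos as legacy CUDA 10.x
// entries (16 bytes each) and rebuilds a corrected slice. A zero pid is
// treated as the unused tail of the buffer.
func fixProcessInfos(infos []nvml.ProcessInfo) []nvml.ProcessInfo {
	if len(infos) == 0 {
		return infos
	}
	// View the slice's backing array as raw bytes (requires Go 1.17+).
	raw := unsafe.Slice((*byte)(unsafe.Pointer(&infos[0])),
		len(infos)*int(unsafe.Sizeof(infos[0])))

	const legacySize = 16 // sizeof(nvmlProcessInfo_t) in CUDA 10.x
	fixed := make([]nvml.ProcessInfo, 0, len(infos))
	for off := 0; off+legacySize <= len(raw) && len(fixed) < len(infos); off += legacySize {
		pid := binary.LittleEndian.Uint32(raw[off : off+4])
		mem := binary.LittleEndian.Uint64(raw[off+8 : off+16])
		if pid == 0 {
			break // remaining bytes are the untouched tail of the buffer
		}
		fixed = append(fixed, nvml.ProcessInfo{Pid: pid, UsedGpuMemory: mem})
	}
	return fixed
}

func main() {
	nvml.Init()
	defer nvml.Shutdown()
	device, _ := nvml.DeviceGetHandleByIndex(0)
	processInfos, _ := device.GetComputeRunningProcesses()
	for i, processInfo := range fixProcessInfos(processInfos) {
		fmt.Printf("\t[%2d] ProcessInfo: %v\n", i, processInfo)
	}
}

Note that this only papers over the mismatch on CUDA 10.x drivers; with a CUDA 11 driver the slice returned by go-nvml is already laid out correctly, so the real fix belongs in the library.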
Thanks again for providing such a nice library, especially the useful “thread-safe” feature. 😃
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 31 (13 by maintainers)
Thanks for the report @qzweng, and sorry for taking so long to respond. Considering that you're on CUDA 10.2, our current hypothesis is that the nvmlProcessInfo_st returned by the NVML call has had fields added in the CUDA 11.x API and that we are not handling this correctly. We will look into it. As a side note, could you check whether GetGraphicsRunningProcesses shows the same behaviour?
Update: comparing nvml.h from CUDA 10.2 with the CUDA 11 version, the struct definitions differ (see the sketch below), which would explain the behaviour.
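For reference, a sketch of the mismatch: the C declarations are paraphrased from the respective nvml.h headers, the struct sizes assume typical x86_64 alignment, the processInfoV1/processInfoV2 names are only for illustration, and processInfoV2 mirrors the ProcessInfo type that go-nvml generates from the CUDA 11 header.

// Sketch of the nvmlProcessInfo_st layout change between CUDA 10.x and 11.x.
package main

import (
	"fmt"
	"unsafe"
)

// CUDA 10.2 nvml.h declares only:
//     unsigned int       pid;
//     unsigned long long usedGpuMemory;
// CUDA 11.x nvml.h appends:
//     unsigned int       gpuInstanceId;
//     unsigned int       computeInstanceId;

type processInfoV1 struct { // CUDA 10.x layout
	Pid           uint32
	UsedGpuMemory uint64
}

type processInfoV2 struct { // CUDA 11.x layout, as mirrored by go-nvml's ProcessInfo
	Pid               uint32
	UsedGpuMemory     uint64
	GpuInstanceId     uint32
	ComputeInstanceId uint32
}

func main() {
	// Prints "16 24": a CUDA 10.x driver fills the buffer with the smaller
	// entries while the bindings step through it in 24-byte strides, which
	// shifts every entry after the first and yields the misplaced values
	// reported above.
	fmt.Println(unsafe.Sizeof(processInfoV1{}), unsafe.Sizeof(processInfoV2{}))
}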
Same issue on CUDA 10.0 (driver 430.64), due to the struct size change of nvmlProcessInfo_st (https://github.com/NVIDIA/go-nvml/issues/21#issuecomment-866663067).
The official NVML Python bindings (nvidia-ml-py) have a similar issue: nvidia-ml-py>=11.450.129 is not compatible with CUDA 10.x either, for the same reason as https://github.com/NVIDIA/go-nvml/issues/21#issuecomment-866663067. BTW, does anyone know how to report this to the maintainer of nvidia-ml-py? I cannot find any bug report link on PyPI.