go-nvml: Wrong output of `device.GetComputeRunningProcesses()` given multiple processes

Hi Kevin & Evan,

When multiple processes run on one GPU, the output of device.GetComputeRunningProcesses() is wrong: the values are shifted across the ProcessInfo entries of the different processes. The bug appears on both a V100-SXM2-16GB and a GTX 1080 Ti, each with CUDA version 10.2.

The test snippet is as follows.

package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("unable to initialize NVML: %v", ret)
	}
	defer nvml.Shutdown()
	device, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		log.Fatalf("unable to get device 0: %v", ret)
	}
	processInfos, ret := device.GetComputeRunningProcesses()
	if ret != nvml.SUCCESS {
		log.Fatalf("unable to get compute running processes: %v", ret)
	}
	for i, processInfo := range processInfos {
		fmt.Printf("\t[%2d] ProcessInfo: %v\n", i, processInfo)
	}
}

On V100 machines, I got this.

$nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-?) # UUID manually removed

$nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   55C    P0   130W / 300W |  13244MiB / 16160MiB |     53%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     48959      C   python                                      1179MiB |
|    0     72754      C   python                                      9611MiB |
|    0     73422      C   python                                      2443MiB |
+-----------------------------------------------------------------------------+

$go run main.go
	[ 0] ProcessInfo: {72754 10077863936 73422 0}
	[ 1] ProcessInfo: {2561671168 48959 1236271104 0}
	[ 2] ProcessInfo: {0 0 0 0}
# it is expected to be
#	[ 0] ProcessInfo: {72754 10077863936 0 0} # {PID, 9611 MiB, 0, 0}
#	[ 1] ProcessInfo: {73422 2561671168 0 0}  # {PID, 2443 MiB, 0, 0}
#	[ 2] ProcessInfo: {48959 1236271104 0 0}  # {PID, 1179 MiB, 0, 0}

On GTX 1080 Ti machines, I got this.

$nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 2: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 3: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 4: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 5: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 6: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 7: GeForce GTX 1080 Ti (UUID: GPU-?) # UUID manually removed

$nvidia-smi -i 0
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:3D:00.0 Off |                  N/A |
| 38%   67C    P2   247W / 250W |   6630MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     25907      C   python                                      2581MiB |
|    0     27576      C   python                                      4039MiB |
+-----------------------------------------------------------------------------+

$go run main.go
	[ 0] ProcessInfo: {25907 2706374656 27576 0}
	[ 1] ProcessInfo: {4235198464 0 0 0}
# it is expected to be
#	[ 0] ProcessInfo: {25907 2706374656 0 0} # {PID, 2581 MiB, 0, 0}
#	[ 1] ProcessInfo: {27576 4235198464 0 0} # {PID, 4039 MiB, 0, 0}

As a quick fix, I wrote a wrapper function that corrects the faulty ProcessInfo entries after they are returned; I hope this bug can be fixed properly in the near future.
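
For reference, here is a minimal sketch of such a wrapper (the package name and helper are made up for illustration and are not part of go-nvml). It assumes the bindings hand the slice's backing array directly to the driver, so the first 16*len(infos) bytes still hold the data written with the CUDA 10.x layout; it also assumes a little-endian host and Go 1.17+ (for unsafe.Slice), and that the field names match the go-nvml ProcessInfo type.

// Package procfix sketches a workaround for the nvmlProcessInfo_st layout
// mismatch between CUDA 10.x drivers and the CUDA 11 bindings.
package procfix

import (
	"encoding/binary"
	"unsafe"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// FixProcessInfos re-decodes the buffer backing the slice returned by
// GetComputeRunningProcesses() using the pre-CUDA-11 layout of
// nvmlProcessInfo_st: 16 bytes per entry, i.e. pid (4 bytes), 4 bytes of
// padding, usedGpuMemory (8 bytes). GpuInstanceId and ComputeInstanceId
// are left at zero, since CUDA 10.x does not report them.
func FixProcessInfos(infos []nvml.ProcessInfo) []nvml.ProcessInfo {
	if len(infos) == 0 {
		return infos
	}
	// View the slice's backing memory as raw bytes (assumes the driver's
	// bytes are still there and the host is little-endian).
	raw := unsafe.Slice((*byte)(unsafe.Pointer(&infos[0])),
		len(infos)*int(unsafe.Sizeof(infos[0])))

	fixed := make([]nvml.ProcessInfo, 0, len(infos))
	for i := 0; i < len(infos); i++ {
		off := i * 16 // stride of the CUDA 10.x struct
		fixed = append(fixed, nvml.ProcessInfo{
			Pid:           binary.LittleEndian.Uint32(raw[off : off+4]),
			UsedGpuMemory: binary.LittleEndian.Uint64(raw[off+8 : off+16]),
		})
	}
	return fixed
}

With this sketch, calling procfix.FixProcessInfos(processInfos) right after GetComputeRunningProcesses() should yield the expected {PID, usedGpuMemory, 0, 0} entries on the machines above.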

Finally, thanks for providing such a nice library, especially the useful thread-safe API. 😃

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 31 (13 by maintainers)

Most upvoted comments

Thanks for the report @qzweng, and sorry for taking so long to respond. Considering that you’re on CUDA 10.2, our current hypothesis is that the nvmlProcessInfo_st struct returned by the NVML call has had fields added in the CUDA 11.x API and that we are not handling this correctly.

We will look into it. As a side note, could you check whether GetGraphicsRunningProcesses shows the same behaviour?

Update: looking at nvml.h from CUDA 10.2, we have:

typedef struct nvmlProcessInfo_st
{
    unsigned int pid;                 //!< Process ID
    unsigned long long usedGpuMemory; //!< Amount of used GPU memory in bytes.
                                      //! Under WDDM, \ref NVML_VALUE_NOT_AVAILABLE is always reported
                                      //! because Windows KMD manages all the memory and not the NVIDIA driver
} nvmlProcessInfo_t;

whereas in CUDA 11 this is:

typedef struct nvmlProcessInfo_st
{
    unsigned int        pid;                //!< Process ID
    unsigned long long  usedGpuMemory;      //!< Amount of used GPU memory in bytes.
                                            //! Under WDDM, \ref NVML_VALUE_NOT_AVAILABLE is always reported
                                            //! because Windows KMD manages all the memory and not the NVIDIA driver
    unsigned int        gpuInstanceId;      //!< If MIG is enabled, stores a valid GPU instance ID. gpuInstanceId is set to
                                            //  0xFFFFFFFF otherwise.
    unsigned int        computeInstanceId;  //!< If MIG is enabled, stores a valid compute instance ID. computeInstanceId is set to
                                            //  0xFFFFFFFF otherwise.
} nvmlProcessInfo_t;

That would explain the behaviour.
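
To make the mismatch concrete, here is a small self-contained Go sketch (it does not call NVML) that packs the three V100 entries from the report with the 16-byte CUDA 10.2 layout and then reads the same buffer back with the 24-byte CUDA 11 layout, assuming the usual x86-64 field alignment and a little-endian host:

package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// PIDs and memory usage (in bytes) from the V100 example above.
	pids := []uint32{72754, 73422, 48959}
	mems := []uint64{10077863936, 2561671168, 1236271104}

	// Buffer sized for three CUDA 11 entries (24 bytes each), as the
	// bindings would allocate it.
	buf := make([]byte, 3*24)

	// The CUDA 10.2 driver writes entries at a 16-byte stride:
	// pid (4 bytes), 4 bytes of padding, usedGpuMemory (8 bytes).
	for i := range pids {
		off := i * 16
		binary.LittleEndian.PutUint32(buf[off:], pids[i])
		binary.LittleEndian.PutUint64(buf[off+8:], mems[i])
	}

	// The bindings then read the same buffer back at a 24-byte stride.
	for i := 0; i < 3; i++ {
		off := i * 24
		fmt.Printf("\t[%2d] ProcessInfo: {%d %d %d %d}\n", i,
			binary.LittleEndian.Uint32(buf[off:]),    // pid
			binary.LittleEndian.Uint64(buf[off+8:]),  // usedGpuMemory
			binary.LittleEndian.Uint32(buf[off+16:]), // gpuInstanceId
			binary.LittleEndian.Uint32(buf[off+20:])) // computeInstanceId
	}
	// Prints the same scrambled values as the report:
	//	[ 0] ProcessInfo: {72754 10077863936 73422 0}
	//	[ 1] ProcessInfo: {2561671168 48959 1236271104 0}
	//	[ 2] ProcessInfo: {0 0 0 0}
}

Running this reproduces exactly the scrambled values from the report, which supports the explanation above.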

I hit the same issue on CUDA 10.0 (driver 430.64), due to the struct size change of nvmlProcessInfo_st (https://github.com/NVIDIA/go-nvml/issues/21#issuecomment-866663067).

There is a similar issue with the official NVML Python bindings (nvidia-ml-py): nvidia-ml-py>=11.450.129 is not compatible with CUDA 10.x either, for the same reason as https://github.com/NVIDIA/go-nvml/issues/21#issuecomment-866663067. By the way, does anyone know how to report this to the maintainer of nvidia-ml-py? I cannot find any bug-report link on PyPI.