rmm: [BUG] RMM log file name contains .dev0 extension when GPU device used is not 0
When `rmm.enable_logging()` is passed a `log_file_name`, a `.dev0` extension is always used, even when the device in use is not device 0.
This reproducer demonstrates the issue:
```python
import cudf
import rmm
from rmm._cuda.gpu import getDevice
import glob
import os
import time

rmm.enable_logging(log_file_name="/tmp/rmmlog.csv")
s = cudf.Series([1])

print(f'CUDA_VISIBLE_DEVICES={os.environ.get("CUDA_VISIBLE_DEVICES")}')
print(f"getDevice() returned: {getDevice()}")
print(f'RMM logs present on disk: {glob.glob("/tmp/rmmlog.*")}')

print("sleeping 10 seconds to check nvidia-smi on the host...")
time.sleep(10)

rmm.mr._flush_logs()
rmm.disable_logging()
```
Output:
```
$> python /tmp/repro.py
CUDA_VISIBLE_DEVICES=1
getDevice() returned: 0
RMM logs present on disk: ['/tmp/rmmlog.dev0.csv']
sleeping 10 seconds to check nvidia-smi on the host...
```
`nvidia-smi` output:
```
...
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1594      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1594      G   /usr/lib/xorg/Xorg                 64MiB |
|    1   N/A  N/A      1673      G   /usr/bin/gnome-shell               65MiB |
|    1   N/A  N/A     17629      C   python                            609MiB |
+-----------------------------------------------------------------------------+
```
The process in question here is 17629 using GPU 1.
The expected behavior is to create a log file with an extension matching the GPU in use, which in the example above would be: `/tmp/rmmlog.dev1.csv`
NOTE: Just in case this is related, this demo was run in a container (hence the need to check `nvidia-smi` on the host) with multiple GPUs exposed from the host machine. Setting `CUDA_VISIBLE_DEVICES=0` in the container shows the process running on GPU 0 on the host as expected.
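For context on why the suffix comes out as `dev0`: CUDA renumbers whatever devices `CUDA_VISIBLE_DEVICES` exposes starting from 0, so the "internal" device ID RMM sees here is 0 even though the physical GPU is 1. The following is a minimal sketch of that renumbering, assuming a multi-GPU host where physical GPU 1 exists and CUDA has not yet been initialized in the process:

```python
# Minimal sketch of the renumbering behaviour (assumes a multi-GPU host and that
# CUDA has not been initialized yet; CUDA_VISIBLE_DEVICES must be set beforehand).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

from rmm._cuda.gpu import getDevice

# Physical GPU 1 is the only visible device, so it becomes internal device 0;
# the ".dev0" log suffix reported in this issue comes from that internal ID.
print(getDevice())  # prints 0
```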
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 20 (17 by maintainers)
Yes.
OK, maybe 1 year was an exaggeration - apologies. But even a week from now, it can be difficult, especially if you're in some sort of shared computing environment where `CUDA_VISIBLE_DEVICES` is used/changed frequently.

Anyway - after discussing more with @rlratzel, we decided to:

1. Keep the current behaviour w.r.t. suffixes and `CUDA_VISIBLE_DEVICES`, but document it carefully.
2. Provide a `get_log_filenames` API that returns a device-id-to-filename mapping. For now, the mapping would just be something like `{1: "rmmlog.dev1.txt", 0: "rmmlog.dev0.txt", 2: "rmmlog.dev2.txt"}`, where the keys are the "internal" device IDs. Users using logging in conjunction with `CUDA_VISIBLE_DEVICES` will need to do extra bookkeeping to map the internal device IDs back to physical IDs.

A mapping is returned instead of a list because (1) it gives us more flexibility as to how to name the output files, and (2) when initializing RMM, the user can specify devices in any arbitrary order (e.g., `devices=[2, 0, 1]`), and returning a list doesn't clarify which log filename corresponds to which device. A rough sketch of the intended usage is shown below.

cc @charlesbluca (in case you have thoughts here after having implemented RMM logging support in Dask-CUDA)
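The sketch below is purely illustrative: `get_log_filenames` and its return type are assumptions based on the proposal above rather than a shipped interface, and the physical-ID bookkeeping assumes `CUDA_VISIBLE_DEVICES` contains integer indices (it can also contain UUIDs):

```python
# Hypothetical usage of the proposed get_log_filenames API (name and return type
# are assumptions taken from the discussion above, not a confirmed interface).
import os
import rmm

rmm.enable_logging(log_file_name="/tmp/rmmlog.csv")

# Proposed: maps "internal" (CUDA-visible) device IDs to the log files RMM writes.
internal_to_file = rmm.get_log_filenames()  # e.g. {0: "/tmp/rmmlog.dev0.csv"}

# Extra bookkeeping to recover physical IDs when CUDA_VISIBLE_DEVICES is set
# (assumes it lists integer indices, not device UUIDs).
visible = os.environ.get("CUDA_VISIBLE_DEVICES")
if visible:
    physical = [int(d) for d in visible.split(",")]
    physical_to_file = {physical[i]: f for i, f in internal_to_file.items()}
else:
    physical_to_file = internal_to_file

print(physical_to_file)  # e.g. {1: "/tmp/rmmlog.dev0.csv"} when CUDA_VISIBLE_DEVICES=1
```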
I think that depends. Some users who set `CUDA_VISIBLE_DEVICES` will know and expect the first GPU they set in that list to come out as device 0. Others might expect `0` to be the first physical GPU.

Perhaps we need to log information about the GPU into the header of the log file so that the reader can unambiguously interpret it.
On second thoughts, maybe using a UUID just makes things less user-friendly?
From an end-user standpoint, device ID `0` should refer unambiguously to the 0th physical GPU (i.e., the one on top in the output of `nvidia-smi`), even though internally `0` can actually mean something else depending on `CUDA_VISIBLE_DEVICES`.

To leave no room for ambiguity, @kkraus14 suggested we don't use device IDs `0...n` in the log file names, but rather the device UUIDs. Would that work for you, @rlratzel?
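For reference, the UUIDs in question can be enumerated outside of RMM, for example with pynvml; this is just a sketch of where UUID-based suffixes could come from, not something RMM is shown exposing in this issue:

```python
# Sketch: list physical GPUs and their UUIDs via pynvml (used here purely for
# illustration). A UUID-based log name would look like /tmp/rmmlog.GPU-<uuid>.csv
# and is unaffected by CUDA_VISIBLE_DEVICES renumbering.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        uuid = pynvml.nvmlDeviceGetUUID(handle)
        name = pynvml.nvmlDeviceGetName(handle)
        # Older pynvml versions return bytes; normalize to str for printing.
        uuid = uuid.decode() if isinstance(uuid, bytes) else uuid
        name = name.decode() if isinstance(name, bytes) else name
        print(f"physical GPU {i}: {name} ({uuid})")
finally:
    pynvml.nvmlShutdown()
```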