rmm: [BUG] RMM log file name contains .dev0 extension when GPU device used is not 0

When rmm.enable_logging() is passed a log_file_name, the log file is always created with a .dev0 suffix, even when the device in use is not device 0.

This reproducer demonstrates the issue:

import cudf
import rmm
from rmm._cuda.gpu import getDevice

import glob
import os
import time

rmm.enable_logging(log_file_name="/tmp/rmmlog.csv")
s = cudf.Series([1])  # trigger a device allocation so something is logged

print(f'CUDA_VISIBLE_DEVICES={os.environ.get("CUDA_VISIBLE_DEVICES")}')
print(f"getDevice() returned: {getDevice()}")
print(f'RMM logs present on disk: {glob.glob("/tmp/rmmlog.*")}')
print("sleeping 10 seconds to check nvidia-smi on the host...")
time.sleep(10)

rmm.mr._flush_logs()  # flush buffered log entries to disk (private API)
rmm.disable_logging()

output:

$> python /tmp/repro.py
CUDA_VISIBLE_DEVICES=1
getDevice() returned: 0
RMM logs present on disk: ['/tmp/rmmlog.dev0.csv']
sleeping 10 seconds to check nvidia-smi on the host...

nvidia-smi output:

...
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1594      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1594      G   /usr/lib/xorg/Xorg                 64MiB |
|    1   N/A  N/A      1673      G   /usr/bin/gnome-shell               65MiB |
|    1   N/A  N/A     17629      C   python                            609MiB |
+-----------------------------------------------------------------------------+

The process in question here is 17629 using GPU 1.

The expected behavior is to create a log file with a suffix matching the GPU in use, which in the example above would be /tmp/rmmlog.dev1.csv.
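For reference, a minimal sketch of the suffixing scheme described above (the helper name is mine, not an RMM API): a ".dev<id>" component is inserted before the extension of log_file_name, so the fix amounts to using the active device's ID rather than always 0.

import os

def expected_log_name(log_file_name: str, device_id: int) -> str:
    # Illustrative helper only: insert ".dev<id>" before the file extension.
    root, ext = os.path.splitext(log_file_name)
    return f"{root}.dev{device_id}{ext}"

print(expected_log_name("/tmp/rmmlog.csv", 1))  # -> /tmp/rmmlog.dev1.csv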

NOTE: Just in case this is related, this demo was run in a container (hence the need to check nvidia-smi on the host) with multiple GPUs exposed from the host machine. Setting CUDA_VISIBLE_DEVICES=0 in the container shows the process running on GPU 0 on the host as expected.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 20 (17 by maintainers)

Most upvoted comments

@jrhemstad one question re: your comment: if I set CUDA_VISIBLE_DEVICES=1,0 and then call cudaSetDevice(0), is it actually device 1 being set as the current device?

Yes.
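Not from the thread, but a small sketch that makes this remapping visible, assuming a multi-GPU machine and a recent Numba whose Device objects expose a uuid attribute:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,0"  # must be set before CUDA is initialized

from numba import cuda

cuda.select_device(0)            # "device 0" as enumerated by the CUDA runtime
dev = cuda.get_current_device()
print(dev.id, dev.uuid)          # this UUID matches physical GPU 1 in `nvidia-smi -L`, not GPU 0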

OK, maybe 1 year was an exaggeration - apologies 😃 But even a week from now it can be difficult, especially if you're in some sort of shared computing environment where CUDA_VISIBLE_DEVICES is used or changed frequently.

Anyway - after discussing more with @rlratzel, we decided to:

  1. Keep the current behaviour w.r.t. suffixes and CUDA_VISIBLE_DEVICES, but document it carefully

  2. Provide a get_log_filenames API that returns a device-id-to-filename mapping. For now, the mapping would just be something like {1: "rmmlog.dev1.txt", 0: "rmmlog.dev0.txt", 2: "rmmlog.dev2.txt"}, where the keys are the "internal" device IDs. Users using logging in conjunction with CUDA_VISIBLE_DEVICES will need to do extra bookkeeping to map the internal device IDs back to physical IDs (see the sketch after this list).

A mapping is returned instead of a list because (1) it gives us more flexibility as to how to name the output files, and (2) when initializing RMM, the user can specify devices in any arbitrary order (e.g., devices=[2, 0, 1]), and returning a list doesn't clarify which log filename corresponds to which device.
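To illustrate the bookkeeping mentioned in item 2, here is a rough sketch of how a caller might map the proposed get_log_filenames() output back to physical GPU indices. It assumes the proposed API and that CUDA_VISIBLE_DEVICES contains plain integer indices (not UUIDs or MIG identifiers):

import os
import rmm

# Proposed API; keys are RMM's "internal" device IDs.
log_files = rmm.get_log_filenames()  # e.g. {0: "rmmlog.dev0.csv", 1: "rmmlog.dev1.csv"}

visible = os.environ.get("CUDA_VISIBLE_DEVICES")
if visible:
    # CUDA enumerates only the listed devices, in the listed order, so
    # internal ID i corresponds to the i-th entry of CUDA_VISIBLE_DEVICES.
    physical = [int(d) for d in visible.split(",")]
    physical_to_log = {physical[i]: name for i, name in log_files.items()}
else:
    physical_to_log = dict(log_files)

print(physical_to_log)  # e.g. {1: "rmmlog.dev0.csv"} when CUDA_VISIBLE_DEVICES=1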

cc @charlesbluca (in case you have thoughts here after having implemented RMM logging support in Dask-CUDA 🙂)

I think that depends. Some users who set CUDA_VISIBLE_DEVICES will know and expect the first GPU they set in that list to come out as device 0. Others might expect 0 to be the first physical GPU.

Perhaps we need to log information about the GPU into the header of the log file so that the reader can unambiguously interpret it.

On second thoughts, maybe using a UUID just makes things less user-friendly?

From an end-user standpoint, device ID 0 should refer unambiguously to the 0th physical GPU (i.e., the one on top in the output of nvidia-smi), even though internally 0 can actually mean something else depending on CUDA_VISIBLE_DEVICES.

To leave no room for ambiguity, @kkraus14 suggested we don’t use device IDs 0...n in the log file names, but rather the device UUIDs – would that work for you @rlratzel?
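If UUIDs were used in the file names, an end user could match them to physical GPUs with something like the following pynvml sketch (not RMM functionality; note that NVML enumerates all physical GPUs and ignores CUDA_VISIBLE_DEVICES):

import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        uuid = pynvml.nvmlDeviceGetUUID(handle)
        if isinstance(uuid, bytes):  # older pynvml versions return bytes
            uuid = uuid.decode()
        print(i, uuid)               # e.g. "1 GPU-xxxxxxxx-...", matching nvidia-smi -L
finally:
    pynvml.nvmlShutdown()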