nvtrust: Unable to determine the device handle for GPU0009:01:00.0: Unknown Error
Hello.
I enabled CC mode with the provided gpu_cc_tool.py:
python3 ./gpu_cc_tool.py --gpu-name=H100 --set-cc-mode=on --reset-after-cc-mode-switch
Output:
NVIDIA GPU Tools version 535.86.06
Topo:
PCI 0009:00:00.0 0x10de:0x22b1
GPU 0009:01:00.0 H100-PCIE 0x2342 BAR0 0x661002000000
2023-12-29,11:51:50.039 INFO Selected GPU 0009:01:00.0 H100-PCIE 0x2342 BAR0 0x661002000000
2023-12-29,11:51:50.191 INFO GPU 0009:01:00.0 H100-PCIE 0x2342 BAR0 0x661002000000 CC mode set to on. It will be active after GPU reset.
2023-12-29,11:51:51.976 INFO GPU 0009:01:00.0 H100-PCIE 0x2342 BAR0 0x661002000000 was reset to apply the new CC mode.
Then, when running nvidia-smi, I get the following error: Unable to determine the device handle for GPU0009:01:00.0: Unknown Error
My architecture:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Vendor ID: ARM
Model: 0
Thread(s) per core: 1
Core(s) per socket: 72
Socket(s): 1
Stepping: r0p0
Frequency boost: disabled
CPU max MHz: 3447.0000
CPU min MHz: 81.0000
BogoMIPS: 2000.00
Even with this error, switching CC mode back to off worked. Unfortunately, I do not have that output, because I then ran sudo reboot.
After the reboot: No devices were found. I then ran nvidia-bug-report.sh; its output is attached to this discussion.
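For anyone retracing these steps, here is a minimal sketch of a wrapper around the same tool invocation. It only builds (and, optionally, runs) the command line. The `--set-cc-mode` and `--reset-after-cc-mode-switch` flags are the ones used in this post; `--query-cc-mode` is an assumption on my side, so please confirm it with the tool's `--help` before relying on it. The helper names `cc_tool_cmd` and `run_cc_tool` are hypothetical.

```python
import shlex
import subprocess

TOOL = "./gpu_cc_tool.py"  # path as used in this post; adjust to your checkout


def cc_tool_cmd(action, gpu_name="H100"):
    """Build the gpu_cc_tool.py argument list for 'query', 'on', or 'off'."""
    cmd = ["python3", TOOL, f"--gpu-name={gpu_name}"]
    if action == "query":
        # Assumed flag: verify with `python3 gpu_cc_tool.py --help`.
        cmd.append("--query-cc-mode")
    elif action in ("on", "off"):
        # Same flags as the command shown in this post.
        cmd += [f"--set-cc-mode={action}", "--reset-after-cc-mode-switch"]
    else:
        raise ValueError(f"unknown action: {action!r}")
    return cmd


def run_cc_tool(action, dry_run=True):
    """Print the command (dry run) or execute it and return its stdout."""
    cmd = cc_tool_cmd(action)
    if dry_run:
        print(shlex.join(cmd))
        return None
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Calling `run_cc_tool("off")` prints the command without touching the GPU; pass `dry_run=False` to actually run it and capture the log, so the output is not lost before a reboot.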
Finally, below is the output of ubuntu-drivers devices:
== /sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0 ==
modalias : pci:v000010DEd00002342sv000010DEsd00001809bc03sc02i00
vendor : NVIDIA Corporation
manual_install: True
driver : nvidia-driver-535-server - distro non-free
driver : nvidia-driver-535-server-open - distro non-free recommended
driver : nvidia-driver-535-open - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
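As a side note, the modalias line above encodes the PCI IDs, which can be cross-checked against the 0x2342 device ID that gpu_cc_tool.py printed. A small sketch, assuming the usual Linux `pci:v…d…sv…sd…bc…sc…i…` modalias layout; the function name `parse_pci_modalias` is hypothetical.

```python
import re

# Linux PCI modalias layout: vendor, device, subsystem vendor, subsystem
# device (8 hex digits each), then base class, subclass, prog-if (2 each).
MODALIAS_RE = re.compile(
    r"^pci:v(?P<vendor>[0-9A-F]{8})d(?P<device>[0-9A-F]{8})"
    r"sv(?P<subvendor>[0-9A-F]{8})sd(?P<subdevice>[0-9A-F]{8})"
    r"bc(?P<base_class>[0-9A-F]{2})sc(?P<sub_class>[0-9A-F]{2})"
    r"i(?P<prog_if>[0-9A-F]{2})$"
)


def parse_pci_modalias(modalias):
    """Return the modalias fields as a dict of integers."""
    match = MODALIAS_RE.match(modalias.strip())
    if match is None:
        raise ValueError(f"not a PCI modalias: {modalias!r}")
    return {name: int(value, 16) for name, value in match.groupdict().items()}


info = parse_pci_modalias(
    "pci:v000010DEd00002342sv000010DEsd00001809bc03sc02i00"
)
print(hex(info["vendor"]), hex(info["device"]))  # 0x10de 0x2342
```

So the device the driver tooling sees (vendor 0x10de, device 0x2342) is the same H100-PCIE that the CC tool selected.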
About this issue
- State: open
- Created 6 months ago
- Comments: 20
Also, there is one thing to be aware of: the H100’s CC functionality is still in early access, and many features are subject to frequent change. I believe CCA will be supported soon 😃
@CasellaJr If you want to use the CC functionality of H100 GPUs, you need a machine that supports AMD SEV-SNP or Intel TDX, configured according to NVIDIA’s manual. The deployment guide only mentions TDX and SNP. So you cannot use H100 CC on an unsupported machine, but you can certainly still use its compute resources, e.g., to train models.
If you enable CC mode on the H100, it blocks all host I/O requests, because the H100 assumes the host platform is completely untrusted.
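The platform requirement mentioned above (SEV-SNP or TDX) can be roughly pre-checked before switching CC mode on. The sketch below is only a first-pass filter, not a substitute for the deployment guide: the cpuinfo flag names (`sev_snp`, `tdx_host_platform`, `tdx_guest`) are assumptions that depend on the kernel version, and the helper names are hypothetical. What it does get right for certain is that both technologies are x86 features, so an aarch64 host like the one in this issue fails immediately.

```python
import platform

# Assumed /proc/cpuinfo flag names; they vary with kernel version.
SNP_FLAGS = {"sev_snp"}
TDX_FLAGS = {"tdx_host_platform", "tdx_guest"}


def read_cpu_flags(path="/proc/cpuinfo"):
    """Collect CPU feature flags ('flags' on x86, 'Features' on arm64)."""
    flags = set()
    try:
        with open(path) as cpuinfo:
            for line in cpuinfo:
                key, _, value = line.partition(":")
                if key.strip().lower() in ("flags", "features"):
                    flags.update(value.split())
    except OSError:
        pass  # no cpuinfo available: return an empty set
    return flags


def cc_host_candidate(machine, cpu_flags):
    """Return (ok, reason): whether the host might support H100 CC mode."""
    if machine not in ("x86_64", "AMD64"):
        return False, f"{machine}: SEV-SNP and TDX are x86-only"
    flags = set(cpu_flags)
    if flags & SNP_FLAGS:
        return True, "AMD SEV-SNP flag found"
    if flags & TDX_FLAGS:
        return True, "Intel TDX flag found"
    return False, "no SEV-SNP or TDX flag found (may need BIOS/kernel setup)"


if __name__ == "__main__":
    ok, reason = cc_host_candidate(platform.machine(), read_cpu_flags())
    print("CC candidate" if ok else "Not a CC candidate", "-", reason)
```

A negative result on x86 does not prove the hardware lacks support, since the flag may be hidden until the feature is enabled in firmware and the kernel.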