ROCm: Pcie atomics not enabled, hostcall not supported on gfx1100 RX7900
I tried to use PyTorch with ROCm, but it fails with:
:1:rocvirtual.cpp :2902: 1550313166 us: 7740 : [tid:0x7f5681dfb6c0] Pcie atomics not enabled, hostcall not supported
:1:rocvirtual.cpp :3235: 1550313176 us: 7740 : [tid:0x7f5681dfb6c0] AQL dispatch failed!
HIP error: the operation cannot be performed in the present state
From previous issues in this repository, it seems like PCIe atomics were only a problem with gfx8 GPUs and old CPUs, so I’m wondering why I have this problem. I couldn’t find much information about which CPUs support this feature and which don’t. Is there a compatibility list somewhere?
ROCm version: 5.6
PyTorch version: 2.1.0.dev20230901+rocm5.6
GPU: RX 7900 XT
CPU: i5-11400F
dmesg | grep atomic
amdgpu 0000:03:00.0: amdgpu: PCIE atomic ops is not supported
However, I don’t get the infamous "kfd: PCI rejects atomics" message.
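For anyone comparing their own logs, here is a minimal sketch that sorts dmesg output by the two messages mentioned above. The message fragments are copied from this issue; other kernel versions may word them differently. The function name and the dictionary keys are made up for illustration.

```python
# Message fragments quoted in this issue; treat them as assumptions,
# since other kernel versions may phrase these messages differently.
PATTERNS = {
    "amdgpu_no_atomics": "PCIE atomic ops is not supported",
    "kfd_rejects_atomics": "PCI rejects atomics",
}


def classify_atomics_messages(dmesg_text):
    """Map each known message key to the dmesg lines that contain it."""
    hits = {key: [] for key in PATTERNS}
    for line in dmesg_text.splitlines():
        for key, fragment in PATTERNS.items():
            if fragment in line:
                hits[key].append(line.strip())
    return hits


# The single line reported in this issue.
sample = "amdgpu 0000:03:00.0: amdgpu: PCIE atomic ops is not supported\n"
hits = classify_atomics_messages(sample)
print(hits)
```

To run it against a live system, pass it the output of `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`; reading the kernel log may require root on some distributions.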
lspci -tv
-[0000:00]-+-00.0 Intel Corporation Device 4c53
+-01.0-[01-03]----00.0-[02-03]----00.0-[03]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX]
| \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 HDMI/DP Audio
+-06.0-[04]----00.0 Micron/Crucial Technology P5 Plus NVMe PCIe SSD
+-14.0 Intel Corporation Tiger Lake-H USB 3.2 Gen 2x1 xHCI Host Controller
+-14.2 Intel Corporation Tiger Lake-H Shared SRAM
+-15.0 Intel Corporation Tiger Lake-H Serial IO I2C Controller #0
+-16.0 Intel Corporation Tiger Lake-H Management Engine Interface
+-17.0 Intel Corporation Device 43d2
+-1c.0-[05]--
+-1d.0-[06]--
+-1f.0 Intel Corporation B560 LPC/eSPI Controller
+-1f.3 Intel Corporation Tiger Lake-H HD Audio Controller
+-1f.4 Intel Corporation Tiger Lake-H SMBus Controller
+-1f.5 Intel Corporation Tiger Lake-H SPI Controller
\-1f.6 Intel Corporation Ethernet Connection (14) I219-V
lspci -vvvs 00:01.0
...
DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd+
AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled, ARIFwd-
AtomicOpsCtl: ReqEn+ EgressBlck+
...
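The AtomicOpsCap and AtomicOpsCtl lines are the relevant part of the output above. As a rough sketch, they can be pulled out of `lspci -vvv` text as shown below. The function name is hypothetical, and the parser assumes the text describes a single device, as with `lspci -vvvs 00:01.0`. My understanding is that on a root port `Routing`, `32bit`, `64bit` and `128bitCAS` report AtomicOp routing and completer support, so all four being `-` would be consistent with the amdgpu message, but please treat that reading as an assumption.

```python
import re


def parse_atomic_ops(lspci_text):
    """Return {'Cap': {flag: bool}, 'Ctl': {flag: bool}} from `lspci -vvv` text.

    Assumes the text describes one device; with several devices,
    later entries overwrite earlier ones.
    """
    result = {}
    for kind, flags in re.findall(r"AtomicOps(Cap|Ctl):\s*(.*)", lspci_text):
        # Each flag is printed as NAME+ (set) or NAME- (clear).
        result[kind] = {
            name: sign == "+" for name, sign in re.findall(r"(\w+)([+-])", flags)
        }
    return result


# The two lines from the root port output in this issue.
sample = """\
AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
AtomicOpsCtl: ReqEn+ EgressBlck+
"""
caps = parse_atomic_ops(sample)
print(caps["Cap"])
print(caps["Ctl"])
```

With the sample from this issue, every capability flag parses as unset while both control flags are set.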
About this issue
- State: closed
- Created 10 months ago
- Reactions: 2
- Comments: 26
I’ve installed ROCm 6.0.0 and PyTorch 5.7 on my Ubuntu installation and, surprisingly, it works.
@MattisBergmann This link has the wheels that fix the above issue. Please try them and let us know: https://repo.radeon.com/rocm/manylinux/.private-05b1d2750b39ef78de979ed9f59ce4c6/297/
Also, please refer to this issue for a more detailed discussion: https://github.com/pytorch/pytorch/issues/103973