ROCm: [Issue]: kernel NULL pointer dereference and device open freeze
Problem Description
I get “BUG: kernel NULL pointer dereference, address: 0000000000000000”, and clinfo freezes halfway through. After running a few other programs, all GPU-based applications freeze while starting. This started after upgrading from Ubuntu 23.04 and ROCm 5.7 to Ubuntu 23.10 and ROCm 5.7.1.
dmesg.txt. The problem also occurred on the next boot with the vboxdrv module blocklisted.
Operating System
Ubuntu 23.10 (Mantic Minotaur)
CPU
AMD Ryzen 9 5900X 12-Core Processor
GPU
RX 6650 XT
ROCm Version
5.7.1
ROCm Component
Kernel
Steps to Reproduce
- Run firefox, glxinfo, glxgears, vulkaninfo, and vkcube. These do not crash.
- Run /opt/rocm/bin/clinfo. It crashes after the first bit of output, saying it cannot compile the program.
- Run a script using transformers/pytorch-rocm.
- At the same time, start gpt4all. On the previous version of Ubuntu and ROCm, it crashed the first time I ran it but worked the second time.
- While the transformers script is still running, run gpt4all again. Both freeze after this.
- Now everything that needs the GPU freezes when it tries to open the device on startup.
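For anyone trying to confirm they are hitting the same thing after following the steps above, here is a minimal sketch that filters a kernel log for the two messages quoted in this thread. The function name and the pattern list are my own and are not part of any ROCm tool; the list is illustrative, not exhaustive.

```python
import re

# Messages quoted in this thread. This list is an assumption about what is
# worth flagging; other amdgpu/amdkfd faults are not covered.
FAULT_PATTERNS = [
    re.compile(r"BUG: kernel NULL pointer dereference"),
    re.compile(r"GCVM_L2_PROTECTION_FAULT_STATUS"),
]


def find_gpu_faults(log_text):
    """Return the log lines that match any of the patterns above."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in FAULT_PATTERNS)
    ]


if __name__ == "__main__":
    # The first line is the message from the problem description; the second
    # is a made-up placeholder standing in for unrelated log output.
    sample = (
        "BUG: kernel NULL pointer dereference, address: 0000000000000000\n"
        "placeholder: unrelated log line\n"
    )
    for hit in find_gpu_faults(sample):
        print(hit)
```

To use it on a real machine, pass in the text of a saved log, for example the contents of the attached dmesg.txt.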
Output of /opt/rocm/bin/rocminfo --support
ROCk module is loaded
(output suddenly freezes here)
(after reboot:)
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 9 5900X 12-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 5900X 12-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3700
BDFID: 0
Internal Node ID: 0
Compute Unit: 24
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32770204(0x1f4089c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32770204(0x1f4089c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32770204(0x1f4089c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1030
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 6650 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 2048(0x800) KB
L3: 32768(0x8000) KB
Chip ID: 29679(0x73ef)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2765
BDFID: 12032
Internal Node ID: 1
Compute Unit: 32
SIMDs per CU: 2
Shader Engines: 2
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 109
SDMA engine uCode:: 76
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS:
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1030
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
About this issue
- State: open
- Created 8 months ago
- Reactions: 16
- Comments: 74
Commits related to this issue
- Add patch fixing amdgpu compute (see ROCm/ROCm#2596) — committed to SwooshyCueb/pkgbuild-linux-kitsinger by SwooshyCueb 7 months ago
- Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole" That commit causes NULL pointer dereferences in dmesgs when running applications using ROCm, including clinfo, blender, and PyTorch, ... — committed to mj22226/linux by fee1-dead 6 months ago
- Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole" commit 0f35b0a7b8fa402adbffa2565047cdcc4c480153 upstream. That commit causes NULL pointer dereferences in dmesgs when running applic... — committed to namjaejeon/stable-kernel by fee1-dead 6 months ago
- Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole" commit 0f35b0a7b8fa402adbffa2565047cdcc4c480153 upstream. That commit causes NULL pointer dereferences in dmesgs when running applic... — committed to johnny-mnemonic/linux-stable-rc by fee1-dead 6 months ago
- Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole" commit 0f35b0a7b8fa402adbffa2565047cdcc4c480153 upstream. That commit causes NULL pointer dereferences in dmesgs when running applic... — committed to Whissi/linux-stable by fee1-dead 6 months ago
- Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole" commit 0f35b0a7b8fa402adbffa2565047cdcc4c480153 upstream. That commit causes NULL pointer dereferences in dmesgs when running applic... — committed to gregkh/linux by fee1-dead 6 months ago
- UPSTREAM: Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole" That commit causes NULL pointer dereferences in dmesgs when running applications using ROCm, including clinfo, blender, and... — committed to greenforce-project/chromeos-kernel-mirror by deleted user 4 months ago
Slightly different stacktrace that ends up with NULL pointer dereference. Reproducible with Arch Linux + AMD Ryzen 7 6800H. Happens when I switch Blender 4.1’s render engine to HIP.
Steps to reproduce:
Kernel: 6.6.2-arch1-1 #1 SMP PREEMPT_DYNAMIC
Rocm version:
dmesg
clinfo
rocminfo --support: rocminfo.txt
The situation is worse on Linux 6.6-rc7:
Now the TTY shows those three lines. After a few moments, the screen freezes, and once the SSH session froze as well. With the new kernel, not only is ROCm unusable and my computer unable to shut down, but the entire computer becomes unusable once those moments pass.
The exact same problem also happens on an RX 6600 + Ryzen 5 7600. It happens when Blender 4.0 tries to load anything ROCm-related.
This seems to only affect AMD CPUs, I don’t see any Intel CPUs around here…
dmesg.txt rocminfo.txt
I had to compile the kernel on Arch to get it working. That will work for the Gentoo folks too, but having it fixed in the mainline Linux kernel would fix it for everyone.
I pinged Christian to see if there is any update. I’ll post here (or get him to update the amdgfx thread) once I hear back
Just upgraded the `linux` package to `6.7.2.arch1-1` on Arch Linux and the issue seems to be resolved, at least for Blender.
@OzzyHelix It can be applied to the newest linux-next (next-20240125). It seems to work on my side and also seems to fix the ROCm-OpenGL interop problem while using Blender.
if this can get merged into the linux kernel the issue should be resolved for a lot of folks
For me, `vm_update_mode=3` made it worse: now processes also stay stuck until I reset my PC. (Arch Linux + 5900x CPU + 5700xt GPU)
I found a workaround! I got things to work without crashing with `GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.runpm=0 amdgpu.vm_update_mode=3"` in `/etc/default/grub`. I updated to ROCm 6.0.
Thank you @terryrankine for https://github.com/ROCm/ROCm/issues/2766#issuecomment-1867179386 . Btw, the “what is actually happening” might be explained by https://lists.freedesktop.org/archives/amd-gfx/2023-October/100322.html .
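A change to `/etc/default/grub` only takes effect after the GRUB config is regenerated (`sudo update-grub` on Ubuntu) and the machine is rebooted, so it is worth checking that the options actually reached the running kernel. Below is a rough sketch for that check. The helper names are hypothetical, and the two expected values are simply the ones from the workaround above, not a recommendation from AMD.

```python
def amdgpu_params(cmdline):
    """Collect amdgpu.<name>=<value> options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            name, value = token[len("amdgpu."):].split("=", 1)
            params[name] = value
    return params


def workaround_active(cmdline):
    """True if both options from the workaround in this thread are present."""
    params = amdgpu_params(cmdline)
    return params.get("runpm") == "0" and params.get("vm_update_mode") == "3"


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as f:
        booted = f.read()
    print(amdgpu_params(booted))
    print("workaround active:", workaround_active(booted))
```

Keep in mind that another commenter reported `vm_update_mode=3` making things worse on their system, so a positive result here only confirms the options are set, not that they help.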
I’ve also noticed that I can reproduce `GCVM_L2_PROTECTION_FAULT_STATUS` if I tell ROCm to use the wrong GPU model. My GPU uses `HSA_OVERRIDE_GFX_VERSION=10.3.0`, but if I set `HSA_OVERRIDE_GFX_VERSION=11.0.0` or something, I get those errors. Docker is another environment where I get different errors in dmesg, so there seems to be a large userspace aspect to this bug (although userspace shouldn’t be able to freeze kernelspace).
PyTorch
Stable diffusion and LlamaIndex are working after recompiling pytorch-rocm and bitsandbytes-rocm. https://github.com/ROCmSoftwarePlatform/pytorch/issues/1340
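On the `HSA_OVERRIDE_GFX_VERSION` values mentioned earlier in this comment: a quick way to avoid setting the wrong one is to derive it from the ISA name that rocminfo prints (gfx1030 in the output above). This sketch assumes the usual naming scheme where the last two characters of the name are the minor version and the stepping, each a single hex digit, and everything before them is the major version. The function name is made up for illustration.

```python
def gfx_to_override(gfx_name):
    """Turn an ISA name such as 'gfx1030' into an override string such as '10.3.0'.

    Assumption: the name is 'gfx' + decimal major + one hex digit for the
    minor version + one hex digit for the stepping.
    """
    if not gfx_name.startswith("gfx") or len(gfx_name) < 6:
        raise ValueError(f"unexpected ISA name: {gfx_name!r}")
    body = gfx_name[len("gfx"):]
    major = int(body[:-2])
    minor = int(body[-2], 16)
    stepping = int(body[-1], 16)
    return f"{major}.{minor}.{stepping}"


if __name__ == "__main__":
    print(gfx_to_override("gfx1030"))
```

The override only tells the runtime which ISA to treat the GPU as; as noted above, picking one that does not match the hardware family produces faults in dmesg.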
Hashcat
The Ubuntu repos’ version doesn’t work, as seen with `hashcat -b`. It might work if I tried recompiling it, but I do not have a use case to bother trying.
llama.cpp
llama.cpp and gpt4all work after putting `target_compile_options(ggml-rocm PRIVATE --offload-arch=gfx1030)` in some CMakeLists.txt. Debug mode is now so slow that it looks frozen (it looks like #2625 and https://github.com/ROCm/ROCK-Kernel-Driver/issues/153 when backtracing in gdb or with `AMD_LOG_LEVEL=5 HSA_ENABLE_SDMA=0`, but this is an illusion), but release mode is somewhat fine. I am worried that the performance is not as good as it was before, because I was getting 47 tokens/s in April, and now it sometimes goes down to 30 tokens/s.
OpenCL
clinfo doesn’t freeze the computer anymore; however, I haven’t tested any OpenCL programs. Other people should test Blender, darktable, and DaVinci Resolve. I want to get into those programs and I have a few installed, but I haven’t had enough time to learn how to use them properly.
Kernel
I will not close the issue yet. Blender and other apps still need to be tested by other people. Userspace should not be able to crash kernelspace with default options. This workaround apparently moves something to the CPU so it will slow things down, and the root cause needs to be addressed as someone mentioned on the mailing list.
Updating here for clarity. Christian is taking a look at the issue internally. Seems like some of the page tables aren’t CPU accessible.
@GZGavinZhao using daily Blender 4.1.0 beta does in fact fix my issue, thanks for suggesting! I’d tried their build of 4.0.2 and gave up because it crashed with the same issue.
Fixed for me with vanilla kernel 6.7.4.
The issue appears to be fixed with the `linux-zen` package at version `6.7.2-zen1-1-zen` on Arch Linux, but I don’t know if the issue will return.
Can confirm, after upgrading to 6.7.2, the issue is gone completely. Everything now seems to work! Although I’ve only tested with Blender, I’m assuming that it works fine for other compute loads too.
Update: Blender often crashes when it’s trying to render with an iGPU + dGPU. Individual GPU rendering works perfectly well so far!
Memory access fault by GPU node-2 (Agent handle: 0x781767d00400) on address 0x7815cb5fe000. Reason: Page not present or supervisor privilege.
zsh: IOT instruction (core dumped) blender
dmesg:
Maybe this patch would help? Rather than just reverting that commit. https://patchwork.freedesktop.org/patch/575997/
Applying this patch to the Linux Zen Kernel version 6.6.8 makes blender work for me
Can confirm that this also happens on RX 7900 XTX + Ryzen 7950X on 6.6.8-arch1-1. Crashes when running blender and going to Edit -> Preferences. Also happens when running SVPManager. This leaves a zombie process that indefinitely blocks shutdown.
I bisected this, resulting in https://lists.freedesktop.org/archives/amd-gfx/2023-October/100298.html