ROCm: [Issue]: kernel NULL pointer dereference and device open freeze

Problem Description

I get “BUG: kernel NULL pointer dereference, address: 0000000000000000”, and clinfo freezes halfway through. After running a few other programs, all GPU-based applications freeze while starting. This started after upgrading from Ubuntu 23.04 and ROCm 5.7 to Ubuntu 23.10 and ROCm 5.7.1.

dmesg.txt. The problem also occurred on the next boot with the vboxdrv module blocklisted.

Operating System

Ubuntu 23.10 (Mantic Minotaur)

CPU

AMD Ryzen 9 5900X 12-Core Processor

GPU

RX 6650 XT

ROCm Version

5.7.1

ROCm Component

Kernel

Steps to Reproduce

  1. Run firefox, glxinfo, glxgears, vulkaninfo, and vkcube. These do not crash
  2. Run /opt/rocm/bin/clinfo. It crashes after the first bit of output, saying it cannot compile the program
  3. Run a script using transformers/pytorch-rocm
  4. At the same time, start gpt4all. On the previous version of Ubuntu and ROCm, it would crash the first time I ran it but work the second time
  5. While the transformers script is still running, run gpt4all again. Both freeze after this
  6. Now everything that needs the GPU freezes when it tries to open the device on startup
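
Because a frozen tool can also block the shell that started it, the checks in the steps above can be wrapped in a timeout. The following is a minimal sketch, not part of ROCm: the helper name `probe` and the 10-second default are illustrative assumptions. It starts a command, waits for it, and reports a hang without waiting on a process that may be stuck inside the kernel.

```python
import subprocess

def probe(cmd, timeout=10.0):
    """Run cmd and classify it as 'ok', 'failed', or 'hang'.

    Hypothetical helper for illustration. On a hang the child is left
    running: a process blocked inside the kernel may never react to a
    kill, and waiting for it would freeze this script as well.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        code = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        return "hang", proc
    return ("ok" if code == 0 else "failed"), proc
```

For example, `probe(["/opt/rocm/bin/clinfo"])` would be expected to return `"hang"` as its first element once the system is in the state described in step 6, whereas `probe(["glxinfo"])` should still finish.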

Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
(output suddenly freezes here)
(after reboot:)
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 9 5900X 12-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 9 5900X 12-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3700                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            24                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    32770204(0x1f4089c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32770204(0x1f4089c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32770204(0x1f4089c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1030                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6650 XT              
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
    L3:                      32768(0x8000) KB                   
  Chip ID:                 29679(0x73ef)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2765                               
  BDFID:                   12032                              
  Internal Node ID:        1                                  
  Compute Unit:            32                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 109                                
  SDMA engine uCode::      76                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1030         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

About this issue

  • State: open
  • Created 8 months ago
  • Reactions: 16
  • Comments: 74

Most upvoted comments

I get a slightly different stack trace that also ends in a NULL pointer dereference. It is reproducible on Arch Linux with an AMD Ryzen 7 6800H, and it happens when I switch Blender 4.1’s render engine to HIP.

Steps to reproduce:

  1. Download Blender 4.1.0 Alpha from https://builder.blender.org/download/daily/blender-4.1.0-alpha+main.1b6cd937ffc8-linux.x86_64-release.tar.xz
  2. Launch Blender and switch the render engine from Eevee to Cycles
  3. Open Preferences - System and switch the Cycles render device from CPU to HIP
  4. The kernel crashes

Kernel: 6.6.2-arch1-1 #1 SMP PREEMPT_DYNAMIC

ROCm version:

extra/rocm-hip-libraries 5.7.1-2 [installed]
    Develop certain applications using HIP and libraries for AMD platforms
extra/rocm-hip-runtime 5.7.1-2 [installed]
    Packages to run HIP applications on the AMD platform
extra/rocm-hip-sdk 5.7.1-2 [installed]
    Develop applications using HIP and libraries for AMD platforms
extra/rocprim 5.7.1-1 [installed]
    Header-only library providing HIP parallel primitives
extra/rocthrust 5.7.1-1 [installed]
    Port of the Thrust parallel algorithm library atop HIP/ROCm

dmesg:

Nov 26 01:10:31 code01 kernel: amdgpu 0000:62:00.0: amdgpu: bo 000000004fd46f03 va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000200
Nov 26 01:10:31 code01 kernel: amdgpu: Failed to map VA 0x800000000000 in vm. ret -22
Nov 26 01:10:31 code01 kernel: amdgpu: Failed to map bo to gpuvm
Nov 26 01:10:31 code01 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000002
Nov 26 01:10:31 code01 kernel: #PF: supervisor read access in kernel mode
Nov 26 01:10:31 code01 kernel: #PF: error_code(0x0000) - not-present page
Nov 26 01:10:31 code01 kernel: PGD 29631b067 P4D 29631b067 PUD 53a5da067 PMD 0 
Nov 26 01:10:31 code01 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Nov 26 01:10:31 code01 kernel: CPU: 2 PID: 27379 Comm: blender-4.1 Not tainted 6.6.2-arch1-1 #1 11215f9ba7ddfb51644674a5b2ced71612c62fe9
Nov 26 01:10:31 code01 kernel: Hardware name: MECHREVO Code01 Ver2.0/Code01 Ver2.0, BIOS 0016.006.9 08/23/2022
Nov 26 01:10:31 code01 kernel: RIP: 0010:__list_add_valid_or_report+0x1a/0xa0
Nov 26 01:10:31 code01 kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 d0 48 85 f6 74 2a 48 85 d2 74 3a 48 8b 52 08 48 39 f2 75 41 <4c> 8b 02 49 39 c0 75 4c 48 >
Nov 26 01:10:31 code01 kernel: RSP: 0018:ffffa3b295777af0 EFLAGS: 00010246
Nov 26 01:10:31 code01 kernel: RAX: ffff90d99373e350 RBX: ffffa3b295777b30 RCX: ffff90d9880a0000
Nov 26 01:10:31 code01 kernel: RDX: 0000000000000002 RSI: 0000000000000002 RDI: ffffa3b295777b30
Nov 26 01:10:31 code01 kernel: RBP: ffff90d99373e350 R08: 0000000000000040 R09: ffff90da178c7b00
Nov 26 01:10:31 code01 kernel: R10: 00000000000390a0 R11: ffff90d985064840 R12: ffff90d99373e340
Nov 26 01:10:31 code01 kernel: R13: 0000000000000002 R14: ffffa3b295777b30 R15: 0000000000000000
Nov 26 01:10:31 code01 kernel: FS:  00007f6d332ef580(0000) GS:ffff90e09ee80000(0000) knlGS:0000000000000000
Nov 26 01:10:31 code01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 26 01:10:31 code01 kernel: CR2: 0000000000000002 CR3: 00000003b1b9e000 CR4: 0000000000f50ee0
Nov 26 01:10:31 code01 kernel: PKRU: 55555554
Nov 26 01:10:31 code01 kernel: Call Trace:
Nov 26 01:10:31 code01 kernel:  <TASK>
Nov 26 01:10:31 code01 kernel:  ? __die+0x23/0x70
Nov 26 01:10:31 code01 kernel:  ? page_fault_oops+0x171/0x4e0
Nov 26 01:10:31 code01 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Nov 26 01:10:31 code01 kernel:  ? amdgpu_ttm_tt_populate+0x7c/0xb0 [amdgpu 0401721894ca8f32d5d0b424349ce03960632e80]
Nov 26 01:10:31 code01 kernel:  ? exc_page_fault+0x7f/0x180
Nov 26 01:10:31 code01 kernel:  ? asm_exc_page_fault+0x26/0x30
Nov 26 01:10:31 code01 kernel:  ? __list_add_valid_or_report+0x1a/0xa0
Nov 26 01:10:31 code01 kernel:  __mutex_add_waiter+0x23/0x60
Nov 26 01:10:31 code01 kernel:  __mutex_lock.constprop.0+0x2a4/0x6a0
Nov 26 01:10:31 code01 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Nov 26 01:10:31 code01 kernel:  add_kgd_mem_to_kfd_bo_list+0x23/0xa0 [amdgpu 0401721894ca8f32d5d0b424349ce03960632e80]
Nov 26 01:10:31 code01 kernel:  amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x660/0xa40 [amdgpu 0401721894ca8f32d5d0b424349ce03960632e80]
Nov 26 01:10:31 code01 kernel:  kfd_process_alloc_gpuvm+0x32/0x100 [amdgpu 0401721894ca8f32d5d0b424349ce03960632e80]
Nov 26 01:10:31 code01 kernel:  kfd_process_device_init_vm+0x267/0x320 [amdgpu 0401721894ca8f32d5d0b424349ce03960632e80]
Nov 26 01:10:31 code01 kernel:  kfd_ioctl_acquire_vm+0x89/0xc0 [amdgpu 0401721894ca8f32d5d0b424349ce03960632e80]
Nov 26 01:10:31 code01 kernel:  kfd_ioctl+0x3cc/0x4e0 [amdgpu 0401721894ca8f32d5d0b424349ce03960632e80]
Nov 26 01:10:31 code01 kernel:  ? __pfx_kfd_ioctl_acquire_vm+0x10/0x10 [amdgpu 0401721894ca8f32d5d0b424349ce03960632e80]
Nov 26 01:10:31 code01 kernel:  __x64_sys_ioctl+0x97/0xd0
Nov 26 01:10:31 code01 kernel:  do_syscall_64+0x60/0x90
Nov 26 01:10:31 code01 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Nov 26 01:10:31 code01 kernel:  ? syscall_exit_to_user_mode+0x2b/0x40
Nov 26 01:10:31 code01 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Nov 26 01:10:31 code01 kernel:  ? do_syscall_64+0x6c/0x90
Nov 26 01:10:31 code01 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Nov 26 01:10:31 code01 kernel:  ? syscall_exit_to_user_mode+0x2b/0x40
Nov 26 01:10:31 code01 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Nov 26 01:10:31 code01 kernel:  ? do_syscall_64+0x6c/0x90
Nov 26 01:10:31 code01 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Nov 26 01:10:31 code01 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Nov 26 01:10:31 code01 kernel:  ? syscall_exit_to_user_mode+0x2b/0x40
Nov 26 01:10:31 code01 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Nov 26 01:10:31 code01 kernel:  ? do_syscall_64+0x6c/0x90
Nov 26 01:10:31 code01 kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Nov 26 01:10:31 code01 kernel: RIP: 0033:0x7f6d32f2a3af
Nov 26 01:10:31 code01 kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 >
Nov 26 01:10:31 code01 kernel: RSP: 002b:00007ffd309e1fd0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 26 01:10:31 code01 kernel: RAX: ffffffffffffffda RBX: 00007ffd309e20c0 RCX: 00007f6d32f2a3af
Nov 26 01:10:31 code01 kernel: RDX: 00007ffd309e2140 RSI: 0000000040084b15 RDI: 0000000000000017
Nov 26 01:10:31 code01 kernel: RBP: 00007ffd309e2140 R08: 0000000000000015 R09: 0000000000000008
Nov 26 01:10:31 code01 kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 0000000040084b15
Nov 26 01:10:31 code01 kernel: R13: 0000000000000017 R14: 00007f6c22304560 R15: 00007f6d1c4ae180
Nov 26 01:10:31 code01 kernel:  </TASK>
Nov 26 01:10:31 code01 kernel: Modules linked in: ccm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device intel_rapl_msr intel_rapl_common snd_soc_acp6x_mach snd_soc_dmic snd_acp6x_pdm_dma snd_so>
Nov 26 01:10:31 code01 kernel:  videobuf2_common snd_soc_acpi ledtrig_audio mdio_devres sp5100_tco hid_multitouch snd cryptd ecdh_generic mc rapl pcspkr sparse_keymap wmi_bmof thunderbolt k10t>
Nov 26 01:10:31 code01 kernel: CR2: 0000000000000002
Nov 26 01:10:31 code01 kernel: ---[ end trace 0000000000000000 ]---
Nov 26 01:10:31 code01 kernel: RIP: 0010:__list_add_valid_or_report+0x1a/0xa0
Nov 26 01:10:31 code01 kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 d0 48 85 f6 74 2a 48 85 d2 74 3a 48 8b 52 08 48 39 f2 75 41 <4c> 8b 02 49 39 c0 75 4c 48 >
Nov 26 01:10:31 code01 kernel: RSP: 0018:ffffa3b295777af0 EFLAGS: 00010246
Nov 26 01:10:31 code01 kernel: RAX: ffff90d99373e350 RBX: ffffa3b295777b30 RCX: ffff90d9880a0000
Nov 26 01:10:31 code01 kernel: RDX: 0000000000000002 RSI: 0000000000000002 RDI: ffffa3b295777b30
Nov 26 01:10:31 code01 kernel: RBP: ffff90d99373e350 R08: 0000000000000040 R09: ffff90da178c7b00
Nov 26 01:10:31 code01 kernel: R10: 00000000000390a0 R11: ffff90d985064840 R12: ffff90d99373e340
Nov 26 01:10:31 code01 kernel: R13: 0000000000000002 R14: ffffa3b295777b30 R15: 0000000000000000
Nov 26 01:10:31 code01 kernel: FS:  00007f6d332ef580(0000) GS:ffff90e09ee80000(0000) knlGS:0000000000000000
Nov 26 01:10:31 code01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 26 01:10:31 code01 kernel: CR2: 0000000000000002 CR3: 00000003b1b9e000 CR4: 0000000000f50ee0
Nov 26 01:10:31 code01 kernel: PKRU: 55555554
Nov 26 01:10:31 code01 kernel: note: blender-4.1[27379] exited with irqs disabled
Nov 26 01:10:31 code01 kernel: note: blender-4.1[27379] exited with preempt_count 2
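
A trace like the one above is easier to compare with other reports once it is reduced to the BUG line, the faulting kernel symbol, and the call-trace frames that are not prefixed with `?` (the kernel marks frames it could not reliably confirm that way). The sketch below is one way to do that. It assumes journal-style lines with a `... kernel: ` prefix, as in this log, and the function name `summarize_oops` is only illustrative.

```python
import re

# Strips a journal prefix such as "Nov 26 01:10:31 code01 kernel: ".
PREFIX = re.compile(r"^.*? kernel: ")
# Matches " symbol+0x1a/0xa0" with an optional leading "? " marker.
FRAME = re.compile(r"^\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")

def summarize_oops(lines):
    """Return (BUG line, faulting kernel symbol, unmarked call-trace frames)."""
    bug, rip, frames = None, None, []
    in_trace = False
    for raw in lines:
        line = PREFIX.sub("", raw.rstrip("\n"), count=1)
        if line.startswith("BUG:") and bug is None:
            bug = line
        elif line.startswith("RIP: 0010:") and rip is None:
            # 0010 is the kernel code segment; keep only the symbol name.
            rip = line.split(":", 2)[2].split("+")[0]
        elif line.strip() == "<TASK>":
            in_trace = True
        elif line.strip() == "</TASK>":
            in_trace = False
        elif in_trace:
            m = FRAME.match(line)
            # Skip frames marked "?": they are not reliable.
            if m and not m.group(1):
                frames.append(m.group(2))
    return bug, rip, frames
```

Applied to the log above, the unmarked frames run from `__mutex_add_waiter` through `kfd_ioctl_acquire_vm` to `entry_SYSCALL_64_after_hwframe`, which suggests the fault is hit while taking a mutex during KFD VM setup.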

clinfo:

Number of platforms:				 1
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 2.1 AMD-APP.dbg (3590.0)
  Platform Name:				 AMD Accelerated Parallel Processing
  Platform Vendor:				 Advanced Micro Devices, Inc.
  Platform Extensions:				 cl_khr_icd cl_amd_event_callback 


  Platform Name:				 AMD Accelerated Parallel Processing
Number of devices:				 1
  Device Type:					 CL_DEVICE_TYPE_GPU
  Vendor ID:					 1002h
  Board name:					 AMD Radeon Graphics
  Device Topology:				 PCI[ B#98, D#0, F#0 ]
  Max compute units:				 6
  Max work items dimensions:			 3
    Max work items[0]:				 1024
    Max work items[1]:				 1024
    Max work items[2]:				 1024
  Max work group size:				 256
  Preferred vector width char:			 4
  Preferred vector width short:			 2
  Preferred vector width int:			 1
  Preferred vector width long:			 1
  Preferred vector width float:			 1
  Preferred vector width double:		 1
  Native vector width char:			 4
  Native vector width short:			 2
  Native vector width int:			 1
  Native vector width long:			 1
  Native vector width float:			 1
  Native vector width double:			 1
  Max clock frequency:				 2200Mhz
  Address bits:					 64
  Max memory allocation:			 912680544
  Image support:				 Yes
  Max number of images read arguments:		 128
  Max number of images write arguments:		 8
  Max image 2D width:				 16384
  Max image 2D height:				 16384
  Max image 3D width:				 16384
  Max image 3D height:				 16384
  Max image 3D depth:				 8192
  Max samplers within kernel:			 16
  Max size of kernel argument:			 1024
  Alignment (bits) of base address:		 1024
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 Yes
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 Yes
    Round to +ve and infinity:			 Yes
    IEEE754-2008 fused multiply-add:		 Yes
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 16384
  Global memory size:				 1073741824
  Constant buffer size:				 912680544
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 65536
  Max pipe arguments:				 16
  Max pipe active reservations:			 16
  Max pipe packet size:				 912680544
  Max global variable size:			 912680544
  Max global variable preferred total size:	 1073741824
  Max read/write image args:			 64
  Max on device events:				 1024
  Queue on device max size:			 8388608
  Max on device queues:				 1
  Queue on device preferred size:		 262144
  SVM capabilities:				 
    Coarse grain buffer:			 Yes
    Fine grain buffer:				 Yes
    Fine grain system:				 No
    Atomics:					 No
  Preferred platform atomic alignment:		 0
  Preferred global atomic alignment:		 0
  Preferred local atomic alignment:		 0
  Kernel Preferred work group size multiple:	 32
  Error correction support:			 0
  Unified memory for Host and Device:		 0
  Profiling timer resolution:			 1
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				 
    Execute OpenCL kernels:			 Yes
    Execute native function:			 No
  Queue on Host properties:				 
    Out-of-Order:				 No
    Profiling :					 Yes
  Queue on Device properties:				 
    Out-of-Order:				 Yes
    Profiling :					 Yes
  Platform ID:					 0x7fe994e0f010
  Name:						 gfx1035
  Vendor:					 Advanced Micro Devices, Inc.
  Device OpenCL C version:			 OpenCL C 2.0 
  Driver version:				 3590.0 (HSA1.1,LC)
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 2.0 
  Extensions:					 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 

rocminfo --support: rocminfo.txt

The situation is worse on Linux 6.6-rc7:

Oct 23 07:17:04 daniel-desktop3 kernel: amdgpu 0000:2f:00.0: amdgpu: bo 00000000b0b9bc73 va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000200
Oct 23 07:17:04 daniel-desktop3 kernel: amdgpu: Failed to map VA 0x800000000000 in vm. ret -22
Oct 23 07:17:04 daniel-desktop3 kernel: amdgpu: Failed to map bo to gpuvm
Oct 23 07:17:04 daniel-desktop3 kernel: ------------[ cut here ]------------
Oct 23 07:17:04 daniel-desktop3 kernel: refcount_t: addition on 0; use-after-free.
Oct 23 07:17:04 daniel-desktop3 kernel: WARNING: CPU: 20 PID: 88093 at lib/refcount.c:25 refcount_warn_saturate+0x12e/0x150
Oct 23 07:17:04 daniel-desktop3 kernel: Modules linked in:
Oct 23 07:17:04 daniel-desktop3 kernel: CPU: 20 PID: 88093 Comm: clinfo Not tainted 6.6.0-rc7 #2
Oct 23 07:17:04 daniel-desktop3 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C37/X570-A PRO (MS-7C37), BIOS H.I0 08/10/2022
Oct 23 07:17:04 daniel-desktop3 kernel: RIP: 0010:refcount_warn_saturate+0x12e/0x150
Oct 23 07:17:04 daniel-desktop3 kernel: Code: 1d b3 85 70 02 80 fb 01 0f 87 d7 2f 23 01 83 e3 01 0f 85 52 ff ff ff 48 c7 c7 18 24 91 94 c6 05 93 85 70 02 01 e8 02 f4 8e >
Oct 23 07:17:04 daniel-desktop3 kernel: RSP: 0018:ffffc900066cbc60 EFLAGS: 00010246
Oct 23 07:17:04 daniel-desktop3 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Oct 23 07:17:04 daniel-desktop3 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Oct 23 07:17:04 daniel-desktop3 kernel: RBP: ffffc900066cbc68 R08: 0000000000000000 R09: 0000000000000000
Oct 23 07:17:04 daniel-desktop3 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Oct 23 07:17:04 daniel-desktop3 kernel: R13: ffff888100f62338 R14: 0000000000000000 R15: ffff8881661e8000
Oct 23 07:17:04 daniel-desktop3 kernel: FS:  00007fa3ac472740(0000) GS:ffff8887fef00000(0000) knlGS:0000000000000000
Oct 23 07:17:04 daniel-desktop3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 23 07:17:04 daniel-desktop3 kernel: CR2: 00005573ed2b1000 CR3: 000000014ae64000 CR4: 0000000000750ee0
Oct 23 07:17:04 daniel-desktop3 kernel: PKRU: 55555554
Oct 23 07:17:04 daniel-desktop3 kernel: Call Trace:
Oct 23 07:17:04 daniel-desktop3 kernel:  <TASK>
Oct 23 07:17:04 daniel-desktop3 kernel:  ? show_regs+0x6d/0x80
Oct 23 07:17:04 daniel-desktop3 kernel:  ? __warn+0x89/0x160
Oct 23 07:17:04 daniel-desktop3 kernel:  ? refcount_warn_saturate+0x12e/0x150
Oct 23 07:17:04 daniel-desktop3 kernel:  ? report_bug+0x17e/0x1b0
Oct 23 07:17:04 daniel-desktop3 kernel:  ? handle_bug+0x51/0xa0
Oct 23 07:17:04 daniel-desktop3 kernel:  ? exc_invalid_op+0x18/0x80
Oct 23 07:17:04 daniel-desktop3 kernel:  ? asm_exc_invalid_op+0x1b/0x20
Oct 23 07:17:04 daniel-desktop3 kernel:  ? refcount_warn_saturate+0x12e/0x150
Oct 23 07:17:04 daniel-desktop3 kernel:  dma_resv_add_fence+0x1f0/0x240
Oct 23 07:17:04 daniel-desktop3 kernel:  amdgpu_amdkfd_gpuvm_acquire_process_vm+0x223/0x540
Oct 23 07:17:04 daniel-desktop3 kernel:  kfd_process_device_init_vm+0xc0/0x320
Oct 23 07:17:04 daniel-desktop3 kernel:  ? kfd_ioctl_get_process_apertures_new+0x190/0x380
Oct 23 07:17:04 daniel-desktop3 kernel:  kfd_ioctl_acquire_vm+0x96/0xd0
Oct 23 07:17:04 daniel-desktop3 kernel:  kfd_ioctl+0x44a/0x580
Oct 23 07:17:04 daniel-desktop3 kernel:  ? __pfx_kfd_ioctl_acquire_vm+0x10/0x10
Oct 23 07:17:04 daniel-desktop3 kernel:  __x64_sys_ioctl+0xa3/0xf0
Oct 23 07:17:04 daniel-desktop3 kernel:  do_syscall_64+0x5c/0x90
Oct 23 07:17:04 daniel-desktop3 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Oct 23 07:17:04 daniel-desktop3 kernel:  ? syscall_exit_to_user_mode+0x37/0x60
Oct 23 07:17:04 daniel-desktop3 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Oct 23 07:17:04 daniel-desktop3 kernel:  ? do_syscall_64+0x68/0x90
Oct 23 07:17:04 daniel-desktop3 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Oct 23 07:17:04 daniel-desktop3 kernel:  ? do_syscall_64+0x68/0x90
Oct 23 07:17:04 daniel-desktop3 kernel:  ? do_syscall_64+0x68/0x90
Oct 23 07:17:04 daniel-desktop3 kernel:  ? do_syscall_64+0x68/0x90
Oct 23 07:17:04 daniel-desktop3 kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Oct 23 07:17:04 daniel-desktop3 kernel: RIP: 0033:0x7fa3abf238ef
Oct 23 07:17:04 daniel-desktop3 kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f >
Oct 23 07:17:04 daniel-desktop3 kernel: RSP: 002b:00007ffc754e4830 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Oct 23 07:17:04 daniel-desktop3 kernel: RAX: ffffffffffffffda RBX: 00007ffc754e49a0 RCX: 00007fa3abf238ef
Oct 23 07:17:04 daniel-desktop3 kernel: RDX: 00007ffc754e49a0 RSI: 0000000040084b15 RDI: 0000000000000008
Oct 23 07:17:04 daniel-desktop3 kernel: RBP: 0000000040084b15 R08: 0000000000000007 R09: 0000000000000001
Oct 23 07:17:04 daniel-desktop3 kernel: R10: 00005573ed285ac0 R11: 0000000000000246 R12: 00005573ed281250
Oct 23 07:17:04 daniel-desktop3 kernel: R13: 0000000000000008 R14: 00007fa39d0bd040 R15: 0000000000000060
Oct 23 07:17:04 daniel-desktop3 kernel:  </TASK>
Oct 23 07:17:04 daniel-desktop3 kernel: ---[ end trace 0000000000000000 ]---

The TTY now shows the first three lines of the log above. After a few moments the screen freezes, and on one occasion the SSH session froze as well. With the new kernel, not only is ROCm unusable and the computer unable to shut down, but the entire machine becomes unusable once those moments pass.
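
Both traces above fail inside the same system call. In the user-space register dumps, `ORIG_RAX` is 0x10 (syscall 16, which is `ioctl` on x86-64) and `RSI` holds the request number 0x40084b15. As a small worked example, the request number can be split into its fields using the generic Linux ioctl encoding. The function name `decode_ioctl` is illustrative.

```python
def decode_ioctl(request):
    """Split a Linux ioctl request number into its fields.

    Uses the generic layout from include/uapi/asm-generic/ioctl.h, which
    x86-64 follows: 8 bits nr, 8 bits type, 14 bits size, 2 bits direction.
    """
    direction = {0: "none", 1: "write", 2: "read", 3: "read/write"}
    return {
        "dir": direction[(request >> 30) & 0x3],
        "size": (request >> 16) & 0x3FFF,
        "type": chr((request >> 8) & 0xFF),
        "nr": request & 0xFF,
    }

# RSI value taken from the register dumps in the traces above.
fields = decode_ioctl(0x40084B15)
print(fields)
```

This gives a write-direction request of type `'K'`, number 0x15, with an 8-byte argument. As far as I understand the KFD headers, `'K'` is the magic used by the `/dev/kfd` interface, which is consistent with the `kfd_ioctl_acquire_vm` frame in both call traces.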

The exact same problem also happens on an RX 6600 + Ryzen 5 7600. When Blender 4.0 tries to load anything ROCm-related, there is:

  • an 80% chance of an instant kernel panic,
  • a 15% chance of a kernel oops,
  • a 5% chance that 2 CPU cores get “stuck” and the system then freezes completely (although the kernel is still alive).

This seems to affect only AMD CPUs; I don’t see any Intel CPUs around here…

dmesg.txt rocminfo.txt

I had to compile the kernel on Arch to get it working. That will also work for the Gentoo folks, but having it fixed in the mainline Linux kernel would fix it for everyone.

I pinged Christian to see if there is any update. I’ll post here (or get him to update the amdgfx thread) once I hear back.

I just upgraded the linux package to 6.7.2.arch1-1 on Arch Linux and the issue seems to be resolved, at least for Blender.

Maybe this patch would help, rather than just reverting that commit? https://patchwork.freedesktop.org/patch/575997/

Do we know which Linux kernel version this patch is for? I’m trying to find out, but I am just confused.

@OzzyHelix It can be applied to the newest linux-next (next-20240125). It seems to work on my side and also seems to fix the ROCm-OpenGL interop problem when using Blender.

If this can get merged into the Linux kernel, the issue should be resolved for a lot of folks.

For me, vm_update_mode=3 made it worse: processes now also stay stuck until I reset my PC. (Arch Linux + 5900X CPU + 5700 XT GPU)

I found a workaround! Things work without crashing with GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.runpm=0 amdgpu.vm_update_mode=3" in /etc/default/grub. I updated to ROCm 6.0.
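
On Ubuntu, a change to /etc/default/grub only takes effect after running `sudo update-grub` and rebooting, so it is worth confirming that the running kernel actually received the options. The sketch below parses a kernel command line and extracts the `amdgpu.*` options. The function names are illustrative, and reading `/proc/cmdline` is the only system interface it relies on.

```python
import shlex

def amdgpu_params(cmdline):
    """Extract amdgpu.* options from a kernel command line string."""
    params = {}
    for token in shlex.split(cmdline):
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token[len("amdgpu."):].split("=", 1)
            params[key] = value
    return params

def running_params(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return amdgpu_params(f.read())
```

With the workaround active, `running_params()` should contain `runpm` set to `"0"` and `vm_update_mode` set to `"3"`. If either is missing, the GRUB configuration was probably not regenerated.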

Thank you @terryrankine for https://github.com/ROCm/ROCm/issues/2766#issuecomment-1867179386 . Btw, the “what is actually happening” might be explained by https://lists.freedesktop.org/archives/amd-gfx/2023-October/100322.html .

I’ve also noticed that I can reproduce GCVM_L2_PROTECTION_FAULT_STATUS if I tell ROCm to use the wrong GPU model. My GPU uses HSA_OVERRIDE_GFX_VERSION=10.3.0, but if I set HSA_OVERRIDE_GFX_VERSION=11.0.0 or something, I get those errors. Docker is another environment where I get different errors in dmesg, so there seems to be a large userspace aspect to this bug (although userspace shouldn’t be able to freeze kernelspace).
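
For reference, the two override values above appear to correspond to the gfx1030 and gfx1100 targets. The sketch below converts a gfx target name into the `major.minor.stepping` form used by HSA_OVERRIDE_GFX_VERSION. It rests on an assumption about the naming convention (the last two characters are hexadecimal minor and stepping digits, everything before them is the decimal major version), and the function name is made up for illustration.

```python
import re

def gfx_to_override(target):
    """Convert a gfx target name such as 'gfx1030' to 'major.minor.stepping'.

    Assumption: the last two characters are hexadecimal minor and stepping
    digits, and the digits before them are the decimal major version.
    """
    m = re.fullmatch(r"gfx(\d+)([0-9a-f])([0-9a-f])", target)
    if not m:
        raise ValueError(f"unrecognized gfx target: {target!r}")
    major, minor, stepping = m.groups()
    return f"{int(major)}.{int(minor, 16)}.{int(stepping, 16)}"
```

Comparing the result for the target that the software was built for against the value actually exported in the environment is a quick way to rule out a mismatched override before blaming the kernel.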

PyTorch

Stable Diffusion and LlamaIndex are working after recompiling pytorch-rocm and bitsandbytes-rocm. https://github.com/ROCmSoftwarePlatform/pytorch/issues/1340

Hashcat

The Ubuntu repos’ version doesn’t work, as seen with hashcat -b. It might work if I recompiled it, but I have no use case that would justify trying.

llama.cpp

llama.cpp and gpt4all work after putting target_compile_options(ggml-rocm PRIVATE --offload-arch=gfx1030) in some CMakeLists.txt. Debug mode is now so slow that it looks frozen. It resembles #2625 and https://github.com/ROCm/ROCK-Kernel-Driver/issues/153 when backtracing in gdb or running with AMD_LOG_LEVEL=5 HSA_ENABLE_SDMA=0, but the freeze is an illusion. Release mode is somewhat fine. I am worried that the performance is not as good as it was before: I was getting 47 tokens/s in April, and now it sometimes drops to 30 tokens/s.

OpenCL

clinfo doesn’t freeze the computer anymore, but I haven’t tested any OpenCL programs. Other people should test Blender, darktable, and DaVinci Resolve. I want to get into those programs and I have a few installed, but I haven’t had enough time to learn how to use them properly.

Kernel

I will not close the issue yet. Blender and other apps still need to be tested by other people. Userspace should not be able to crash kernelspace with default options. This workaround apparently moves VM page-table updates to the CPU, so it will slow things down, and the root cause still needs to be addressed, as someone mentioned on the mailing list.

Updating here for clarity. Christian is taking a look at the issue internally. Seems like some of the page tables aren’t CPU accessible.

@GZGavinZhao using daily Blender 4.1.0 beta does in fact fix my issue, thanks for suggesting! I’d tried their build of 4.0.2 and gave up because it crashed with the same issue.

I have it too. When I run clinfo, nothing is output, but dmesg shows severe kernel issues. […]

* Kernel: 6.7-pf5
* SoC: AMD 7840U (Zen 4 “Phoenix”, integrated graphics: Radeon 780M)
* ROCm version: 6.0.2 (via opencl-amd AUR package)
* […]

Attached a dmesg output from after issuing clinfo: winmax2-clinfo-error.dmesg.log. […]

Fixed for me with vanilla kernel 6.7.4.

The issue appears to be fixed with the linux-zen package, version 6.7.2-zen1-1-zen, on Arch Linux, but I don’t know whether it will return.

Can confirm, after upgrading to 6.7.2, the issue is gone completely. Everything now seems to work! I’ve only tested Blender, but I’m assuming that it works fine for other compute loads too.

Update: Blender often crashes when it’s trying to render with an iGPU + dGPU. Individual GPU rendering works perfectly well so far!

Memory access fault by GPU node-2 (Agent handle: 0x781767d00400) on address 0x7815cb5fe000. Reason: Page not present or supervisor privilege.
zsh: IOT instruction (core dumped) blender

dmesg:

[ 3301.257727] amdgpu 0000:11:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:8 pasid:32802, for process blender pid 9571 thread blender pid 9571)
[ 3301.257732] amdgpu 0000:11:00.0: amdgpu:   in page starting at address 0x000078154cf9f000 from client 0x1b (UTCL2)
[ 3301.257735] amdgpu 0000:11:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00841051
[ 3301.257737] amdgpu 0000:11:00.0: amdgpu: 	Faulty UTCL2 client ID: TCP (0x8)
[ 3301.257738] amdgpu 0000:11:00.0: amdgpu: 	MORE_FAULTS: 0x1
[ 3301.257739] amdgpu 0000:11:00.0: amdgpu: 	WALKER_ERROR: 0x0
[ 3301.257740] amdgpu 0000:11:00.0: amdgpu: 	PERMISSION_FAULTS: 0x5
[ 3301.257741] amdgpu 0000:11:00.0: amdgpu: 	MAPPING_ERROR: 0x0
[ 3301.257742] amdgpu 0000:11:00.0: amdgpu: 	RW: 0x1
[ 3301.257747] amdgpu 0000:11:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:8 pasid:32802, for process blender pid 9571 thread blender pid 9571)
[ 3301.257749] amdgpu 0000:11:00.0: amdgpu:   in page starting at address 0x00007817c7a5c000 from client 0x1b (UTCL2)
[ 3301.257751] amdgpu 0000:11:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 3301.257752] amdgpu 0000:11:00.0: amdgpu: 	Faulty UTCL2 client ID: CB/DB (0x0)
[ 3301.257754] amdgpu 0000:11:00.0: amdgpu: 	MORE_FAULTS: 0x0
[ 3301.257755] amdgpu 0000:11:00.0: amdgpu: 	WALKER_ERROR: 0x0
[ 3301.257756] amdgpu 0000:11:00.0: amdgpu: 	PERMISSION_FAULTS: 0x0
[ 3301.257757] amdgpu 0000:11:00.0: amdgpu: 	MAPPING_ERROR: 0x0
[ 3301.257758] amdgpu 0000:11:00.0: amdgpu: 	RW: 0x0
[ 3301.257762] amdgpu 0000:11:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:8 pasid:32802, for process blender pid 9571 thread blender pid 9571)
[ 3301.257764] amdgpu 0000:11:00.0: amdgpu:   in page starting at address 0x00007817c7abc000 from client 0x1b (UTCL2)
[ 3301.257765] amdgpu 0000:11:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 3301.257767] amdgpu 0000:11:00.0: amdgpu: 	Faulty UTCL2 client ID: CB/DB (0x0)
[ 3301.257768] amdgpu 0000:11:00.0: amdgpu: 	MORE_FAULTS: 0x0
[ 3301.257769] amdgpu 0000:11:00.0: amdgpu: 	WALKER_ERROR: 0x0
[ 3301.257770] amdgpu 0000:11:00.0: amdgpu: 	PERMISSION_FAULTS: 0x0
[ 3301.257771] amdgpu 0000:11:00.0: amdgpu: 	MAPPING_ERROR: 0x0
[ 3301.257772] amdgpu 0000:11:00.0: amdgpu: 	RW: 0x0
[ 3301.257776] amdgpu 0000:11:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:8 pasid:32802, for process blender pid 9571 thread blender pid 9571)
[ 3301.257778] amdgpu 0000:11:00.0: amdgpu:   in page starting at address 0x00007817c7a5d000 from client 0x1b (UTCL2)
[ 3301.257779] amdgpu 0000:11:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 3301.257780] amdgpu 0000:11:00.0: amdgpu: 	Faulty UTCL2 client ID: CB/DB (0x0)
[ 3301.257781] amdgpu 0000:11:00.0: amdgpu: 	MORE_FAULTS: 0x0
[ 3301.257782] amdgpu 0000:11:00.0: amdgpu: 	WALKER_ERROR: 0x0
[ 3301.257783] amdgpu 0000:11:00.0: amdgpu: 	PERMISSION_FAULTS: 0x0
[ 3301.257784] amdgpu 0000:11:00.0: amdgpu: 	MAPPING_ERROR: 0x0
[ 3301.257785] amdgpu 0000:11:00.0: amdgpu: 	RW: 0x0
[ 3301.257789] amdgpu 0000:11:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:8 pasid:32802, for process blender pid 9571 thread blender pid 9571)
[ 3301.257791] amdgpu 0000:11:00.0: amdgpu:   in page starting at address 0x00007817c7abd000 from client 0x1b (UTCL2)
[ 3301.257792] amdgpu 0000:11:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 3301.257793] amdgpu 0000:11:00.0: amdgpu: 	Faulty UTCL2 client ID: CB/DB (0x0)
[ 3301.257794] amdgpu 0000:11:00.0: amdgpu: 	MORE_FAULTS: 0x0
[ 3301.257795] amdgpu 0000:11:00.0: amdgpu: 	WALKER_ERROR: 0x0
[ 3301.257796] amdgpu 0000:11:00.0: amdgpu: 	PERMISSION_FAULTS: 0x0
[ 3301.257797] amdgpu 0000:11:00.0: amdgpu: 	MAPPING_ERROR: 0x0
[ 3301.257798] amdgpu 0000:11:00.0: amdgpu: 	RW: 0x0
[ 3301.257803] amdgpu 0000:11:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:8 pasid:32802, for process blender pid 9571 thread blender pid 9571)
[ 3301.257805] amdgpu 0000:11:00.0: amdgpu:   in page starting at address 0x00007815cb5fe000 from client 0x1b (UTCL2)
[ 3301.257806] amdgpu 0000:11:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 3301.257807] amdgpu 0000:11:00.0: amdgpu: 	Faulty UTCL2 client ID: CB/DB (0x0)
[ 3301.257809] amdgpu 0000:11:00.0: amdgpu: 	MORE_FAULTS: 0x0
[ 3301.257810] amdgpu 0000:11:00.0: amdgpu: 	WALKER_ERROR: 0x0
[ 3301.257811] amdgpu 0000:11:00.0: amdgpu: 	PERMISSION_FAULTS: 0x0
[ 3301.257812] amdgpu 0000:11:00.0: amdgpu: 	MAPPING_ERROR: 0x0
[ 3301.257813] amdgpu 0000:11:00.0: amdgpu: 	RW: 0x0
[ 3301.257817] amdgpu 0000:11:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:8 pasid:32802, for process blender pid 9571 thread blender pid 9571)
[ 3301.257819] amdgpu 0000:11:00.0: amdgpu:   in page starting at address 0x00007815cb61e000 from client 0x1b (UTCL2)
[ 3301.257821] amdgpu 0000:11:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 3301.257822] amdgpu 0000:11:00.0: amdgpu: 	Faulty UTCL2 client ID: CB/DB (0x0)
[ 3301.257823] amdgpu 0000:11:00.0: amdgpu: 	MORE_FAULTS: 0x0
[ 3301.257824] amdgpu 0000:11:00.0: amdgpu: 	WALKER_ERROR: 0x0
[ 3301.257825] amdgpu 0000:11:00.0: amdgpu: 	PERMISSION_FAULTS: 0x0
[ 3301.257826] amdgpu 0000:11:00.0: amdgpu: 	MAPPING_ERROR: 0x0
[ 3301.257827] amdgpu 0000:11:00.0: amdgpu: 	RW: 0x0
[ 3301.257831] amdgpu 0000:11:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:8 pasid:32802, for process blender pid 9571 thread blender pid 9571)
[ 3301.257833] amdgpu 0000:11:00.0: amdgpu:   in page starting at address 0x00007815cb6de000 from client 0x1b (UTCL2)
[ 3301.257835] amdgpu 0000:11:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 3301.257836] amdgpu 0000:11:00.0: amdgpu: 	Faulty UTCL2 client ID: CB/DB (0x0)
[ 3301.257837] amdgpu 0000:11:00.0: amdgpu: 	MORE_FAULTS: 0x0
[ 3301.257838] amdgpu 0000:11:00.0: amdgpu: 	WALKER_ERROR: 0x0
[ 3301.257839] amdgpu 0000:11:00.0: amdgpu: 	PERMISSION_FAULTS: 0x0
[ 3301.257840] amdgpu 0000:11:00.0: amdgpu: 	MAPPING_ERROR: 0x0
[ 3301.257841] amdgpu 0000:11:00.0: amdgpu: 	RW: 0x0
[ 3301.257845] amdgpu 0000:11:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:8 pasid:32802, for process blender pid 9571 thread blender pid 9571)
[ 3301.257847] amdgpu 0000:11:00.0: amdgpu:   in page starting at address 0x00007815cb6be000 from client 0x1b (UTCL2)
[ 3301.257849] amdgpu 0000:11:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 3301.257850] amdgpu 0000:11:00.0: amdgpu: 	Faulty UTCL2 client ID: CB/DB (0x0)
[ 3301.257851] amdgpu 0000:11:00.0: amdgpu: 	MORE_FAULTS: 0x0
[ 3301.257852] amdgpu 0000:11:00.0: amdgpu: 	WALKER_ERROR: 0x0
[ 3301.257853] amdgpu 0000:11:00.0: amdgpu: 	PERMISSION_FAULTS: 0x0
[ 3301.257854] amdgpu 0000:11:00.0: amdgpu: 	MAPPING_ERROR: 0x0
[ 3301.257855] amdgpu 0000:11:00.0: amdgpu: 	RW: 0x0
[ 3301.257860] amdgpu 0000:11:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:8 pasid:32802, for process blender pid 9571 thread blender pid 9571)
[ 3301.257861] amdgpu 0000:11:00.0: amdgpu:   in page starting at address 0x000078154cf9f000 from client 0x1b (UTCL2)
[ 3301.257863] amdgpu 0000:11:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 3301.257864] amdgpu 0000:11:00.0: amdgpu: 	Faulty UTCL2 client ID: CB/DB (0x0)
[ 3301.257865] amdgpu 0000:11:00.0: amdgpu: 	MORE_FAULTS: 0x0
[ 3301.257866] amdgpu 0000:11:00.0: amdgpu: 	WALKER_ERROR: 0x0
[ 3301.257867] amdgpu 0000:11:00.0: amdgpu: 	PERMISSION_FAULTS: 0x0
[ 3301.257868] amdgpu 0000:11:00.0: amdgpu: 	MAPPING_ERROR: 0x0
[ 3301.257869] amdgpu 0000:11:00.0: amdgpu: 	RW: 0x0
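For anyone comparing fault logs, the per-field lines that follow each GCVM_L2_PROTECTION_FAULT_STATUS value can be reproduced from the raw register value. The sketch below uses a bit layout that is an assumption on my part; it was chosen because it reproduces the fields printed for 0x00841051 in the log above (MORE_FAULTS 0x1, PERMISSION_FAULTS 0x5, client ID 0x8, RW 0x1), and it may differ on other GPU generations.

```python
# Minimal sketch: split a GCVM_L2_PROTECTION_FAULT_STATUS value into fields.
# The (shift, width) layout below is an assumption that happens to match the
# values printed in the dmesg output above; it is not taken from driver docs.

FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # printed as "Faulty UTCL2 client ID" in the log
    "RW": (18, 1),
}


def decode_fault_status(status: int) -> dict:
    """Extract each field by shifting and masking the raw register value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


for name, value in decode_fault_status(0x00841051).items():
    print(f"{name}: {value:#x}")
```

In the log, only the first fault carries a non-zero status. The following entries all read 0x00000000, so their decoded fields are all zero and carry no extra information beyond the faulting address.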

Maybe this patch would help, rather than just reverting that commit? https://patchwork.freedesktop.org/patch/575997/

(crossposted from drm/amd issue for anyone looking for a workaround)

I am on NixOS with Linux 6.6.8, with an RX 7900 XT and a Ryzen 9 7900X. I’ve applied the following patch, based on the commit that rcrisostomo bisected (96c211f1f9ef82183493f4ceed4e347b52849149):

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
index 62b205dac..efb05acea 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
@@ -330,12 +330,6 @@ static void kfd_init_apertures_vi(struct kfd_process_device *pdd, uint8_t id)
        pdd->gpuvm_limit =
                pdd->dev->kfd->shared_resources.gpuvm_size - 1;

-       /* dGPUs: the reserved space for kernel
-        * before SVM
-        */
-       pdd->qpd.cwsr_base = SVM_CWSR_BASE;
-       pdd->qpd.ib_base = SVM_IB_BASE;
-
        pdd->scratch_base = MAKE_SCRATCH_APP_BASE_VI();
        pdd->scratch_limit = MAKE_SCRATCH_APP_LIMIT(pdd->scratch_base);
 }
@@ -345,18 +339,18 @@ static void kfd_init_apertures_v9(struct kfd_process_device *pdd, uint8_t id)
        pdd->lds_base = MAKE_LDS_APP_BASE_V9();
        pdd->lds_limit = MAKE_LDS_APP_LIMIT(pdd->lds_base);

-       pdd->gpuvm_base = PAGE_SIZE;
+       /* Raven needs SVM to support graphic handle, etc. Leave the small
+        * reserved space before SVM on Raven as well, even though we don't
+        * have to.
+        * Set gpuvm_base and gpuvm_limit to CANONICAL addresses so that they
+        * are used in Thunk to reserve SVM.
+        */
+       pdd->gpuvm_base = SVM_USER_BASE;
        pdd->gpuvm_limit =
                pdd->dev->kfd->shared_resources.gpuvm_size - 1;

        pdd->scratch_base = MAKE_SCRATCH_APP_BASE_V9();
        pdd->scratch_limit = MAKE_SCRATCH_APP_LIMIT(pdd->scratch_base);
-
-       /*
-        * Place TBA/TMA on opposite side of VM hole to prevent
-        * stray faults from triggering SVM on these pages.
-        */
-       pdd->qpd.cwsr_base = pdd->dev->kfd->shared_resources.gpuvm_size;
 }

 int kfd_init_apertures(struct kfd_process *process)
@@ -413,6 +407,12 @@ int kfd_init_apertures(struct kfd_process *process)
                                        return -EINVAL;
                                }
                        }
+
+                       /* dGPUs: the reserved space for kernel
+                        * before SVM
+                        */
+                       pdd->qpd.cwsr_base = SVM_CWSR_BASE;
+                       pdd->qpd.ib_base = SVM_IB_BASE;
                }

                dev_dbg(kfd_device, "node id %u\n", id);

After applying this patch to 6.6.8 the issue is fixed for me, with Blender correctly recognizing my GPU and being able to render with Cycles.

Applying this patch to the Linux Zen kernel, version 6.6.8, makes Blender work for me.

Can confirm that this also happens on an RX 7900 XTX + Ryzen 7950X on 6.6.8-arch1-1. It crashes when running Blender and going to Edit -> Preferences. It also happens when running SVPManager. This leaves a zombie process that blocks shutdown indefinitely.

[39435.607525] amdgpu 0000:03:00.0: amdgpu: bo 00000000da340e9d va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000840
[39435.607529] amdgpu: Failed to map VA 0x800000000000 in vm. ret -22
[39435.607530] amdgpu: Failed to map bo to gpuvm
[39435.613925] BUG: kernel NULL pointer dereference, address: 0000000000000008
[39435.613927] #PF: supervisor read access in kernel mode
[39435.613929] #PF: error_code(0x0000) - not-present page
[39435.613930] PGD 0 P4D 0 
[39435.613932] Oops: 0000 [#1] PREEMPT SMP NOPTI
[39435.613934] CPU: 30 PID: 178800 Comm: blender Tainted: G           OE      6.6.8-arch1-1 #1 2ffcc416f976199fcae9446e8159d64f5aa7b1db
[39435.613936] Hardware name: ASUS System Product Name/ROG CROSSHAIR X670E HERO, BIOS 1602 08/15/2023
[39435.613938] RIP: 0010:dma_resv_add_fence+0x47/0x1f0
[39435.613943] Code: 89 54 24 04 48 85 f6 74 21 48 8d 7e 38 b8 01 00 00 00 f0 0f c1 46 38 85 c0 0f 84 59 01 00 00 8d 50 01 09 c2 0f 88 5d 01 00 00 <49> 8b 46 08 48 3d 00 73 7b 9d 0f 84 c9 00 00 00 48 3d a0 72 7b 9d
[39435.613944] RSP: 0018:ffffc9001af23cf0 EFLAGS: 00010246
[39435.613946] RAX: ffff8883d36b8000 RBX: ffff8883d36b8158 RCX: 00000001ca242a1e
[39435.613947] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8883d36b8158
[39435.613948] RBP: ffff88837ceff000 R08: 0000000000000000 R09: 000000000003a5f0
[39435.613949] R10: 000000000003a5f0 R11: 0000000000000100 R12: ffff888113b27b38
[39435.613950] R13: ffff888113b27b40 R14: 0000000000000000 R15: ffff8883d36b8000
[39435.613951] FS:  00007f8b4b430000(0000) GS:ffff88903df80000(0000) knlGS:0000000000000000
[39435.613952] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39435.613953] CR2: 0000000000000008 CR3: 0000000bfa7da000 CR4: 0000000000f50ee0
[39435.613955] PKRU: 55555554
[39435.613956] Call Trace:
[39435.613957]  <TASK>
[39435.613960]  ? __die+0x23/0x70
[39435.613964]  ? page_fault_oops+0x171/0x4e0
[39435.613968]  ? exc_page_fault+0x7f/0x180
[39435.613971]  ? asm_exc_page_fault+0x26/0x30
[39435.613976]  ? dma_resv_add_fence+0x47/0x1f0
[39435.613980]  amdgpu_amdkfd_gpuvm_acquire_process_vm+0x212/0x530 [amdgpu fd8186640f20c9957c4ed5bc533f74908ab57ec4]
[39435.614122]  kfd_process_device_init_vm+0xb0/0x320 [amdgpu fd8186640f20c9957c4ed5bc533f74908ab57ec4]
[39435.614245]  ? srso_alias_return_thunk+0x5/0x7f
[39435.614248]  kfd_ioctl_acquire_vm+0x89/0xc0 [amdgpu fd8186640f20c9957c4ed5bc533f74908ab57ec4]
[39435.614363]  kfd_ioctl+0x3c9/0x4e0 [amdgpu fd8186640f20c9957c4ed5bc533f74908ab57ec4]
[39435.614470]  ? __pfx_kfd_ioctl_acquire_vm+0x10/0x10 [amdgpu fd8186640f20c9957c4ed5bc533f74908ab57ec4]
[39435.614579]  __x64_sys_ioctl+0x94/0xd0
[39435.614582]  do_syscall_64+0x5d/0x90
[39435.614585]  ? do_syscall_64+0x6c/0x90
[39435.614587]  ? srso_alias_return_thunk+0x5/0x7f
[39435.614588]  ? do_syscall_64+0x6c/0x90
[39435.614590]  ? exc_page_fault+0x7f/0x180
[39435.614592]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[39435.614595] RIP: 0033:0x7f8b5f92a3af
[39435.614614] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[39435.614615] RSP: 002b:00007fffcfe541c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[39435.614617] RAX: ffffffffffffffda RBX: 00007fffcfe542b0 RCX: 00007f8b5f92a3af
[39435.614618] RDX: 00007fffcfe54330 RSI: 0000000040084b15 RDI: 000000000000000d
[39435.614619] RBP: 00007fffcfe54330 R08: 000000000000000b R09: 0000000000000008
[39435.614620] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000040084b15
[39435.614621] R13: 000000000000000d R14: 00007f8acae29f10 R15: 00007f8b18e3a180
[39435.614624]  </TASK>
[39435.614625] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer snd_seq xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink iptable_nat xt_addrtype iptable_filter br_netfilter bridge stp llc wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel ccm cmac algif_hash algif_skcipher af_alg overlay bnep btusb btrtl btintel btbcm btmtk snd_usb_audio bluetooth snd_usbmidi_lib snd_ump xpad mousedev snd_rawmidi joydev ecdh_generic apple_mfi_fastcharge ff_memless crc16 snd_seq_device intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd vfat fat kvm iwlmvm irqbypass crct10dif_pclmul crc32_pclmul mac80211 polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel libarc4 sha512_ssse3 sha1_ssse3 eeepc_wmi asus_nb_wmi snd_hda_codec_hdmi aesni_intel asus_wmi iwlwifi ledtrig_audio crypto_simd sparse_keymap cryptd snd_hda_intel platform_profile i8042 snd_intel_dspcfg sp5100_tco snd_intel_sdw_acpi asus_ec_sensors rapl serio
[39435.614675]  intel_wmi_thunderbolt wmi_bmof cfg80211 thunderbolt snd_hda_codec pcspkr i2c_piix4 ccp ucsi_acpi snd_hda_core typec_ucsi igc rfkill snd_hwdep typec gpio_amdpt roles gpio_generic mac_hid nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_masq nft_ct nft_chain_nat vboxnetflt(OE) nf_nat vboxnetadp(OE) nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vboxdrv(OE) pkcs8_key_parser k10temp snd_aloop snd_pcm snd_timer snd soundcore v4l2loopback_dc(OE) nf_tables videodev mc i2c_dev loop fuse dm_mod nfnetlink ip_tables x_tables sr_mod cdrom hid_apple usbhid uas usb_storage amdgpu btrfs blake2b_generic i2c_algo_bit libcrc32c drm_ttm_helper crc32c_generic xor ttm raid6_pq drm_exec drm_suballoc_helper amdxcp drm_buddy gpu_sched nvme crc32c_intel drm_display_helper sha256_ssse3 nvme_core xhci_pci xhci_pci_renesas cec nvme_common video wmi
[39435.614721] CR2: 0000000000000008
[39435.614723] ---[ end trace 0000000000000000 ]---
[39435.614724] RIP: 0010:dma_resv_add_fence+0x47/0x1f0
[39435.614726] Code: 89 54 24 04 48 85 f6 74 21 48 8d 7e 38 b8 01 00 00 00 f0 0f c1 46 38 85 c0 0f 84 59 01 00 00 8d 50 01 09 c2 0f 88 5d 01 00 00 <49> 8b 46 08 48 3d 00 73 7b 9d 0f 84 c9 00 00 00 48 3d a0 72 7b 9d
[39435.614727] RSP: 0018:ffffc9001af23cf0 EFLAGS: 00010246
[39435.614729] RAX: ffff8883d36b8000 RBX: ffff8883d36b8158 RCX: 00000001ca242a1e
[39435.614730] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8883d36b8158
[39435.614731] RBP: ffff88837ceff000 R08: 0000000000000000 R09: 000000000003a5f0
[39435.614732] R10: 000000000003a5f0 R11: 0000000000000100 R12: ffff888113b27b38
[39435.614733] R13: ffff888113b27b40 R14: 0000000000000000 R15: ffff8883d36b8000
[39435.614734] FS:  00007f8b4b430000(0000) GS:ffff88903df80000(0000) knlGS:0000000000000000
[39435.614735] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39435.614736] CR2: 0000000000000008 CR3: 0000000bfa7da000 CR4: 0000000000f50ee0
[39435.614737] PKRU: 55555554
[39435.614738] note: blender[178800] exited with irqs disabled
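The first three lines of that log can be read as a plain range conflict followed by an error return. The sketch below parses the "conflict with" line and checks that the two ranges overlap. It assumes the ranges are inclusive and expressed in 4 KiB pages, which is a guess based on the first value shifted left by 12 bits matching the VA 0x800000000000 reported on the next line. The helper names are hypothetical.

```python
# Minimal sketch: interpret the "va ... conflict with ..." dmesg line.
# Assumptions: ranges are inclusive and given in units of 4 KiB pages.
import errno
import re

SAMPLE = (
    "amdgpu 0000:03:00.0: amdgpu: bo 00000000da340e9d "
    "va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000840"
)

CONFLICT_RE = re.compile(
    r"va 0x([0-9a-f]+)-0x([0-9a-f]+) conflict with 0x([0-9a-f]+)-0x([0-9a-f]+)"
)


def parse_conflict(line: str):
    """Return ((new_start, new_end), (old_start, old_end)) or None."""
    match = CONFLICT_RE.search(line)
    if match is None:
        return None
    a0, a1, b0, b1 = (int(group, 16) for group in match.groups())
    return (a0, a1), (b0, b1)


def ranges_overlap(a, b) -> bool:
    """True if two inclusive integer ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


new_range, existing_range = parse_conflict(SAMPLE)
print("overlap:", ranges_overlap(new_range, existing_range))
print("start address:", hex(new_range[0] << 12))
print("ret -22 is:", errno.errorcode[22])
```

So the requested mapping starts inside a range that is already mapped, the map call fails with -22 (EINVAL), and the NULL pointer dereference in dma_resv_add_fence appears a few milliseconds later while the process VM is being acquired.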