ROCm: [Driver] *ERROR* MES failed to response msg=2
Triggered by running https://github.com/RadeonOpenCompute/rocm_bandwidth_test in a loop while running https://github.com/ROCm-Developer-Tools/HIP-Examples/tree/master/gpu-burn in a loop.
1x 7900 XTX, ASRock ROMED8-2T, EPYC 7662, Ubuntu 22.04, kernel 6.2.14-060214-generic, ROCm 5.5
sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover will recover the GPU.
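For reference, the reproducer boils down to something like the sketch below. Binary names and the duration argument are assumptions taken from those two repos rather than verified here, so adjust to however and wherever you built them:

# Keep the GPU loaded with gpu-burn while hammering copies with rocm_bandwidth_test.
# Binary names/args are assumed (check each repo's Makefile/README); the dri index may differ on your box.
( while true; do ./gpuburn-hip 60; done ) &      # background compute load
while true; do rocm-bandwidth-test; done         # SDMA copy loop; eventually trips the MES error
# when dmesg shows the MES error, the amdgpu_gpu_recover debugfs read above brings the GPU back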
[ 111.406216] amdgpu 0000:83:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:10 pasid:32769, for process rocm-bandwidth- pid 3286 thread rocm-bandwidth- pid 3286)
[ 111.406237] amdgpu 0000:83:00.0: amdgpu: in page starting at address 0x00007f0000000000 from client 10
[ 111.406246] amdgpu 0000:83:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00A01A30
[ 111.406253] amdgpu 0000:83:00.0: amdgpu: Faulty UTCL2 client ID: SDMA0 (0xd)
[ 111.406259] amdgpu 0000:83:00.0: amdgpu: MORE_FAULTS: 0x0
[ 111.406265] amdgpu 0000:83:00.0: amdgpu: WALKER_ERROR: 0x0
[ 111.406270] amdgpu 0000:83:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 111.406275] amdgpu 0000:83:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 111.406280] amdgpu 0000:83:00.0: amdgpu: RW: 0x0
[ 114.188710] amdgpu 0000:83:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:10 pasid:32769, for process rocm-bandwidth- pid 3286 thread rocm-bandwidth- pid 3286)
[ 114.188729] amdgpu 0000:83:00.0: amdgpu: in page starting at address 0x00007f0000000000 from client 10
[ 114.188738] amdgpu 0000:83:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00A01A30
[ 114.188746] amdgpu 0000:83:00.0: amdgpu: Faulty UTCL2 client ID: SDMA0 (0xd)
[ 114.188754] amdgpu 0000:83:00.0: amdgpu: MORE_FAULTS: 0x0
[ 114.188759] amdgpu 0000:83:00.0: amdgpu: WALKER_ERROR: 0x0
[ 114.188765] amdgpu 0000:83:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 114.188770] amdgpu 0000:83:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 114.188776] amdgpu 0000:83:00.0: amdgpu: RW: 0x0
[ 114.302856] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=2
[ 114.303173] amdgpu: failed to add hardware queue to MES, doorbell=0x1202
[ 114.303176] amdgpu: MES might be in unrecoverable state, issue a GPU reset
[ 114.303179] amdgpu: Failed to restore queue 0
[ 114.303182] amdgpu: Failed to restore process queues
[ 114.303184] amdgpu: Failed to restore queues of pasid 0x8001
[ 114.303450] amdgpu 0000:83:00.0: amdgpu: GPU reset begin!
[ 114.303477] amdgpu: Failed to evict queue 1
[ 114.303483] amdgpu: Failed to suspend process 0x8002
[ 114.309700] amdgpu 0000:83:00.0: amdgpu: recover vram bo from shadow start
[ 114.309705] amdgpu 0000:83:00.0: amdgpu: recover vram bo from shadow done
[ 114.412749] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 114.413073] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 114.420094] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 114.420379] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 114.523222] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 114.523499] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 114.530476] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 114.530750] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 114.634108] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 114.634391] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
...
[ 117.293167] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=2
[ 117.293439] [drm:amdgpu_mes_add_hw_queue [amdgpu]] *ERROR* failed to add hardware queue to MES, doorbell=0x2200
[ 117.293712] [drm:amdgpu_mes_self_test [amdgpu]] *ERROR* failed to add ring
[ 117.294161] amdgpu 0000:83:00.0: amdgpu: GPU reset(1) succeeded!
About this issue
- State: open
- Created a year ago
- Reactions: 17
- Comments: 25
I can reproduce this on my 7900 XTX as well
I am still seeing this on both ROCm 6.0 and the 6.0.1 packages at https://repo.radeon.com/rocm/apt/6.0.1 / https://repo.radeon.com/amdgpu/apt/6.0.1 on fresh Ubuntu 22.04.
Linux amdsux 6.5.0-15-generic #15~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 12 18:54:30 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

I realize it does no good for everyone else experiencing this issue to add yet another “yeah, same here” and some venting, so the rest of you may stop reading now. But AMD, you really need to know this is not acceptable.
It’s been over a year since the card was released and still the software isn’t even stable, let alone all the trouble it takes to get set up vs nVidia. All whilst AMD releases these frankly patronizing announcements about how great and open and “ready” ROCm 6.0 is, “driving an inflection with developers”. Erm, yeah… it’s making us run away.
Please understand that developer sentiment towards AMD is very poor and getting worse, not better. Everyone like me who is trying to do research but bought an AMD card because they couldn’t afford an nVidia card is never, ever going to buy an AMD GPU again. And we will tell our employers never to buy AMD either, and they will listen, because developer time is far more expensive than nVidia’s hardware. We should have listened to our friends who told us to just buy nVidia cards, but we won’t make this mistake again. This will last years and years. The best you can do now is admit your failures and try to salvage your reputation for CPUs. Because it doesn’t matter how good or cheap your GPUs get, we will never risk wasting our time again.
For all this talk about openness, we’re all here staring at this issue, and does anyone outside AMD even understand what this message means? For that matter, does anyone INSIDE AMD understand what it means? Please explain it to us.
So far as I can tell, “MES” is the MicroEngine Scheduler, and https://docs.kernel.org/gpu/amdgpu/driver-core.html#graphics-and-compute-microcontrollers says it is “unused”. Mumblings on the internet suggest it is used now, and well, we’re in this GitHub issue together, aren’t we? Beyond that, things get extremely hazy. The kernel code that actually produces that log message is completely opaque. None of these functions are documented. MES is… a separate microcontroller controlled by a firmware blob? Do I have that right? And what exactly does this mean? https://github.com/torvalds/linux/blob/41bccc98fb7931d63d03f326a746ac4d429c1dd3/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c#L563
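For anyone who wants to poke at this themselves, MES does at least show up from userspace in a few places. A rough sketch, assuming debugfs is mounted and the same dri/1 index as the log above (your card index may differ):

sudo grep -i mes /sys/kernel/debug/dri/1/amdgpu_firmware_info   # loaded MES firmware versions, if listed
modinfo amdgpu | grep -i mes                                    # the mes / mes_kiq module parameters and firmware references
ls /lib/firmware/amdgpu/ | grep -i mes                          # the firmware blobs themselves (gfx11 *mes*.bin)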
Is this AMD’s “open software approach”? Handing out binary blobs and sending tarballs to geohot that he’s not allowed to distribute?!
PS: I’m still not over the terrible software support for the RX 480 last time. I hoped things had changed. Fool me twice, shame on me. But never again. I’m losing my mind over this because I’ve lost months of productivity… I can’t stress this enough: my experiment seems to be working when I can get it to run, but the reason I cannot publish is that I chose the wrong GPU. I’m going to have to go get a “real job” instead.
Using an RX 7900 XTX with no monitor attached, training a diffusion model or just running inference. (Getting monitor output from another GPU.)
Still seeing the same MES hangs with Arch Linux 6.8.1-arch1-1 / ROCm 6.0.2 / PyTorch 2.4.0.dev20240323+rocm6.0. The MES hang causes other unrelated Python programs to core dump too, and the GPU keeps pulling ~120 W when it hangs.
The only way to recover from the hang, if you have no monitor attached to the AMD GPU, is running

rocm-smi --gpureset -id 0

If you have a monitor attached, the only way to recover is a hard reboot.

Adding

amdgpu.mes=0 amdgpu.mes_kiq=1

to the grub config doesn’t seem to be doing anything either.
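For completeness, this is roughly how those parameters get applied and verified - a sketch only, using the parameter names from above and making no claim that they actually help:

# Append the options to the kernel command line (or edit /etc/default/grub by hand),
# then regenerate the grub config and reboot. update-grub is the Debian/Ubuntu helper;
# on Arch use grub-mkconfig -o /boot/grub/grub.cfg instead.
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&amdgpu.mes=0 amdgpu.mes_kiq=1 /' /etc/default/grub
sudo update-grub && sudo reboot

# after reboot, confirm the kernel and the module actually saw them:
cat /proc/cmdline
cat /sys/module/amdgpu/parameters/mes /sys/module/amdgpu/parameters/mes_kiq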
Log on PyTorch when MES hangs:

Journalctl in reverse order:
And just checking - are you on the new kernels with the 6.0.2 binaries, or is your code compiled against the 6.0.2 libs? (Yes - I have had to compile PyTorch against 6.0.2.)
Mainly because I finally got there - and closed my ticket… #2689
I’m not saying it was easy or simple, but I can hit 20-22GB consistently somehow now… and I’m at over 3 days of uptime too… which is a first since November last year…
And here is the worst ever way to show which Ubuntu libs got me there…
And then I made PyTorch find the folder, ran the special PyTorch conversion from CUDA to HIP, then build, then build and install, and then built torchvision and torchaudio after… so much faff…
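Roughly, that sequence looks like the sketch below - my paraphrase only, assuming a pytorch source checkout, ROCm under /opt/rocm, and the 7900 XTX target; the conversion step is PyTorch’s own tools/amd_build/build_amd.py:

# Build PyTorch (then torchvision/torchaudio) against a local ROCm 6.0.2 install.
# Paths, versions and the gfx target are assumptions - adjust to your setup.
export ROCM_PATH=/opt/rocm
export PYTORCH_ROCM_ARCH=gfx1100          # RDNA3 / 7900 XTX

cd pytorch
python tools/amd_build/build_amd.py       # the CUDA-to-HIP conversion step
python setup.py install                   # build and install torch

cd ../vision && python setup.py install   # torchvision against the freshly built torch
cd ../audio  && python setup.py install   # torchaudio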
But - I run now…
Sadly it did not help me 😦 High-VRAM workloads still cause the unrecoverable MES error every time, and it requires a reboot to solve.
Any reason you want to run the older kernel?
A few people are experiencing better results on the newer kernels, and the 6.0.2 drivers.
I got to a running setup (with some minor issues) with the 6.5.x stock one - ended up closing my ticket too. #2689
I’ve had this exact issue frequently with ROCm 5.7 on Radeon RX 7900 XTX.
Upgrading to ROCm 6.0 has solved it for me. At least Stable Diffusion with torch-2.3.0+rocm5.7 still works on ROCm 6.0, and without crashes now - no need to restart after a failed GPU reset.
The non-critical GCVM_L2_PROTECTION_FAULT_STATUS errors are still there, but they no longer force me to restart after an MES failure and a GPU reset attempt.
Currently working:
Linux moon 6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
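On the earlier question of whether the userspace actually matches what your code was built against, a quick hedged way to check (paths assume a standard /opt/rocm layout; adjust for versioned /opt/rocm-x.y.z installs):

cat /opt/rocm/.info/version                      # installed ROCm release
dpkg -l | grep -E 'rocm|hip'                     # packages actually present (Debian/Ubuntu)
modinfo amdgpu | grep -i '^version'              # DKMS driver version, if any
python -c "import torch; print(torch.__version__, torch.version.hip)"   # what torch was built against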