ROCm: Regression in ROCm 5.3 and newer for gfx1010
Since PyTorch 2 was officially released, I haven't been able to run it on my 5700 XT, while I was previously able to use it just fine on PyTorch 1.13.1 by setting "export HSA_OVERRIDE_GFX_VERSION=10.3.0". Many people are reporting the same issue on the 5000 series, for example https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6420
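For reference, a minimal way to apply the override before launching anything that imports torch (the sanity check at the end is just a generic test, not tied to any particular project):

    # Treat the RDNA 1 card as gfx1030 (RDNA 2), the workaround that worked with the PyTorch 1.13.1 builds
    export HSA_OVERRIDE_GFX_VERSION=10.3.0

    # Quick sanity check: does the ROCm build of PyTorch see the GPU?
    python -c "import torch; print(torch.cuda.is_available())"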
--precision full and --no-half are also needed because the card apparently can't use fp16 on Linux/ROCm, as already reported here https://github.com/RadeonOpenCompute/ROCm/issues/1857
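For context, those two flags are AUTOMATIC1111 webui launch options; a minimal sketch of how they are usually set, assuming the stock webui-user.sh layout:

    # webui-user.sh: force fp32, since fp16 is broken on this card under ROCm
    export COMMANDLINE_ARGS="--precision full --no-half"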
I also read about the PCIe atomics requirement, following this issue https://github.com/pytorch/pytorch/issues/103973 ...but that doesn't seem to be my case. The command "grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties" returns:
/sys/class/kfd/kfd/topology/nodes/0/io_links/0/properties:flags 3
/sys/class/kfd/kfd/topology/nodes/1/io_links/0/properties:flags 1
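If I read the kernel's KFD io_link flag bits correctly (bit 0x4 = no 32-bit atomics, bit 0x8 = no 64-bit atomics; treat this decoding as an assumption, the authoritative source is the amdgpu/kfd CRAT headers), the values can be decoded like this:

    # Bits 0x4/0x8 set would mean PCIe atomics are NOT available on that link
    for f in /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties; do
        flags=$(awk '$1 == "flags" {print $2}' "$f")
        if (( flags & 0xC )); then
            echo "$f: PCIe atomics missing (flags=$flags)"
        else
            echo "$f: PCIe atomics look fine (flags=$flags)"
        fi
    done

With flags 3 and 1 as above, neither no-atomics bit is set, which matches the conclusion that atomics are not the problem here.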
Also, I tried to compile PyTorch using the new "-mprintf-kind=buffered" flag, but it didn't change anything.
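For reference, this is roughly how the flag can be injected into a source build; a sketch only, assuming hipcc honors HIPCC_COMPILE_FLAGS_APPEND and that PYTORCH_ROCM_ARCH restricts the build targets:

    # Unverified sketch of a from-source ROCm build with the extra compile flag
    export PYTORCH_ROCM_ARCH=gfx1010
    export HIPCC_COMPILE_FLAGS_APPEND="-mprintf-kind=buffered"
    python tools/amd_build/build_amd.py   # hipify the sources
    python setup.py develop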
Finally, I recently found out that PyTorch 2 works just fine on gfx1010 if it's compiled with ROCm 5.2, as suggested here https://github.com/pytorch/pytorch/issues/106728
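A quick way to check which HIP/ROCm version a given torch build was compiled against (torch.version.hip is populated in the ROCm builds of PyTorch):

    # Prints the torch version and the HIP/ROCm version it was built with
    python -c "import torch; print(torch.__version__, torch.version.hip)"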
About this issue
- State: open
- Created 9 months ago
- Reactions: 1
- Comments: 23
ok, we will tackle this issue next @kmsedu @DGdev91
Well, for example to be able to use the official PyTorch builds instead of relying on old nightlies or compiling from source.
To be clear, on Ubuntu were you using libhipblas-dev (which installs to /usr/lib/x86_64-linux-gnu) or were you using hipblas-dev (which installs to /opt/rocm/lib)? If you were using libhipblas-dev, I'm very interested in learning more. Could you provide some instructions on how to reproduce the problem?
Using HSA_OVERRIDE_GFX_VERSION=10.3.0 on RDNA 2 GPUs is fundamentally different from using it on RDNA 1 GPUs. All RDNA 2 GPUs use the exact same instructions, but there's a bunch of differences between the instructions used on RDNA 1 and RDNA 2 GPUs. The only way to undo this 'regression' with HSA_OVERRIDE_GFX_VERSION would be to change LLVM so that the compiler only uses instructions available on RDNA 1, even when asked to compile for RDNA 2. That's not going to happen.
A better path to getting gfx1010 enabled in PyTorch would be to build the ROCm math and AI libraries for gfx1010 (or gfx10.1-generic). That is probably not going to happen in AMD's official packages, but there are other groups building and distributing ROCm packages. I can't speak for other distributions, but I expect to have it enabled later this year on Debian. With that said, my work with Debian is strictly volunteer work (on top of my full-time job), so don't expect it to happen quickly.
Ok, now it's clearer. I can confirm I used hipblas-dev.
I was also thinking that the HSA override variable was needed for rocBLAS too, because I couldn't use it natively on gfx1010, since the libs for gfx1010 were missing in the official packages.
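One way to verify whether an installed rocBLAS actually ships gfx1010 kernels is to look for gfx1010 entries in its Tensile library directory; a sketch, since the exact path varies between ROCm versions and /opt/rocm/lib/rocblas/library is an assumption for recent releases:

    # List the GPU architectures the installed rocBLAS/Tensile kernels were built for
    ls /opt/rocm/lib/rocblas/library/ | grep -o 'gfx[0-9a-f]*' | sort -u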
I also just found this PR, which was merged just 5 days ago and makes compiling the Tensile libs for gfx1010 a bit simpler: https://github.com/ROCm/Tensile/pull/1862
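For completeness, rocBLAS's build script can restrict the Tensile targets, which is what makes a gfx1010-only build feasible; a sketch, assuming the -a/--architecture option of install.sh behaves as documented in the rocBLAS repo:

    # Build rocBLAS (and its Tensile kernels) only for gfx1010; long build, treat as a sketch
    git clone https://github.com/ROCm/rocBLAS.git
    cd rocBLAS
    ./install.sh -d -a gfx1010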
Ok, then you have the same problem. We know that anything compiled with ROCm 5.2 or older works just fine on that card; if you try to force a newer version in webui-user.sh it's probably not going to work. Also, HSA_OVERRIDE_GFX_VERSION=10.3.0 is usually what's used for the override. AUTOMATIC1111's webui should set it automatically for older GPUs, so maybe you were actually relying on that.
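For anyone who wants to pin the webui to the last wheels that still worked on this card, the torch install command can be overridden in webui-user.sh; a sketch, assuming the TORCH_COMMAND variable and the rocm5.2 wheel index that the PyTorch 1.13.1 builds shipped on:

    # webui-user.sh: pin torch to the ROCm 5.2 builds and keep the gfx1030 override
    export HSA_OVERRIDE_GFX_VERSION=10.3.0
    export TORCH_COMMAND="pip install torch==1.13.1+rocm5.2 torchvision==0.14.1+rocm5.2 --extra-index-url https://download.pytorch.org/whl/rocm5.2"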
I know. But it's still weird that a GPU which worked perfectly fine with PyTorch compiled for an older ROCm version + HSA_OVERRIDE_GFX_VERSION=10.3.0 (even if not officially supported) suddenly stops working with everything compiled against something newer. I also tried to pick up an old Docker image with ROCm 5.2, but it doesn't seem able to compile it.
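For the Docker route, the usual way to expose the GPU to a ROCm container looks like this; a sketch, and the image tag is a placeholder, so pick an actual ROCm 5.2 based tag from Docker Hub:

    # <rocm5.2-tag> is a placeholder; see hub.docker.com/r/rocm/pytorch for real tags
    docker run -it --device=/dev/kfd --device=/dev/dri \
        --group-add video --security-opt seccomp=unconfined \
        rocm/pytorch:<rocm5.2-tag> bash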
This is also true for other software which relies on ROCm, like llama.cpp with hipBLAS support.
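Same story with llama.cpp: its ROCm backend needs the math libraries built for the card. At the time this was written, the hipBLAS build was selected roughly like this; a sketch, and the GPU_TARGETS variable name is taken from the Makefile of that era, so verify it against the current README:

    # Build llama.cpp with ROCm/hipBLAS support, targeting gfx1010 only
    make LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1010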