ROCm: ROCm-3.9, ROCm-3.10 crash with gfx803

If you install ROCm-3.9 or ROCm-3.10 on a gfx803 GPU, TensorFlow or PyTorch crashes as soon as it starts running. The error is as follows:

work@0b7758c3094d:~/test/examples/mnist$ python3 main.py
/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")
Aborted (core dumped)

OS: Ubuntu-20.04
CPU: Xeon 2620v3
GPU: RX580 8G (Polaris10)
CHIP ID: 0x67df
Python: 3.8.5
Tensorflow-rocm: 2.3.1

The HIP sample runs OK:

work@0b7758c3094d:~/test$ make
/opt/rocm/hip/bin/hipcc  square.cpp -o square.out
work@0b7758c3094d:~/test$ ./square.out
info: running on device Device 67df
info: allocate host mem (  7.63 MB)
info: allocate device mem (  7.63 MB)
info: copy Host2Device
info: launch 'vector_square' kernel
info: copy Device2Host
info: check result
PASSED!

UPDATE 2020-11-05: The cause is that rocSPARSE is not compiled for gfx803. After compiling rocSPARSE with AMDGPU_TARGETS=gfx803 and reinstalling the custom rocSPARSE package, the problem was solved.

It is a bug in the rocSPARSE CMake config: AMDGPU_TARGETS is never used.

The pull request has been merged: https://github.com/ROCmSoftwarePlatform/rocSPARSE/pull/213
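For anyone who wants a rough check of whether their installed library was built with gfx803, here is a minimal Python sketch. It only searches the file's raw bytes for the target name, on the assumption (mine, not confirmed in this thread) that bundled GPU code objects carry the target name as plain text. The function name `library_mentions_target` is made up for illustration, so treat the result as a hint rather than proof.

```python
def library_mentions_target(lib_path, target="gfx803", chunk_size=1 << 20):
    """Return True if the target name occurs anywhere in the file's bytes.

    Heuristic only: assumes the GPU target name is stored as plain ASCII
    text inside the shared library (an assumption, not a documented API).
    """
    needle = target.encode("ascii")
    overlap = len(needle) - 1  # bytes kept so matches across chunk borders are found
    tail = b""
    with open(lib_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if needle in window:
                return True
            # Keep the last few bytes in case the name straddles two chunks.
            tail = window[-overlap:] if overlap else b""
```

You would call it on the rocSPARSE shared library, for example `library_mentions_target("/opt/rocm/lib/librocsparse.so")`. That path is a guess at a typical install location, so adjust it to wherever the library lives on your system.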

The problem in #1265 still remains.

UPDATE 2020-11-21: I wrote a doc with details on the gfx803 issues: https://github.com/xuhuisheng/rocm-build/blob/develop/docs/gfx803.md

About this issue

State: closed
Created 4 years ago
Reactions: 16
Comments: 44 (7 by maintainers)

Commits related to this issue

move AMDGPU_TARGETS before include Dependencies (#213) AMDGPU_TARGETS marked as cache string. When after include Dependencies.cmake, AMDGPU_TARGETS always get cached variable gfx900;gfx906;gfx908, I... — committed to ROCm/rocSPARSE by xuhuisheng 4 years ago

Most upvoted comments

Hi @xuhuisheng and others, Thanks for the issue. Let me check on this.

rkothako on Nov 2, 2020

Same here… (btw, there is a typo in the word Coudn’t)

/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")
Aborted (core dumped)

OS: Ubuntu-20.04.1 LTS
CPU: Intel i3-6100
GPU: RX580
Python: 3.8.5
Tensorflow-rocm: 2.3.1

I followed the guide AMD provided https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html twice, both times in a fresh Ubuntu installation.

rocminfo and clinfo seem to be working properly. I will attach the command output below: rocminfo.pdf clinfo.pdf

I noticed that each time you exit a Python interactive session where TensorFlow was imported, it throws the exact same error: python_import_TF.pdf

I also tried the guide https://www.videogames.ai/Install-ROCM-Machine-Learning-AMD-GPU along with the video (which is more complete) https://www.youtube.com/watch?v=fkSRkAoMS4g without any success. (It fails the same way when you try to run the tf_cnn_benchmarks.py script)

Grench6 on Oct 30, 2020

The rocSPARSE pull request has been merged, and I verified the fix locally. I will close this issue and hope the fix is released soon.

xuhuisheng on Nov 5, 2020

@rkothako Thank you for replying. Please also check this issue: https://github.com/RadeonOpenCompute/ROCm/issues/1265 . ROCm-3.7 and ROCm-3.8 cannot run correctly on gfx803, while ROCm-3.9 crashes outright on gfx803.

xuhuisheng on Nov 2, 2020

@VegetaDTX No problem, I have already tested it, and downgrading works as expected 😃 I wrote a mini-guide here to downgrade and install Rocm, tensorflow-rocm, and test it with a benchmark.

Grench6 on Nov 11, 2020

@angimenez I found a comment about gfx803 on the rocSPARSE develop branch: https://github.com/ROCmSoftwarePlatform/rocSPARSE/commit/f8791e9b09c4ac6d72f56fb3c6663273dce2aea5#commitcomment-43334853 We should wait for AMD to fix it.

update: fixed https://github.com/ROCmSoftwarePlatform/rocSPARSE/commit/7de15942cf9fe0fb7db80e0c45ebb4d1e3086668

xuhuisheng on Nov 24, 2020

Hello there, good folks of GitHub! I have the exact same problem. I spent half of the day setting up the environment for working with GPU-powered TensorFlow projects, and in the end I get this error that I just can’t seem to find a solution to 😦

OS: Ubuntu 20.04.1 LTS
CPU: AMD® Ryzen 5 1600x six-core processor × 12
GPU: Radeon RX 570 Series (POLARIS10, DRM 3.40.0, 5.4.0-52-generic, LLVM 10.0.0)
Tensorflow-rocm version: 2.3.2

I hate to sound negative, but things like these seriously make me want to give up techy things once and for all and just go become a professional shepherd…

VegetaDTX on Nov 1, 2020