ROCm: 4.3.1 / gfx803 / tensorflow-rocm 2.6.0 - librocblas.so.0: cannot open shared object file: No such file or directory

I know gfx803 is only unofficially supported now, but any help would be great!

Setup: Ubuntu 20.04, gfx803 GPUs, EPYC (Zen 1) CPU.

I installed ROCm with: sudo apt install rocm-dkms rocm-libs

rocm-smi shows all my GPUs (all of them are the same gfx803 model).

I then installed tensorflow-rocm into a virtualenv from inside PyCharm.

When I try to run:

from ai_benchmark import AIBenchmark
benchmark = AIBenchmark(use_CPU=False, verbose_level=1)
results = benchmark.run()

I get the following error:

~/Projects/mlenv/lib/python3.8/site-packages/tensorflow/python/pywrap_tensorflow.py in <module>
     63 try:
---> 64   from tensorflow.python._pywrap_tensorflow_internal import *
     65 # This try catch logic is because there is no bazel equivalent for py_extension.

ImportError: librocblas.so.0: cannot open shared object file: No such file or directory
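The ImportError above comes from the dynamic loader, not from Python itself, so it can be reproduced without TensorFlow. The following is a minimal sketch, assuming ROCm is installed under the default /opt/rocm prefix (an assumption; the post does not say where the packages landed). The helper names can_load and find_candidates are made up for this illustration.

```python
import ctypes
import glob


def can_load(soname):
    """Return True if the dynamic loader can resolve and open `soname`."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False


def find_candidates(pattern="/opt/rocm*/**/librocblas.so*"):
    """List rocBLAS files under the assumed ROCm install prefix."""
    return sorted(glob.glob(pattern, recursive=True))


if __name__ == "__main__":
    # Same soname that the TensorFlow import fails on.
    print("librocblas.so.0 loadable:", can_load("librocblas.so.0"))
    for path in find_candidates():
        print("found:", path)
```

If the file shows up in the listing but cannot be loaded, the library is installed and the loader search path is the likely problem; if nothing is found, the rocBLAS package itself is probably missing or ships a different soname.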

If I install the patched ("dirty") build of rocblas from https://github.com/xuhuisheng/rocm-gfx803 (which is for ROCm 4.3.0, not 4.3.1), the benchmark runs, but it doesn't see my GPUs and rocm-smi shows 0% activity.
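A quick way to separate "the benchmark ignores the GPUs" from "TensorFlow itself sees no GPUs" is to ask TensorFlow directly. This is only a sketch: visible_gpus is a hypothetical helper name, and it relies on tf.config.list_physical_devices, which is available in TensorFlow 2.1 and later.

```python
def visible_gpus():
    """Return the GPU device names TensorFlow reports, or None if the import fails."""
    try:
        import tensorflow as tf
    except ImportError as exc:
        # A missing shared library such as librocblas.so.0 surfaces here.
        print("TensorFlow import failed:", exc)
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow could not be imported")
    elif not gpus:
        print("TensorFlow imported, but it reports no GPUs")
    else:
        print("GPUs visible to TensorFlow:", gpus)
```

An empty list would mean the wheel in use has no working GPU backend for this hardware, which would match the 0% activity reported by rocm-smi.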

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

tensorflow_rocm-2.6.0 dropped gfx803 support. You can compile tensorflow_rocm-2.6.0 yourself.

In my environment, tensorflow_rocm-2.6.0 can run properly on gfx803.

work@91f6c555a036:~/test$ python3 test.py
2021-09-25 07:34:07.692003: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-25 07:34:07.693618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 7692 MB memory:  -> device: 0, name: Device 67df, pci bus id: 0000:02:00.0
2021-09-25 07:34:08.772256: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-09-25 07:34:08.798666: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-09-25 07:34:08.804292: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-09-25 07:34:08.807748: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
Epoch 1/5
2021-09-25 07:34:09.096113: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
1875/1875 [==============================] - 5s 2ms/step - loss: 0.3016 - accuracy: 0.9119
Epoch 2/5
1875/1875 [==============================] - 4s 2ms/step - loss: 0.1476 - accuracy: 0.9551
Epoch 3/5
1875/1875 [==============================] - 4s 2ms/step - loss: 0.1100 - accuracy: 0.9663
Epoch 4/5
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0891 - accuracy: 0.9732
Epoch 5/5
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0772 - accuracy: 0.9758
2021-09-25 07:34:31.313063: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-09-25 07:34:31.318229: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-09-25 07:34:31.322008: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-09-25 07:34:31.445636: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
313/313 [==============================] - 1s 2ms/step - loss: 0.0735 - accuracy: 0.9777

:1:rocdevice.cpp :438 : 47612163925 us: hsa_init failed.

This message says that device initialization failed: https://github.com/ROCm-Developer-Tools/ROCclr/blob/rocm-4.3.x/device/rocm/rocdevice.cpp#L438

hsa_init() just tries to acquire the runtime, and that is the step that failed: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/rocm-4.3.x/src/core/runtime/hsa.cpp#L206 https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/rocm-4.3.x/src/core/runtime/runtime.cpp#L94

My guess is that the cause is HSA_STATUS_ERROR_OUT_OF_RESOURCES. Confirming it would require recompiling ROCR-Runtime to print more log output.
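Before rebuilding ROCR-Runtime, it may be worth ruling out a simpler cause. This sketch is not the diagnosis given above; it rests on the assumption that the HSA runtime has to open /dev/kfd and the /dev/dri/renderD* nodes, so a user without read/write access to them could see hsa_init fail. The function name device_access is made up for this example.

```python
import glob
import os


def device_access(paths=None):
    """Map each device node to True if it exists and is readable and writable."""
    if paths is None:
        # Assumed device nodes used by the ROCm stack.
        paths = ["/dev/kfd"] + sorted(glob.glob("/dev/dri/renderD*"))
    return {
        path: os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)
        for path in paths
    }


if __name__ == "__main__":
    for path, ok in device_access().items():
        print(path, "accessible" if ok else "NOT accessible")
```

If every node is accessible, permissions are not the issue and the OUT_OF_RESOURCES guess remains the lead to follow.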

tensorflow_rocm-2.4.3 had official support for gfx803: https://pypi.org/project/tensorflow-rocm/2.4.3/

Please try the gfx803 build of tensorflow_rocm-2.6.0 that I just uploaded: https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm43/tensorflow-2.6.0-cp38-cp38-linux_x86_64.whl

The source code comes from this branch: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/tree/r2.6-rocm-enhanced
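The file name tensorflow-2.6.0-cp38-cp38-linux_x86_64.whl carries a cp38 tag, which by the wheel naming convention means it targets CPython 3.8. Below is a small sketch for checking that a virtualenv's interpreter matches before installing; the helper names are illustrative and it only handles CPython-style tags such as cp38.

```python
import sys


def wheel_python_tag(filename):
    """Extract the Python tag (e.g. 'cp38') from a wheel file name."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    # Wheel names end with <python tag>-<abi tag>-<platform tag>.
    return stem.split("-")[-3]


def matches_interpreter(filename, version=None):
    """Return True if the wheel's CPython tag matches the (major, minor) version."""
    major, minor = version or sys.version_info[:2]
    return wheel_python_tag(filename) == f"cp{major}{minor}"


if __name__ == "__main__":
    wheel = "tensorflow-2.6.0-cp38-cp38-linux_x86_64.whl"
    print("wheel tag:", wheel_python_tag(wheel))
    print("matches this interpreter:", matches_interpreter(wheel))
```

The python3.8 path in the traceback from the original post suggests that virtualenv would pass this check.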