ROCm: OpenCL stopped working "no devices" when upgrading to ROCm 3.0 from ROCm 2.10
I updated from ROCm 2.10 to ROCm 3.0, and OpenCL stopped working: it now reports 0 devices. There are no errors in dmesg.
Kernel: Linux 5.4.5 and 5.5.0-rc2 (same behavior on both); GPU: Radeon VII. rocm-smi correctly reports all the GPUs, so the hardware seems to be detected and initialized correctly:
~/rocm-opencl$ ~/ROC-smi/rocm-smi
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 33.0c 27.0W 809Mhz 351Mhz 20.0% auto 250.0W 0% 0%
[etc]
But both /usr/bin/clinfo and /opt/rocm/opencl/bin/x86_64/clinfo report no devices:
/opt/rocm/opencl/bin/x86_64/clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3052.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
ERROR: clGetDeviceIDs(-1)
/usr/bin/clinfo
Number of platforms 1
Platform Name AMD Accelerated Parallel Processing
Platform Vendor Advanced Micro Devices, Inc.
Platform Version OpenCL 2.1 AMD-APP (3052.0)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Host timer resolution 1ns
Platform Extensions function suffix AMD
Platform Name AMD Accelerated Parallel Processing
Number of devices 0
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No devices found in platform
The ROCm packages that I have installed:
hsa-rocr-dev/Ubuntu 16.04,now 1.1.9.0-rocm-rel-3.0-6-7128d0d amd64 [installed,automatic]
hsakmt-roct/Ubuntu 16.04,now 1.0.9-298-gea01eb3 amd64 [installed,automatic]
rocm-opencl/Ubuntu 16.04,now 2.0.0-rocm-rel-3.0-6-9a4afec amd64 [installed]
About this issue
- State: closed
- Created 5 years ago
- Reactions: 6
- Comments: 71
For those wondering how to revert to a previous version on Debian-based distros:
Replace http://repo.radeon.com/rocm/apt/debian/ with http://repo.radeon.com/rocm/apt/2.10.0/
Thanks all for the help/suggestions over this thread. I do not think this issue is still valid now, as we are currently on ROCm 4.0 and ROCm upgrade is not supported at present. We are streamlining our ROCm package versions, so ROCm upgrade support will be available very soon, in about 2 to 3 months. Internal validation has already started and we will share official support for this soon. Please stay tuned for more updates via our ROCm documentation page. Thank you.
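For the revert to 2.10 mentioned above, the URL swap can be scripted. This is only a sketch: the helper name `pin_rocm_repo` is made up for illustration, and the file path in the usage comment is an assumption based on the usual install instructions, so check where your repo.radeon.com entry actually lives.

```shell
# Sketch only: point an apt source entry at a versioned ROCm repository.
# pin_rocm_repo is a hypothetical helper, not part of ROCm or apt.
pin_rocm_repo() {
    list_file=$1   # apt sources file that contains the repo.radeon.com entry
    version=$2     # e.g. 2.10.0
    # Keep a .bak copy, then swap the rolling "debian" path for the versioned one.
    sed -i.bak "s#repo\.radeon\.com/rocm/apt/debian/*#repo.radeon.com/rocm/apt/${version}/#" "$list_file"
}

# Usage (as root; the path is an assumption):
#   pin_rocm_repo /etc/apt/sources.list.d/rocm.list 2.10.0
#   apt update
```

After the swap, the ROCm packages still have to be removed and reinstalled so that apt picks up the older versions.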
I can’t believe this, but jcdutton’s answer fixed the issue for me. Fresh install of Ubuntu 20.04 on a new system, Vega 56, and fresh install of ROCm 3.3.0 from the repo. rocminfo worked but clinfo segfaulted. Installing libncurses5 and libncursesw5 solved the issue.
The clue is in the strace, which not long before the crash claims it can’t find libtinfo.so.5. But then it complains about comgr before it crashes, leading many to believe that’s the problem.
I strongly recommend AMD add a dependency on libncurses5 for Ubuntu, since newer versions don’t install it by default. Oh, and you probably shouldn’t segfault if it’s not there either. Hope this helps others.
Since many people land on this issue, running clinfo under strace seems to be the most generic way to find out what’s wrong.
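A minimal sketch of that approach, assuming `strace` is installed and that clinfo lives at the path shown earlier in this issue. The filter `missing_libs` is a hypothetical helper, not something shipped with ROCm: it reads strace output and prints libraries that failed with ENOENT and were never opened successfully from another search path.

```shell
# Sketch: list shared libraries a traced program looked for but never found.
# Reads strace output on stdin. Only file names starting with "lib" are
# considered, so entries such as /etc/ld.so.preload are ignored.
missing_libs() {
    awk '
        match($0, /"[^"]*\.so[^"]*"/) {
            path = substr($0, RSTART + 1, RLENGTH - 2)   # strip the quotes
            n = split(path, parts, "/")
            name = parts[n]                               # file name only
            if (name !~ /^lib/) next
            if ($0 ~ /ENOENT/)            failed[name] = 1
            else if ($0 ~ /\) = [0-9]+/)  opened[name] = 1
        }
        END {
            for (name in failed)
                if (!(name in opened)) print name
        }
    ' | sort
}

# Usage:
#   strace -f /opt/rocm/opencl/bin/x86_64/clinfo 2>&1 | missing_libs
```

A name in the output is only a hint: some libraries are optional, so check the last few lines of the raw trace before the crash as well.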
Other errors described in this issue seem to be self-explanatory (missing kernel arguments).
The install instructions mention that you should add yourself to the video group, but I also had to add myself to the “render” group, because that group owns /dev/dri/renderD128.
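A quick way to check the group membership described above; a sketch only, where `missing_groups` is a hypothetical helper and the group names `video` and `render` are taken from this comment (other distros may use different groups for the device nodes).

```shell
# Sketch: print which of the expected groups are absent from a group list.
# $1 is a space-separated group list (for example the output of `id -nG`);
# the remaining arguments are the groups to look for.
missing_groups() {
    have=" $1 "
    shift
    for g in "$@"; do
        case "$have" in
            *" $g "*) ;;                 # already a member
            *) printf '%s\n' "$g" ;;     # not a member
        esac
    done
}

# Usage:
#   ls -l /dev/dri/renderD128                # shows which group owns the node
#   missing_groups "$(id -nG)" video render  # prints the groups you still need
#   sudo usermod -aG render "$USER"          # then log out and back in
```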
I wanted to clean up ROCm and OpenCL, so I removed everything (hopefully 2.10 as well as 3.0), and re-installed only the 3.0 versions of rocm-dev and rocm-opencl-dev. I was confident that 3.0 would work after this, because this time I had installed libcurses5. Alas, clinfo reported 0 devices, and programs that use OpenCL refuse to work. Am I missing something? Do I need to install another package for 3.0?
By the way, installing just these 2 packages is working great for me with 2.10. With 2.10 I install rocm-dev, and rocm-opencl-dev, and don’t install rocm-dkms, because I’m running Ubuntu 19.10 with kernel 5.3. I install ROCm because I needed a version of OpenCL that’s fairly new, but I only need OpenCL. So far, what I’ve seen is that if you only need OpenCL, programs that use OpenCL will happily work even if you don’t install rocm-dev, and only install rocm-opencl-dev (which installs rocm-opencl), or even only rocm-opencl.
@ableeker That will not work. You need to install version 5. Which distro version do you have? Installing this might help: ncurses-compat-libs
Well, I had this problem. I have kernel 5.4.6 working with ROCm 3.0 on a Vega 56 with an AMD 1950X CPU. I traced the problem to a bug in the AMD binary blob part of ROCm. ROCm is not 100% open source. Please see my previous posts here for the workaround, i.e. move the RAM chips. The problem is related to NUMA, so it only affects CPUs that have more than 1 node, which I think, at this point, is only some of the higher-end AMD CPUs. I have a simple C test program that probes the NUMA configuration in the same way the binary blob tries to, and demonstrates the exact case where clinfo fails. When I move the RAM chips, my simple C test program passes and clinfo then works. Unfortunately, the probing happens inside the ROCm binary blob, so it cannot be fixed without help from AMD.
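The C test program mentioned above is not included in this thread. As a rough stand-in, the sketch below only lists each NUMA node and the memory attached to it, assuming the standard Linux sysfs layout under /sys/devices/system/node. The helper name `numa_mem` is hypothetical, and the idea that a node with little or no memory is the trigger is an assumption drawn from the "move RAM chips" workaround, not something the thread confirms.

```shell
# Sketch: print "<node> <MemTotal> kB" for every NUMA node found under a base
# directory (defaults to the Linux sysfs location).
numa_mem() {
    base=${1:-/sys/devices/system/node}
    for d in "$base"/node[0-9]*; do
        [ -r "$d/meminfo" ] || continue
        # meminfo lines look like: "Node 0 MemTotal:       32862760 kB"
        awk -v node="${d##*/}" '/MemTotal/ { print node, $(NF-1), $NF }' "$d/meminfo"
    done
}

# Usage:
#   numa_mem
```

Comparing the output before and after moving the RAM would show whether the per-node memory layout actually changed.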
rkothako & all
I checked, and I have the comgr and rocm-smi packages installed, but it is still not working.
I am using openSUSE with a 5.4 kernel, so rock-dkms is not used/needed.
I have a “clean” install of just the 3.0 version from the latest zypper repository:
zypper ar http://repo.radeon.com/rocm/zyp/zypper/ rocm-repo
In my post earlier it shows the rocm related packages that were installed.
To the developers here: why don’t you just do a clean openSUSE Linux install and then add your repository and the packages? Then run clinfo and straighten out that issue, and then run some real testing on common OpenCL kernels and applications. Then post updates with a readme of tested configurations.
I do not see anyone but the most experienced users wanting to try/use the ROCm stuff at this point. Little point in spending all this effort on ROCm if no one can actually use it!
@rkothako thank you, but I am talking about an install without rock-dkms, as is required when using a recent kernel that is not supported by rock-dkms. Did you try your instructions on a system with Linux kernel 5.4 or 5.5?
Hi @preda and all, we have found the root cause of the problem, and the workaround is given below. After upgrading OpenCL-only ROCm from 2.10 to 3.0, just install these packages on top of it: comgr rocm-smi-lib64 (sudo apt install comgr rocm-smi-lib64). Then clinfo will start working.
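To see whether the two packages named in this workaround are already present on a Debian-based system, something like the following sketch can help. `missing_pkgs` is a hypothetical helper; it expects `dpkg-query` output with the status abbreviation in the first column and the package name in the second.

```shell
# Sketch: print the requested packages that are not fully installed.
# Reads "<status-abbrev> <package>" lines on stdin; arguments are package names.
missing_pkgs() {
    installed=$(awk '$1 == "ii" { print $2 }')   # names with status "ii" only
    for p in "$@"; do
        printf '%s\n' "$installed" | grep -qxF -- "$p" || printf '%s\n' "$p"
    done
}

# Usage:
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' | missing_pkgs comgr rocm-smi-lib64
```

Anything it prints can then be installed with the `sudo apt install` command from the comment above.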
@rkothako : is there a way to upgrade from ROCm 2.10 to ROCm 3.0, OpenCL only, without dkms? Please let me know how I can do this upgrade.
@rkothako could you please clarify what are the working steps for an upgrade from 2.10 to 3.0 OpenCL-only without dkms? i.e. I’m not using rocm-dkms, and most likely rocm-dkms would fail to compile anyway on the kernel I’m using (5.5).
And could you also please explain what problem is fixed by the working upgrade steps (to help our understanding)? Thanks.
Thanks all. clinfo works well with the 3.0 upgrade from 2.10, as below
We have logged an internal issue for a proper fix and are currently working on it.
Moving back to ROCm 2.10 (from 3.0) produces working OpenCL, I see these packages are installed