pocl: POCL Remote fails on AMD APU with mesa clover radeonsi driver on linux
The examples and all other programs that use opencl do not work with pocl remote. But they work otherwise on terminals that do not have the pocl env set.
The error that I got from running example1 is the following:
./example1
[2024-01-12 16:21:38.761070069]POCL: in fn POclCreateCommandQueue at line 103:
| GENERAL | Created Command Queue 3 (0x55e44efa4b70) on device 0
[2024-01-12 16:21:38.761169121]POCL: in fn POclCreateContext at line 232:
| GENERAL | Created Context 2 (0x55e44ef809e0)
[2024-01-12 16:21:38.761610756]POCL: in fn POclCreateCommandQueue at line 103:
| GENERAL | Created Command Queue 4 (0x55e44efa4c70) on device 0
[2024-01-12 16:21:39.066505250]POCL: in fn POclCreateKernel at line 133:
| GENERAL | Created Kernel dot_product (0x55e44efab3b0)
[2024-01-12 16:21:39.066534969]POCL: in fn POclSetKernelArg at line 107:
| GENERAL | Kernel dot_product || SetArg idx 0 || float4* || Local 0 || Size 8 || Value 0x7ffc19fb5aa0 || Pointer 0x55e44efa9d00 || *(uint32*)Value: 0 || *(uint64*)Value: 0 ||
Hex Value: 009DFA4E E4550000
[2024-01-12 16:21:39.066547048]POCL: in fn POclSetKernelArg at line 107:
| GENERAL | Kernel dot_product || SetArg idx 1 || float4* || Local 0 || Size 8 || Value 0x7ffc19fb5aa8 || Pointer 0x55e44efab010 || *(uint32*)Value: 0 || *(uint64*)Value: 0 ||
Hex Value: 10B0FA4E E4550000
[2024-01-12 16:21:39.066555998]POCL: in fn POclSetKernelArg at line 107:
| GENERAL | Kernel dot_product || SetArg idx 2 || float* || Local 0 || Size 8 || Value 0x7ffc19fb5ab0 || Pointer 0x55e44efab1c0 || *(uint32*)Value: 0 || *(uint64*)Value: 0 ||
Hex Value: C0B1FA4E E4550000
[2024-01-12 16:21:39.066569819]POCL: in fn pocl_kernel_calc_wg_size at line 182:
| GENERAL | Preparing kernel dot_product with local size 2 x 1 x 1 group sizes 2 x 1 x 1...
[2024-01-12 16:21:39.066798759]POCL: in fn pocl_network_create_buffer at line 2065:
| ERROR | Reply is FAIL: -34
[2024-01-12 16:21:39.066812847]POCL: in fn can_run_command at line 608:
| ERROR | Failed to allocate 40 bytes on device AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
[2024-01-12 16:21:39.066825166]POCL: in fn pocl_ndrange_kernel_common at line 393:
| ERROR | errcode Error constructing command struct
[2024-01-12 16:21:39.066834381]POCL: in fn POclEnqueueNDRangeKernel at line 71:
| ERROR | errcode errcode != CL_SUCCESS
CL_OUT_OF_RESOURCES in exec_dot_product_kernel on line 58
On my Arch Linux x86_64 low resource test machine, I built POCL the following way and installed it:
cmake -DENABLE_HOST_CPU_DEVICES=0 -DENABLE_LLVM=0 -DENABLE_LOADABLE_DRIVERS=0 -DENABLE_ICD=1 -DENABLE_REMOTE_CLIENT=1 -DENABLE_REMOTE_SERVER=1 ..
On the same machine I opened 2 terminals.
In terminal 1 I executed
pocld -a localhost -p 7777
On the other terminal 2 I did:
export OCL_ICD_VENDORS=$PWD/ocl-vendors/pocl-tests.icd
export POCL_DEVICES=remote
export POCL_REMOTE0_PARAMETERS=localhost:7777/0
clinfo with pocl env set:
clinfo
Number of platforms 1
Platform Name Portable Computing Language
Platform Vendor The pocl project
Platform Version OpenCL 3.0 PoCL 5.0 Linux, RelWithDebInfo, without LLVM, REMOTE, POCL_DEBUG
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_pocl_content_size
Platform Extensions with Version cl_khr_icd 0x400000 (1.0.0)
cl_pocl_content_size 0x400000 (1.0.0)
Platform Numeric Version 0xc00000 (3.0.0)
Platform Extensions function suffix POCL
Platform Host timer resolution 0ns
Platform Name Portable Computing Language
Number of devices 1
Device Name AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
Device Vendor AMD
Device Vendor ID 0x1002
Device Version OpenCL 1.1 Mesa 23.2.1-arch1.2 HSTR: pocl-remote: AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
Device Numeric Version 0x401000 (1.1.0)
Driver Version 23.2.1-arch1.2
Device OpenCL C Version OpenCL C 1.2 PoCL
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Max compute units 3
Max clock frequency 800MHz
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple (kernel) 0
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 2 / 2
half 0 / 0 (n/a)
float 4 / 4
double 2 / 2 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 8099889152 (7.544GiB)
Error Correction support No
Max memory allocation 2024972288 (1.886GiB)
Unified memory for Host and Device No
Minimum alignment for any data type 128 bytes
Alignment of base address 32768 bits (4096 bytes)
Global Memory cache type None
Image support No
Local memory type Local
Local memory size 65536 (64KiB)
Max number of constant args 16
Max constant buffer size 67108864 (64MiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling No
Profiling timer resolution 0ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
ILs with version (n/a)
Built-in kernels with version (n/a)
Device Extensions cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64 cl_khr_extended_versioning
Device Extensions with Version
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Portable Computing Language
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [POCL]
clCreateContext(NULL, ...) [default] Success [POCL]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Portable Computing Language
Device Name AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Portable Computing Language
Device Name AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Portable Computing Language
Device Name AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.3.2
ICD loader Profile OpenCL 3.0
clinfo without pocl env set:
clinfo
Number of platforms 2
Platform Name Clover
Platform Vendor Mesa
Platform Version OpenCL 1.1 Mesa 23.2.1-arch1.2
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd
Platform Extensions function suffix MESA
Platform Name AMD Accelerated Parallel Processing
Platform Vendor Advanced Micro Devices, Inc.
Platform Version OpenCL 2.1 AMD-APP (3513.0)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_amd_event_callback
Platform Extensions function suffix AMD
Platform Host timer resolution 1ns
Platform Name Clover
Number of devices 1
Device Name AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
Device Vendor AMD
Device Vendor ID 0x1002
Device Version OpenCL 1.1 Mesa 23.2.1-arch1.2
Device Numeric Version 0x401000 (1.1.0)
Driver Version 23.2.1-arch1.2
Device OpenCL C Version OpenCL C 1.1
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Max compute units 3
Max clock frequency 800MHz
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple (kernel) 64
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 2 / 2
half 0 / 0 (n/a)
float 4 / 4
double 2 / 2 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 8099889152 (7.544GiB)
Error Correction support No
Max memory allocation 2024972288 (1.886GiB)
Unified memory for Host and Device No
Minimum alignment for any data type 128 bytes
Alignment of base address 32768 bits (4096 bytes)
Global Memory cache type None
Image support No
Local memory type Local
Local memory size 65536 (64KiB)
Max number of constant args 16
Max constant buffer size 67108864 (64MiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Profiling timer resolution 0ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
ILs with version SPIR-V 0x400000 (1.0.0)
Built-in kernels with version (n/a)
Device Extensions cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64 cl_khr_extended_versioning
Device Extensions with Version cl_khr_byte_addressable_store 0x400000 (1.0.0)
cl_khr_global_int32_base_atomics 0x400000 (1.0.0)
cl_khr_global_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_local_int32_base_atomics 0x400000 (1.0.0)
cl_khr_local_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_int64_base_atomics 0x400000 (1.0.0)
cl_khr_int64_extended_atomics 0x400000 (1.0.0)
cl_khr_fp64 0x400000 (1.0.0)
cl_khr_extended_versioning 0x400000 (1.0.0)
Platform Name AMD Accelerated Parallel Processing
Number of devices 0
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Clover
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [MESA]
clCreateContext(NULL, ...) [default] Success [MESA]
clCreateContext(NULL, ...) [other]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Clover
Device Name AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Clover
Device Name AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Clover
Device Name AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.3.2
ICD loader Profile OpenCL 3.0
About this issue
- Original URL
- State: open
- Created 6 months ago
- Comments: 22 (10 by maintainers)
Not sure. I’d probably start by poking around mesa’s data structures and cross-reference with the source tree if this actually is a deadlock situation. Building mesa with thread sanitizer would be even nicer but I’m not sure how feasible that is.
Client side looks like it’s just waiting for the server to reply that the previous command(s) have finished running. The main thing that looks suspicious here is the
pthread_mutex_lock
in libMesaOpenCL’sclGetEventInfo
implementation. But that is the only such call visible here so it’s not certain if that really is where it hangs. Having a look at the libMesaOpenCL.so code might shed some light on that.pocld:
pocl client:
Can you send a ‘thread apply all bt’ gdb dump of the deadlock situation both on the client and the server?
Out of Order Queues are not used any more since only NVIDIA and PoCL-CPU seem to advertise support for them. I don’t see anything out of place in this latest server log. There have indeed been some deadlocks, although those should be sorted by now… Unless of course they have simply become better deadlocks that don’t trip the thread sanitizer 😄
AFAIK Clover has received exactly zero testing. ROCm should work – at least it did at one point. Rusticl (iris) from Mesa 23.3.0 appears to panic when pocld tries to build the kernel from example1.
Just in case, I recommend checking if it works when running pocld on top of PoCL-CPU.
Your build options look fine. It works here (PoCL-R is tested also with CI runners). We most frequently test with PoCL-CPU, an Intel iGPU’s OpenCL driver and NVIDIA GPU drivers. AMD GPUs or Clover CPU drivers have received less testing. It doesn’t sound like a driver specific hang though, but I recall there used to be such in the past (related to event handling and OOQ – does it ring bells @jansol?).
Yeah the socket polling code sometimes thinks there is something to read but then ends up with 0 bytes and if it happens in the wrong spot it can fail to notice that and just keeps “reading” “empty” requests in a loop. Should get fixed with the aforementioned PR.