pocl: POCL Remote fails on AMD APU with mesa clover radeonsi driver on linux

The examples and all other programs that use opencl do not work with pocl remote. But they work otherwise on terminals that do not have the pocl env set.

The error that I got from running example1 is the following:

./example1 
[2024-01-12 16:21:38.761070069]POCL: in fn POclCreateCommandQueue at line 103:
  |   GENERAL |  Created Command Queue 3 (0x55e44efa4b70) on device 0
[2024-01-12 16:21:38.761169121]POCL: in fn POclCreateContext at line 232:
  |   GENERAL |  Created Context 2 (0x55e44ef809e0)
[2024-01-12 16:21:38.761610756]POCL: in fn POclCreateCommandQueue at line 103:
  |   GENERAL |  Created Command Queue 4 (0x55e44efa4c70) on device 0
[2024-01-12 16:21:39.066505250]POCL: in fn POclCreateKernel at line 133:
  |   GENERAL |  Created Kernel dot_product (0x55e44efab3b0)
[2024-01-12 16:21:39.066534969]POCL: in fn POclSetKernelArg at line 107:
  |   GENERAL |  Kernel     dot_product || SetArg idx   0 ||  float4* || Local 0 || Size      8 || Value 0x7ffc19fb5aa0 || Pointer 0x55e44efa9d00 || *(uint32*)Value:        0 || *(uint64*)Value:        0 ||
Hex Value:  009DFA4E E4550000
[2024-01-12 16:21:39.066547048]POCL: in fn POclSetKernelArg at line 107:
  |   GENERAL |  Kernel     dot_product || SetArg idx   1 ||  float4* || Local 0 || Size      8 || Value 0x7ffc19fb5aa8 || Pointer 0x55e44efab010 || *(uint32*)Value:        0 || *(uint64*)Value:        0 ||
Hex Value:  10B0FA4E E4550000
[2024-01-12 16:21:39.066555998]POCL: in fn POclSetKernelArg at line 107:
  |   GENERAL |  Kernel     dot_product || SetArg idx   2 ||   float* || Local 0 || Size      8 || Value 0x7ffc19fb5ab0 || Pointer 0x55e44efab1c0 || *(uint32*)Value:        0 || *(uint64*)Value:        0 ||
Hex Value:  C0B1FA4E E4550000
[2024-01-12 16:21:39.066569819]POCL: in fn pocl_kernel_calc_wg_size at line 182:
  |   GENERAL |  Preparing kernel dot_product with local size 2 x 1 x 1 group sizes 2 x 1 x 1...
[2024-01-12 16:21:39.066798759]POCL: in fn pocl_network_create_buffer at line 2065:
  |     ERROR |  Reply is FAIL: -34
[2024-01-12 16:21:39.066812847]POCL: in fn can_run_command at line 608:
  |     ERROR |  Failed to allocate 40 bytes on device AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
[2024-01-12 16:21:39.066825166]POCL: in fn pocl_ndrange_kernel_common at line 393:
  |     ERROR | errcode Error constructing command struct
[2024-01-12 16:21:39.066834381]POCL: in fn POclEnqueueNDRangeKernel at line 71:
  |     ERROR | errcode errcode != CL_SUCCESS
CL_OUT_OF_RESOURCES in exec_dot_product_kernel on line 58

On my Arch Linux x86_64 low resource test machine, I built POCL the following way and installed it:

cmake -DENABLE_HOST_CPU_DEVICES=0 -DENABLE_LLVM=0 -DENABLE_LOADABLE_DRIVERS=0 -DENABLE_ICD=1 -DENABLE_REMOTE_CLIENT=1 -DENABLE_REMOTE_SERVER=1 ..

On the same machine I opened 2 terminals.

In terminal 1 I executed

pocld -a localhost -p 7777

On the other terminal 2 I did:

export OCL_ICD_VENDORS=$PWD/ocl-vendors/pocl-tests.icd
export POCL_DEVICES=remote
export POCL_REMOTE0_PARAMETERS=localhost:7777/0

clinfo with pocl env set:

clinfo 
Number of platforms                               1
  Platform Name                                   Portable Computing Language
  Platform Vendor                                 The pocl project
  Platform Version                                OpenCL 3.0 PoCL 5.0  Linux, RelWithDebInfo, without LLVM, REMOTE, POCL_DEBUG
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_pocl_content_size
  Platform Extensions with Version                cl_khr_icd                                                       0x400000 (1.0.0)
                                                  cl_pocl_content_size                                             0x400000 (1.0.0)
  Platform Numeric Version                        0xc00000 (3.0.0)
  Platform Extensions function suffix             POCL
  Platform Host timer resolution                  0ns

  Platform Name                                   Portable Computing Language
Number of devices                                 1
  Device Name                                     AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
  Device Vendor                                   AMD
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.1 Mesa 23.2.1-arch1.2 HSTR: pocl-remote: AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
  Device Numeric Version                          0x401000 (1.1.0)
  Driver Version                                  23.2.1-arch1.2
  Device OpenCL C Version                         OpenCL C 1.2 PoCL
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Max compute units                               3
  Max clock frequency                             800MHz
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple (kernel)     0
  Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 2 / 2       
    half                                                 0 / 0        (n/a)
    float                                                4 / 4       
    double                                               2 / 2        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              8099889152 (7.544GiB)
  Error Correction support                        No
  Max memory allocation                           2024972288 (1.886GiB)
  Unified memory for Host and Device              No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       32768 bits (4096 bytes)
  Global Memory cache type                        None
  Image support                                   No
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Max number of constant args                     16
  Max constant buffer size                        67108864 (64MiB)
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     No
  Profiling timer resolution                      0ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    ILs with version                              (n/a)
  Built-in kernels with version                   (n/a)
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64 cl_khr_extended_versioning
  Device Extensions with Version                  

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Portable Computing Language
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [POCL]
  clCreateContext(NULL, ...) [default]            Success [POCL]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.3.2
  ICD loader Profile                              OpenCL 3.0

clinfo without pocl env set:

clinfo 
Number of platforms                               2
  Platform Name                                   Clover
  Platform Vendor                                 Mesa
  Platform Version                                OpenCL 1.1 Mesa 23.2.1-arch1.2
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd
  Platform Extensions function suffix             MESA

  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP (3513.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback 
  Platform Extensions function suffix             AMD
  Platform Host timer resolution                  1ns

  Platform Name                                   Clover
Number of devices                                 1
  Device Name                                     AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
  Device Vendor                                   AMD
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.1 Mesa 23.2.1-arch1.2
  Device Numeric Version                          0x401000 (1.1.0)
  Driver Version                                  23.2.1-arch1.2
  Device OpenCL C Version                         OpenCL C 1.1 
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Max compute units                               3
  Max clock frequency                             800MHz
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple (kernel)     64
  Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 2 / 2       
    half                                                 0 / 0        (n/a)
    float                                                4 / 4       
    double                                               2 / 2        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              8099889152 (7.544GiB)
  Error Correction support                        No
  Max memory allocation                           2024972288 (1.886GiB)
  Unified memory for Host and Device              No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       32768 bits (4096 bytes)
  Global Memory cache type                        None
  Image support                                   No
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Max number of constant args                     16
  Max constant buffer size                        67108864 (64MiB)
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Profiling timer resolution                      0ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    ILs with version                              SPIR-V                                                           0x400000 (1.0.0)
  Built-in kernels with version                   (n/a)
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64 cl_khr_extended_versioning
  Device Extensions with Version                  cl_khr_byte_addressable_store                                    0x400000 (1.0.0)
                                                  cl_khr_global_int32_base_atomics                                 0x400000 (1.0.0)
                                                  cl_khr_global_int32_extended_atomics                             0x400000 (1.0.0)
                                                  cl_khr_local_int32_base_atomics                                  0x400000 (1.0.0)
                                                  cl_khr_local_int32_extended_atomics                              0x400000 (1.0.0)
                                                  cl_khr_int64_base_atomics                                        0x400000 (1.0.0)
                                                  cl_khr_int64_extended_atomics                                    0x400000 (1.0.0)
                                                  cl_khr_fp64                                                      0x400000 (1.0.0)
                                                  cl_khr_extended_versioning                                       0x400000 (1.0.0)

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Clover
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [MESA]
  clCreateContext(NULL, ...) [default]            Success [MESA]
  clCreateContext(NULL, ...) [other]              
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Clover
    Device Name                                   AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Clover
    Device Name                                   AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Clover
    Device Name                                   AMD Radeon R5 Graphics (stoney, LLVM 16.0.6, DRM 3.54, 6.5.5-arch1-1)

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.3.2
  ICD loader Profile                              OpenCL 3.0

About this issue

  • Original URL
  • State: open
  • Created 6 months ago
  • Comments: 22 (10 by maintainers)

Most upvoted comments

What should I be looking for once that is built? IR dumps?

Not sure. I’d probably start by poking around mesa’s data structures and cross-reference with the source tree if this actually is a deadlock situation. Building mesa with thread sanitizer would be even nicer but I’m not sure how feasible that is.

Client side looks like it’s just waiting for the server to reply that the previous command(s) have finished running. The main thing that looks suspicious here is the pthread_mutex_lock in libMesaOpenCL’s clGetEventInfo implementation. But that is the only such call visible here so it’s not certain if that really is where it hangs. Having a look at the libMesaOpenCL.so code might shed some light on that.

pocld:

Starting program: /usr/local/bin/pocld -a localhost -p 7777

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7ffff79056c0 (LWP 142502)]
[New Thread 0x7ffff71046c0 (LWP 142503)]
[New Thread 0x7ffff69036c0 (LWP 142524)]
[New Thread 0x7ffff61026c0 (LWP 142525)]
[New Thread 0x7ffff59016c0 (LWP 142526)]
[New Thread 0x7fffd89ff6c0 (LWP 142527)]
[New Thread 0x7fffd3fff6c0 (LWP 142528)]
[New Thread 0x7fffd37fe6c0 (LWP 142529)]
[New Thread 0x7fffd2ffd6c0 (LWP 142530)]
[New Thread 0x7fffd23fc6c0 (LWP 142531)]

Thread 1 "pocld" received signal SIGINT, Interrupt.
0x00007ffff79ba4ae in ?? () from /usr/lib/libc.so.6

Thread 11 (Thread 0x7fffd23fc6c0 (LWP 142531) "pocld"):
#0  0x00007ffff79ba4ae in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff79bd055 in pthread_cond_timedwait () from /usr/lib/libc.so.6
#2  0x00005555555de6c4 in __gthread_cond_timedwait (__cond=0x7ffff0001928, __mutex=0x7ffff0001958, __abs_timeout=0x7fffd23fbcd0) at /usr/include/c++/13.2.1/x86_64-pc-linux-gnu/bits/gthr-default.h:872
#3  0x00005555555de7de in std::__condvar::wait_until (this=0x7ffff0001928, __m=..., __abs_time=...) at /usr/include/c++/13.2.1/bits/std_mutex.h:178
#4  0x00005555555e2272 in std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7ffff0001928, __lock=..., __atime=std::chrono::sys_time = { 1705497480553547344ns [2024-01-17 13:18:00] }) at /usr/include/c++/13.2.1/condition_variable:224
#5  0x00005555555e06d3 in std::condition_variable::wait_until<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7ffff0001928, __lock=..., __atime=std::chrono::sys_time = { 1705497480553547344ns [2024-01-17 13:18:00] }) at /usr/include/c++/13.2.1/condition_variable:120
#6  0x00005555555d975c in VirtualCLContext::run (this=0x7ffff0001810) at /home/linux/experiments/pocl/pocld/virtual_cl_context.cc:563
#7  0x00005555555de54d in startVirtualContextMainloop (ctx=0x7ffff0001810) at /home/linux/experiments/pocl/pocld/virtual_cl_context.cc:1120
#8  0x0000555555578092 in std::__invoke_impl<void, void (*)(VirtualContextBase*), VirtualContextBase*> (__f=@0x7ffff01f11d0: 0x5555555de52a <startVirtualContextMainloop(VirtualContextBase*)>) at /usr/include/c++/13.2.1/bits/invoke.h:61
#9  0x0000555555577ee8 in std::__invoke<void (*)(VirtualContextBase*), VirtualContextBase*> (__fn=@0x7ffff01f11d0: 0x5555555de52a <startVirtualContextMainloop(VirtualContextBase*)>) at /usr/include/c++/13.2.1/bits/invoke.h:96
#10 0x0000555555577dc5 in std::thread::_Invoker<std::tuple<void (*)(VirtualContextBase*), VirtualContextBase*> >::_M_invoke<0ul, 1ul> (this=0x7ffff01f11c8) at /usr/include/c++/13.2.1/bits/std_thread.h:292
#11 0x0000555555577d48 in std::thread::_Invoker<std::tuple<void (*)(VirtualContextBase*), VirtualContextBase*> >::operator() (this=0x7ffff01f11c8) at /usr/include/c++/13.2.1/bits/std_thread.h:299
#12 0x0000555555577cec in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(VirtualContextBase*), VirtualContextBase*> > >::_M_run (this=0x7ffff01f11c0) at /usr/include/c++/13.2.1/bits/std_thread.h:244
#13 0x00007ffff7ce1943 in std::execute_native_thread_routine (__p=0x7ffff01f11c0) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:104
#14 0x00007ffff79bd9eb in ?? () from /usr/lib/libc.so.6
#15 0x00007ffff7a417cc in ?? () from /usr/lib/libc.so.6

Thread 10 (Thread 0x7fffd2ffd6c0 (LWP 142530) "pocld:shlo0"):
#0  0x00007ffff79ba4ae in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff79bcd40 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x00007fffd92f48ac in ?? () from /usr/lib/gallium-pipe/pipe_radeonsi.so
#3  0x00007fffd9339a0c in ?? () from /usr/lib/gallium-pipe/pipe_radeonsi.so
#4  0x00007ffff79bd9eb in ?? () from /usr/lib/libc.so.6
#5  0x00007ffff7a417cc in ?? () from /usr/lib/libc.so.6

Thread 9 (Thread 0x7fffd37fe6c0 (LWP 142529) "pocld:sh0"):
#0  0x00007ffff79ba4ae in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff79bcd40 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x00007fffd92f48ac in ?? () from /usr/lib/gallium-pipe/pipe_radeonsi.so
#3  0x00007fffd9339a0c in ?? () from /usr/lib/gallium-pipe/pipe_radeonsi.so
#4  0x00007ffff79bd9eb in ?? () from /usr/lib/libc.so.6
#5  0x00007ffff7a417cc in ?? () from /usr/lib/libc.so.6

Thread 8 (Thread 0x7fffd3fff6c0 (LWP 142528) "pocld:disk$0"):
#0  0x00007ffff79ba4ae in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff79bcd40 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x00007fffd92f48ac in ?? () from /usr/lib/gallium-pipe/pipe_radeonsi.so
#3  0x00007fffd9339a0c in ?? () from /usr/lib/gallium-pipe/pipe_radeonsi.so
#4  0x00007ffff79bd9eb in ?? () from /usr/lib/libc.so.6
#5  0x00007ffff7a417cc in ?? () from /usr/lib/libc.so.6

Thread 7 (Thread 0x7fffd89ff6c0 (LWP 142527) "pocld:cs0"):
#0  0x00007ffff79ba4ae in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff79bcd40 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x00007fffd92f48ac in ?? () from /usr/lib/gallium-pipe/pipe_radeonsi.so
#3  0x00007fffd9339a0c in ?? () from /usr/lib/gallium-pipe/pipe_radeonsi.so
#4  0x00007ffff79bd9eb in ?? () from /usr/lib/libc.so.6
#5  0x00007ffff7a417cc in ?? () from /usr/lib/libc.so.6

Thread 6 (Thread 0x7ffff59016c0 (LWP 142526) "pocld"):
#0  0x00007ffff79ba4ae in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff79bd325 in pthread_cond_clockwait () from /usr/lib/libc.so.6
#2  0x00005555555f65c5 in std::__condvar::wait_until (this=0x7ffff00016f0, __m=..., __clock=1, __abs_time=...) at /usr/include/c++/13.2.1/bits/std_mutex.h:185
#3  0x00005555555f7b10 in std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7ffff00016f0, __lock=..., __atime=std::chrono::_V2::steady_clock time_point = { 412937973520249ns }) at /usr/include/c++/13.2.1/condition_variable:203
#4  0x00005555555f72b1 in std::condition_variable::wait_until<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7ffff00016f0, __lock=..., __atime=std::chrono::_V2::steady_clock time_point = { 412937973520249ns }) at /usr/include/c++/13.2.1/condition_variable:113
#5  0x00005555555f6c4a in std::condition_variable::wait_for<long, std::ratio<1l, 1l> > (this=0x7ffff00016f0, __lock=..., __rtime=std::chrono::duration = { 1s }) at /usr/include/c++/13.2.1/condition_variable:165
#6  0x00005555555f5905 in PeerHandler::handle_incoming_peers (this=0x7ffff0002360) at /home/linux/experiments/pocl/pocld/peer_handler.cc:171
#7  0x00005555555f90b2 in std::__invoke_impl<void, void (PeerHandler::*)(), PeerHandler*> (__f=@0x7ffff0002420: (void (PeerHandler::*)(PeerHandler * const)) 0x5555555f585e <PeerHandler::handle_incoming_peers()>, __t=@0x7ffff0002418: 0x7ffff0002360) at /usr/include/c++/13.2.1/bits/invoke.h:74
#8  0x00005555555f9011 in std::__invoke<void (PeerHandler::*)(), PeerHandler*> (__fn=@0x7ffff0002420: (void (PeerHandler::*)(PeerHandler * const)) 0x5555555f585e <PeerHandler::handle_incoming_peers()>) at /usr/include/c++/13.2.1/bits/invoke.h:96
#9  0x00005555555f8f81 in std::thread::_Invoker<std::tuple<void (PeerHandler::*)(), PeerHandler*> >::_M_invoke<0ul, 1ul> (this=0x7ffff0002418) at /usr/include/c++/13.2.1/bits/std_thread.h:292
#10 0x00005555555f8f3a in std::thread::_Invoker<std::tuple<void (PeerHandler::*)(), PeerHandler*> >::operator() (this=0x7ffff0002418) at /usr/include/c++/13.2.1/bits/std_thread.h:299
#11 0x00005555555f8f1e in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (PeerHandler::*)(), PeerHandler*> > >::_M_run (this=0x7ffff0002410) at /usr/include/c++/13.2.1/bits/std_thread.h:244
#12 0x00007ffff7ce1943 in std::execute_native_thread_routine (__p=0x7ffff0002410) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:104
#13 0x00007ffff79bd9eb in ?? () from /usr/lib/libc.so.6
#14 0x00007ffff7a417cc in ?? () from /usr/lib/libc.so.6

Thread 5 (Thread 0x7ffff61026c0 (LWP 142525) "pocld"):
#0  0x00007ffff79c0e01 in pthread_mutex_lock () from /usr/lib/libc.so.6
#1  0x00007ffff4a72ae7 in ?? () from /usr/lib/libMesaOpenCL.so.1
#2  0x00007ffff4a4e0a7 in ?? () from /usr/lib/libMesaOpenCL.so.1
#3  0x00007ffff7f63980 in clGetEventInfo () from /usr/lib/libOpenCL.so.1
#4  0x00005555555cedf1 in cl::detail::GetInfoFunctor0<int (*)(_cl_event*, unsigned int, unsigned long, void*, unsigned long*), _cl_event*>::operator() (this=0x7ffff6100b40, param=4563, size=4, value=0x7ffff6100bf0, size_ret=0x0) at /home/linux/experiments/pocl/pocld/../include/hpp/CL/opencl.hpp:1801
#5  0x00005555555c6cb3 in cl::detail::getInfoHelper<cl::detail::GetInfoFunctor0<int (*)(_cl_event*, unsigned int, unsigned long, void*, unsigned long*), _cl_event*>, int> (f=..., name=4563, param=0x7ffff6100bf0) at /home/linux/experiments/pocl/pocld/../include/hpp/CL/opencl.hpp:1096
#6  0x00005555555be1ef in cl::detail::getInfo<int (*)(_cl_event*, unsigned int, unsigned long, void*, unsigned long*), _cl_event*, int> (f=0x7ffff7f63910 <clGetEventInfo>, arg0=@0x7ffff01f0840: 0x7ffff00b46a0, name=4563, param=0x7ffff6100bf0) at /home/linux/experiments/pocl/pocld/../include/hpp/CL/opencl.hpp:1818
#7  0x00005555555b5ba8 in cl::Event::getInfo<int> (this=0x7ffff01f0840, name=4563, param=0x7ffff6100bf0) at /home/linux/experiments/pocl/pocld/../include/hpp/CL/opencl.hpp:3474
#8  0x00005555555ac4c8 in cl::Event::getInfo<4563u> (this=0x7ffff01f0840, err=0x0) at /home/linux/experiments/pocl/pocld/../include/hpp/CL/opencl.hpp:3486
#9  0x00005555555f0305 in ReplyQueueThread::writeThread (this=0x7ffff0002100) at /home/linux/experiments/pocl/pocld/reply_th.cc:176
#10 0x00005555555f3770 in std::__invoke_impl<void, void (ReplyQueueThread::*)(), ReplyQueueThread*> (__f=@0x7ffff00021d0: (void (ReplyQueueThread::*)(ReplyQueueThread * const)) 0x5555555eff2a <ReplyQueueThread::writeThread()>, __t=@0x7ffff00021c8: 0x7ffff0002100) at /usr/include/c++/13.2.1/bits/invoke.h:74
#11 0x00005555555f36cf in std::__invoke<void (ReplyQueueThread::*)(), ReplyQueueThread*> (__fn=@0x7ffff00021d0: (void (ReplyQueueThread::*)(ReplyQueueThread * const)) 0x5555555eff2a <ReplyQueueThread::writeThread()>) at /usr/include/c++/13.2.1/bits/invoke.h:96
#12 0x00005555555f363f in std::thread::_Invoker<std::tuple<void (ReplyQueueThread::*)(), ReplyQueueThread*> >::_M_invoke<0ul, 1ul> (this=0x7ffff00021c8) at /usr/include/c++/13.2.1/bits/std_thread.h:292
#13 0x00005555555f35f8 in std::thread::_Invoker<std::tuple<void (ReplyQueueThread::*)(), ReplyQueueThread*> >::operator() (this=0x7ffff00021c8) at /usr/include/c++/13.2.1/bits/std_thread.h:299
#14 0x00005555555f35dc in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (ReplyQueueThread::*)(), ReplyQueueThread*> > >::_M_run (this=0x7ffff00021c0) at /usr/include/c++/13.2.1/bits/std_thread.h:244
#15 0x00007ffff7ce1943 in std::execute_native_thread_routine (__p=0x7ffff00021c0) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:104
#16 0x00007ffff79bd9eb in ?? () from /usr/lib/libc.so.6
#17 0x00007ffff7a417cc in ?? () from /usr/lib/libc.so.6

Thread 4 (Thread 0x7ffff69036c0 (LWP 142524) "pocld"):
#0  0x00007ffff79ba4ae in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff79bd055 in pthread_cond_timedwait () from /usr/lib/libc.so.6
#2  0x00005555555de6c4 in __gthread_cond_timedwait (__cond=0x7ffff0001f10, __mutex=0x7ffff0001ee8, __abs_timeout=0x7ffff6901ba0) at /usr/include/c++/13.2.1/x86_64-pc-linux-gnu/bits/gthr-default.h:872
#3  0x00005555555de7de in std::__condvar::wait_until (this=0x7ffff0001f10, __m=..., __abs_time=...) at /usr/include/c++/13.2.1/bits/std_mutex.h:178
#4  0x00005555555e2272 in std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7ffff0001f10, __lock=..., __atime=std::chrono::sys_time = { 1705497479898674125ns [2024-01-17 13:17:59] }) at /usr/include/c++/13.2.1/condition_variable:224
#5  0x00005555555e06d3 in std::condition_variable::wait_until<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7ffff0001f10, __lock=..., __atime=std::chrono::sys_time = { 1705497479898674125ns [2024-01-17 13:17:59] }) at /usr/include/c++/13.2.1/condition_variable:120
#6  0x00005555555f0de0 in ReplyQueueThread::writeThread (this=0x7ffff0001ea0) at /home/linux/experiments/pocl/pocld/reply_th.cc:299
#7  0x00005555555f3770 in std::__invoke_impl<void, void (ReplyQueueThread::*)(), ReplyQueueThread*> (__f=@0x7ffff0001f70: (void (ReplyQueueThread::*)(ReplyQueueThread * const)) 0x5555555eff2a <ReplyQueueThread::writeThread()>, __t=@0x7ffff0001f68: 0x7ffff0001ea0) at /usr/include/c++/13.2.1/bits/invoke.h:74
#8  0x00005555555f36cf in std::__invoke<void (ReplyQueueThread::*)(), ReplyQueueThread*> (__fn=@0x7ffff0001f70: (void (ReplyQueueThread::*)(ReplyQueueThread * const)) 0x5555555eff2a <ReplyQueueThread::writeThread()>) at /usr/include/c++/13.2.1/bits/invoke.h:96
#9  0x00005555555f363f in std::thread::_Invoker<std::tuple<void (ReplyQueueThread::*)(), ReplyQueueThread*> >::_M_invoke<0ul, 1ul> (this=0x7ffff0001f68) at /usr/include/c++/13.2.1/bits/std_thread.h:292
#10 0x00005555555f35f8 in std::thread::_Invoker<std::tuple<void (ReplyQueueThread::*)(), ReplyQueueThread*> >::operator() (this=0x7ffff0001f68) at /usr/include/c++/13.2.1/bits/std_thread.h:299
#11 0x00005555555f35dc in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (ReplyQueueThread::*)(), ReplyQueueThread*> > >::_M_run (this=0x7ffff0001f60) at /usr/include/c++/13.2.1/bits/std_thread.h:244
#12 0x00007ffff7ce1943 in std::execute_native_thread_routine (__p=0x7ffff0001f60) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:104
#13 0x00007ffff79bd9eb in ?? () from /usr/lib/libc.so.6
#14 0x00007ffff7a417cc in ?? () from /usr/lib/libc.so.6

Thread 3 (Thread 0x7ffff71046c0 (LWP 142503) "pocld"):
#0  0x00007ffff7a33f6f in poll () from /usr/lib/libc.so.6
#1  0x00005555555699e1 in PoclDaemon::readAllClientSocketsThread (this=0x7fffffffdac0) at /home/linux/experiments/pocl/pocld/pocld.cc:590
#2  0x0000555555578133 in std::__invoke_impl<void, void (PoclDaemon::*)(), PoclDaemon*> (__f=@0x55555565a680: (void (PoclDaemon::*)(PoclDaemon * const)) 0x55555556974e <PoclDaemon::readAllClientSocketsThread()>, __t=@0x55555565a678: 0x7fffffffdac0) at /usr/include/c++/13.2.1/bits/invoke.h:74
#3  0x0000555555577f78 in std::__invoke<void (PoclDaemon::*)(), PoclDaemon*> (__fn=@0x55555565a680: (void (PoclDaemon::*)(PoclDaemon * const)) 0x55555556974e <PoclDaemon::readAllClientSocketsThread()>) at /usr/include/c++/13.2.1/bits/invoke.h:96
#4  0x0000555555577e0f in std::thread::_Invoker<std::tuple<void (PoclDaemon::*)(), PoclDaemon*> >::_M_invoke<0ul, 1ul> (this=0x55555565a678) at /usr/include/c++/13.2.1/bits/std_thread.h:292
#5  0x0000555555577d64 in std::thread::_Invoker<std::tuple<void (PoclDaemon::*)(), PoclDaemon*> >::operator() (this=0x55555565a678) at /usr/include/c++/13.2.1/bits/std_thread.h:299
#6  0x0000555555577d0c in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (PoclDaemon::*)(), PoclDaemon*> > >::_M_run (this=0x55555565a670) at /usr/include/c++/13.2.1/bits/std_thread.h:244
#7  0x00007ffff7ce1943 in std::execute_native_thread_routine (__p=0x55555565a670) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:104
#8  0x00007ffff79bd9eb in ?? () from /usr/lib/libc.so.6
#9  0x00007ffff7a417cc in ?? () from /usr/lib/libc.so.6

Thread 2 (Thread 0x7ffff79056c0 (LWP 142502) "pocld"):
#0  0x00007ffff7a4330f in accept () from /usr/lib/libc.so.6
#1  0x000055555556769c in listen_peers (data=0x7fffffffdba8) at /home/linux/experiments/pocl/pocld/pocld.cc:164
#2  0x00005555555781ac in std::__invoke_impl<int, int (*)(void*), void*> (__f=@0x55555565b5e0: 0x555555567225 <listen_peers(void*)>) at /usr/include/c++/13.2.1/bits/invoke.h:61
#3  0x0000555555578008 in std::__invoke<int (*)(void*), void*> (__fn=@0x55555565b5e0: 0x555555567225 <listen_peers(void*)>) at /usr/include/c++/13.2.1/bits/invoke.h:96
#4  0x0000555555577e59 in std::thread::_Invoker<std::tuple<int (*)(void*), void*> >::_M_invoke<0ul, 1ul> (this=0x55555565b5d8) at /usr/include/c++/13.2.1/bits/std_thread.h:292
#5  0x0000555555577d80 in std::thread::_Invoker<std::tuple<int (*)(void*), void*> >::operator() (this=0x55555565b5d8) at /usr/include/c++/13.2.1/bits/std_thread.h:299
#6  0x0000555555577d2c in std::thread::_State_impl<std::thread::_Invoker<std::tuple<int (*)(void*), void*> > >::_M_run (this=0x55555565b5d0) at /usr/include/c++/13.2.1/bits/std_thread.h:244
#7  0x00007ffff7ce1943 in std::execute_native_thread_routine (__p=0x55555565b5d0) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:104
#8  0x00007ffff79bd9eb in ?? () from /usr/lib/libc.so.6
#9  0x00007ffff7a417cc in ?? () from /usr/lib/libc.so.6

Thread 1 (Thread 0x7ffff7f26740 (LWP 142496) "pocld"):
#0  0x00007ffff79ba4ae in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff79bf5f3 in ?? () from /usr/lib/libc.so.6
#2  0x00007ffff7ce19b8 in __gthread_join (__value_ptr=0x0, __threadid=<optimized out>) at /usr/src/debug/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:669
#3  std::thread::join (this=0x7fffffffdba0) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:134
#4  0x000055555556bd48 in PoclDaemon::waitForExit (this=0x7fffffffdac0) at /home/linux/experiments/pocl/pocld/pocld.cc:314
#5  0x000055555556b045 in main (argc=5, argv=0x7fffffffdd48) at /home/linux/experiments/pocl/pocld/pocld.cc:932

pocl client:

Starting program: /home/linux/experiments/pocl/build/examples/example1/example1 

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7fffe95ff6c0 (LWP 142532)]
[New Thread 0x7fffe8dfe6c0 (LWP 142533)]
[New Thread 0x7fffe3fff6c0 (LWP 142534)]
[New Thread 0x7fffe37fe6c0 (LWP 142535)]
[New Thread 0x7fffe2ffd6c0 (LWP 142536)]

Thread 1 "example1" received signal SIGINT, Interrupt.
0x00007ffff7d0a4ae in ?? () from /usr/lib/libc.so.6

Thread 6 (Thread 0x7fffe2ffd6c0 (LWP 142536) "example1"):
#0  0x00007ffff7d0a4ae in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff7d0cd40 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x00007ffff78ab3a9 in pocl_remote_driver_pthread (cldev=0x5555555c9930) at /home/linux/experiments/pocl/lib/CL/devices/remote/remote.c:2318
#3  0x00007ffff7d0d9eb in ?? () from /usr/lib/libc.so.6
#4  0x00007ffff7d917cc in ?? () from /usr/lib/libc.so.6

Thread 5 (Thread 0x7fffe37fe6c0 (LWP 142535) "example1"):
#0  0x00007ffff7d0a4ae in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff7d0d055 in pthread_cond_timedwait () from /usr/lib/libc.so.6
#2  0x00007ffff78af91e in pocl_remote_writer_pthread (aa=0x5555555c7c90) at /home/linux/experiments/pocl/lib/CL/devices/remote/communication.c:1219
#3  0x00007ffff7d0d9eb in ?? () from /usr/lib/libc.so.6
#4  0x00007ffff7d917cc in ?? () from /usr/lib/libc.so.6

Thread 4 (Thread 0x7fffe3fff6c0 (LWP 142534) "example1"):
#0  0x00007ffff7d0a4ae in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff7d0d055 in pthread_cond_timedwait () from /usr/lib/libc.so.6
#2  0x00007ffff78af91e in pocl_remote_writer_pthread (aa=0x5555555c7c70) at /home/linux/experiments/pocl/lib/CL/devices/remote/communication.c:1219
#3  0x00007ffff7d0d9eb in ?? () from /usr/lib/libc.so.6
#4  0x00007ffff7d917cc in ?? () from /usr/lib/libc.so.6

Thread 3 (Thread 0x7fffe8dfe6c0 (LWP 142533) "example1"):
#0  0x00007ffff7d83f6f in poll () from /usr/lib/libc.so.6
#1  0x00007ffff78ae2fe in pocl_remote_reader_pthread (aa=0x5555555c9e10) at /home/linux/experiments/pocl/lib/CL/devices/remote/communication.c:722
#2  0x00007ffff7d0d9eb in ?? () from /usr/lib/libc.so.6
#3  0x00007ffff7d917cc in ?? () from /usr/lib/libc.so.6

Thread 2 (Thread 0x7fffe95ff6c0 (LWP 142532) "example1"):
#0  0x00007ffff7d83f6f in poll () from /usr/lib/libc.so.6
#1  0x00007ffff78ae2fe in pocl_remote_reader_pthread (aa=0x5555555cc900) at /home/linux/experiments/pocl/lib/CL/devices/remote/communication.c:722
#2  0x00007ffff7d0d9eb in ?? () from /usr/lib/libc.so.6
#3  0x00007ffff7d917cc in ?? () from /usr/lib/libc.so.6

Thread 1 (Thread 0x7ffff7c7e740 (LWP 142521) "example1"):
#0  0x00007ffff7d0a4ae in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff7d0cd40 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x00007ffff78a6e4a in pocl_remote_join (device=0x5555555c9930, cq=0x555555605630) at /home/linux/experiments/pocl/lib/CL/devices/remote/remote.c:1191
#3  0x00007ffff785364b in POclFinish (command_queue=0x555555605630) at /home/linux/experiments/pocl/lib/CL/clFinish.c:41
#4  0x00007ffff782d545 in POclEnqueueReadBuffer (command_queue=0x555555605630, buffer=0x55555560aa10, blocking_read=1, offset=0, size=16, ptr=0x555555606780, num_events_in_wait_list=0, event_wait_list=0x0, event=0x0) at /home/linux/experiments/pocl/lib/CL/clEnqueueReadBuffer.c:99
#5  0x00007ffff7f6410f in clEnqueueReadBuffer () from /usr/lib/libOpenCL.so.1
#6  0x0000555555556e82 in exec_dot_product_kernel (context=0x5555556031c0, device=0x5555555c9930, cmd_queue=0x555555605630, program=0x555555606bc0, n=4, srcA=0x555555605730, srcB=0x555555606730, dst=0x555555606780) at /home/linux/experiments/pocl/examples/example1/example1_exec.c:60
#7  0x000055555555676c in main (argc=1, argv=0x7fffffffdc78) at /home/linux/experiments/pocl/examples/example1/example1.c:109

Can you send a ‘thread apply all bt’ gdb dump of the deadlock situation both on the client and the server?

Out of Order Queues are not used any more since only NVIDIA and PoCL-CPU seem to advertise support for them. I don’t see anything out of place in this latest server log. There have indeed been some deadlocks, although those should be sorted by now… Unless of course they have simply become better deadlocks that don’t trip the thread sanitizer 😄

AMD GPUs or Clover have received less testing

AFAIK Clover has received exactly zero testing. ROCm should work – at least it did at one point. Rusticl (iris) from Mesa 23.3.0 appears to panic when pocld tries to build the kernel from example1.

Just in case, I recommend checking if it works when running pocld on top of PoCL-CPU.

Your build options look fine. It works here (PoCL-R is tested also with CI runners). We most frequently test with PoCL-CPU, an Intel iGPU’s OpenCL driver and NVIDIA GPU drivers. AMD GPUs or Clover CPU drivers have received less testing. It doesn’t sound like a driver specific hang though, but I recall there used to be such in the past (related to event handling and OOQ – does it ring bells @jansol?).

Yeah the socket polling code sometimes thinks there is something to read but then ends up with 0 bytes and if it happens in the wrong spot it can fail to notice that and just keeps “reading” “empty” requests in a loop. Should get fixed with the aforementioned PR.