math: OpenCL - buffer creation fails with Integrated Graphics
Description
Attempting to run OpenCL code with Intel integrated graphics gives the following error:
[ RUN ] ProbDistributionsBernoulliCdf.opencl_matches_cpu_small
exception thrown in signature const std::tuple<std::vector<int, std::allocator<int> >, Eigen::Matrix<stan::math::var_value<double, void>, -1, 1, 0, -1, 1> >&]:
unknown file: Failure
C++ exception with description "initialize_buffer: clCreateBuffer CL_INVALID_HOST_PTR: Unknown error -37" thrown in the test body.
The same error has also been reported in this forum post.
I believe this is due to the use of the CL_MEM_USE_HOST_PTR flag when creating matrix_cl objects:
buffer_cl_
    = cl::Buffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                 sizeof(T) * size(), A);  // this is always synchronous
Sharing the host memory pointer with the GPU appears to cause an issue on integrated graphics, where the CPU and GPU share the same memory. For these devices the CL_MEM_COPY_HOST_PTR flag should be used instead (and switching to it fixes the issue for me locally).
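For context, here is a minimal standalone sketch (not Stan code; device selection and buffer size are arbitrary) contrasting the two flags via the OpenCL C++ bindings. On an affected integrated GPU the first creation is where the reported CL_INVALID_HOST_PTR (-37) would show up, while the second copies the data at creation time instead of sharing the host pointer:

```cpp
// Standalone sketch contrasting CL_MEM_USE_HOST_PTR and CL_MEM_COPY_HOST_PTR.
// Assumes the OpenCL C++ bindings (<CL/opencl.hpp>, or <CL/cl2.hpp> on older
// installs) and at least one GPU device.
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/opencl.hpp>

#include <iostream>
#include <vector>

int main() {
  cl::Context ctx(CL_DEVICE_TYPE_GPU);
  std::vector<double> host(1000, 1.0);
  cl_int err = CL_SUCCESS;

  // Zero-copy style: the buffer keeps referencing the host allocation.
  // This is the path matrix_cl currently takes, and where the -37 error
  // is reported on the affected integrated GPUs.
  cl::Buffer use_ptr(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                     sizeof(double) * host.size(), host.data(), &err);
  std::cout << "CL_MEM_USE_HOST_PTR -> " << err << '\n';

  // Copying style: the runtime takes a snapshot of the host data when the
  // buffer is created, at the cost of one extra copy.
  cl::Buffer copy_ptr(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                      sizeof(double) * host.size(), host.data(), &err);
  std::cout << "CL_MEM_COPY_HOST_PTR -> " << err << '\n';
  return 0;
}
```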
Given that this incurs a copy cost, we wouldn’t want to use it in all cases, only for integrated graphics. I think a simple approach here would be to add a new make/local flag, INTEGRATED_OPENCL, which would request the use of the CL_MEM_COPY_HOST_PTR flag:
#ifdef INTEGRATED_OPENCL
buffer_cl_
    = cl::Buffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                 sizeof(T) * size(), A);  // this is always synchronous
#else
buffer_cl_
    = cl::Buffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                 sizeof(T) * size(), A);  // this is always synchronous
#endif
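If the flag were plumbed through make/local as an ordinary preprocessor define, enabling it could look like the sketch below (INTEGRATED_OPENCL is the name proposed above, not an existing option; STAN_OPENCL=true is the existing switch for the OpenCL backend):

```make
# make/local (sketch)
STAN_OPENCL=true
# hypothetical define for the proposed integrated-graphics path
CXXFLAGS += -DINTEGRATED_OPENCL
```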
@SteveBronder @rok-cesnovar I’m completely new to the OpenCL space so make sure to let me know if there’s a better or more performant option here!
Current Version:
v4.4.0
About this issue
- State: open
- Created 2 years ago
- Comments: 15
Your bitfield magic seems to work like a charm. I’ll finish out running all of the OpenCL tests locally to make sure that the integrated implementation doesn’t have any hidden issues.
Thanks for finding this! What if we pulled out the bitfield like
Then I think putting that in the mem flags would work?
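The code block from that comment is not reproduced in this excerpt, but the idea is to select the host-pointer bitfield once and reuse it in the mem flags. A minimal sketch, reusing the names from the matrix_cl snippet above (the constant name host_ptr_flag_ is illustrative, not the name used in the thread):

```cpp
// Pull the host-pointer flag out into a single constant chosen at compile
// time, then OR it into the mem flags wherever the buffer is created.
#ifdef INTEGRATED_OPENCL
static constexpr cl_mem_flags host_ptr_flag_ = CL_MEM_COPY_HOST_PTR;
#else
static constexpr cl_mem_flags host_ptr_flag_ = CL_MEM_USE_HOST_PTR;
#endif

buffer_cl_
    = cl::Buffer(ctx, CL_MEM_READ_WRITE | host_ptr_flag_,
                 sizeof(T) * size(), A);  // this is always synchronous
```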