oneMKL: `gemm` throws exception on PVC
Summary
I am trying to call `gemm` on a 4oam PVC system on ORTCE, but it throws an exception every time. The exception occurs with both production icpx and the most recent version of intel/llvm, each built against production oneMKL. Please let me know where I’m going wrong.
A minimal reproducer is attached below.
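// T is the element type (the failing kernel name, sgemm_itcopy, suggests float);
// m, n, and k are the GEMM dimensions.
sycl::queue q(sycl::default_selector_v);

// USM device buffers for row-major A (m x k), B (k x n), and C (m x n).
T* a_d = sycl::malloc_device<T>(m * k, q);
T* b_d = sycl::malloc_device<T>(k * n, q);
T* c_d = sycl::malloc_device<T>(m * n, q);

// Host copies: A and B hold random values, C starts at zero.
std::vector<T> a_l(m * k);
std::vector<T> b_l(k * n);
std::vector<T> c_l(m * n, 0);

for (std::size_t i = 0; i < m * k; i++) {
  a_l[i] = drand48();
}

for (std::size_t i = 0; i < k * n; i++) {
  b_l[i] = drand48();
}

// Copy the host data to the device before calling gemm.
q.memcpy(a_d, a_l.data(), m * k * sizeof(T)).wait();
q.memcpy(b_d, b_l.data(), k * n * sizeof(T)).wait();
q.memcpy(c_d, c_l.data(), m * n * sizeof(T)).wait();

std::cout << "Running MKL gemm..." << std::endl;

// C = 1 * A * B + 1 * C, row-major, no transposes (lda = k, ldb = n, ldc = n).
auto event = oneapi::mkl::blas::row_major::gemm(q,
                                                oneapi::mkl::transpose::nontrans,
                                                oneapi::mkl::transpose::nontrans,
                                                m, n, k,
                                                T(1),
                                                a_d, k,
                                                b_d, n,
                                                T(1),
                                                c_d, n);
event.wait();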
This throws the following exception:
(base) bbrock@sdp4452:~/src/issues/oneMKL_gemm$ ./gemm
Running MKL gemm...
terminate called after throwing an instance of 'sycl::_V1::exception'
what(): Level-Zero error:700000041879048196
On device: 'Intel(R) Graphics [0x0bd5]'
in kernel: oneapi::mkl::blas::sgemm_itcopy
Aborted (core dumped)
As far as I can tell, I am allocating enough memory, and all of the pointers I’m passing in are USM device pointers, which should be accessible on the device associated with the queue passed to oneMKL.
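One way to double-check the allocation sizes independently of the GPU is a host-only reference. The sketch below is not part of the original reproducer: `GemmSizes`, `required_elements`, and `reference_gemm` are hypothetical helper names, and it assumes the usual BLAS row-major convention that a non-transposed matrix with leading dimension `ld` needs `rows * ld` elements. Under that assumption, `lda = k`, `ldb = n`, and `ldc = n` give `m*k`, `k*n`, and `m*n` elements, which matches the three `malloc_device` calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimum element counts for a row-major, non-transposed GEMM:
// C (m x n) = alpha * A (m x k) * B (k x n) + beta * C.
struct GemmSizes {
  std::size_t a, b, c;
};

GemmSizes required_elements(std::size_t m, std::size_t n, std::size_t k,
                            std::size_t lda, std::size_t ldb,
                            std::size_t ldc) {
  // In row-major layout the leading dimension is the row stride, so it
  // must be at least the number of columns of the matrix.
  assert(lda >= k && ldb >= n && ldc >= n);
  return {m * lda, k * ldb, m * ldc};
}

// Plain triple-loop host reference using the same argument order and
// leading dimensions as the gemm call in the reproducer.
template <typename T>
void reference_gemm(std::size_t m, std::size_t n, std::size_t k, T alpha,
                    const std::vector<T>& a, std::size_t lda,
                    const std::vector<T>& b, std::size_t ldb, T beta,
                    std::vector<T>& c, std::size_t ldc) {
  const GemmSizes need = required_elements(m, n, k, lda, ldb, ldc);
  assert(a.size() >= need.a && b.size() >= need.b && c.size() >= need.c);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      T acc = T(0);
      for (std::size_t p = 0; p < k; ++p) {
        acc += a[i * lda + p] * b[p * ldb + j];
      }
      c[i * ldc + j] = alpha * acc + beta * c[i * ldc + j];
    }
  }
}
```

Calling `reference_gemm(m, n, k, T(1), a_l, k, b_l, n, T(1), c_l, n)` on the host vectors would exercise the same indexing as the device call. If its size assertions hold, the buffer sizes are unlikely to be the cause of the exception.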
Version
I am using production oneMKL 2023.1.0.
Environment
I am running this on a machine with four PVC GPUs.
(base) bbrock@sdp125071:~/src/distributed-ranges/examples/shp$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.15.3.0.20_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel (R) Xeon (R) CPU Max 9480 OpenCL 3.0 (Build 0) [2023.15.3.0.20_160000]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
[ext_oneapi_level_zero:gpu:3] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
I am getting this error with both the most recent commit of intel/llvm and with production icpx.
(base) bbrock@sdp125071:~/src/distributed-ranges/examples/shp$ icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/../bin/icpx.cfg
Steps to reproduce
(base) bbrock@sdp125071:~/src/issues/oneMKL_gemm$ ./gemm
MESA: warning: Driver does not support the 0xbd5 PCI ID.
MESA: warning: Driver does not support the 0xbd5 PCI ID.
MESA: warning: Driver does not support the 0xbd5 PCI ID.
MESA: warning: Driver does not support the 0xbd5 PCI ID.
Running MKL gemm...
terminate called after throwing an instance of 'sycl::_V1::exception'
what(): Level-Zero error:700000041879048196
On device: 'Intel(R) Graphics [0x0bd5]'
in kernel: oneapi::mkl::blas::sgemm_itcopy
Aborted (core dumped)
Observed behavior
Throws an exception as above.
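The number in the message looks too long for a single error code. One possible reading, offered here as an observation rather than a confirmed explanation, is that it is the same value printed twice back to back: `70000004` in hexadecimal followed by `1879048196` in decimal. The sketch below uses a hypothetical helper, `split_hex_then_decimal`, to check that reading with standard C++ only. If the code really is `0x70000004`, it may correspond to `ZE_RESULT_ERROR_MODULE_BUILD_FAILURE` in the Level Zero headers. That mapping is an assumption that should be verified against `ze_api.h`, but it would point at kernel compilation rather than at the memory allocations.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if `digits` can be split into a prefix
// and a suffix such that the prefix parsed as hexadecimal equals the suffix
// parsed as decimal. On success the shared value is stored in `value`.
bool split_hex_then_decimal(const std::string& digits, std::uint64_t& value) {
  for (std::size_t cut = 1; cut < digits.size(); ++cut) {
    try {
      std::size_t used_hex = 0;
      std::size_t used_dec = 0;
      const std::uint64_t hex =
          std::stoull(digits.substr(0, cut), &used_hex, 16);
      const std::uint64_t dec = std::stoull(digits.substr(cut), &used_dec, 10);
      // Both halves must be consumed completely and agree on the value.
      if (used_hex == cut && used_dec == digits.size() - cut && hex == dec) {
        value = hex;
        return true;
      }
    } catch (const std::exception&) {
      // Not parseable (or out of range) at this split point; try the next.
    }
  }
  return false;
}
```

Passing the string from the log, `"700000041879048196"`, to this helper splits it after the eighth character, since `0x70000004` and `1879048196` are the same value.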
Expected behavior
I expect the kernel to execute successfully.
About this issue
- State: open
- Created a year ago
- Comments: 16 (6 by maintainers)
I would vote for correct functionality + warning. 😃 Is it ok to close this issue?
Hi @maleadt - thanks for your work on oneAPI.jl! The Intel oneMKL product currently requires the OpenCL GPU runtime even when the Level-Zero backend is used. Could you please install it and see if that resolves the issue?