oneMKL: `gemm` throws exception on PVC

Summary

I’m trying to use gemm on PVC, but it keeps throwing an exception. Please let me know where I’m going wrong.

I am running gemm on a 4-OAM PVC system on ORTCE. The exception is thrown with both production icpx and the most recent version of intel/llvm, in both cases built against production oneMKL.

A minimal reproducer is shown below; the full source is attached as oneMKL_gemm.tar.gz.

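  #include <sycl/sycl.hpp>
  #include <oneapi/mkl.hpp>

  #include <cstddef>
  #include <cstdlib>
  #include <iostream>
  #include <vector>

  // Not shown in the original snippet, assumed here so it builds standalone:
  // T = float (the failing kernel is sgemm_itcopy) and placeholder sizes.
  using T = float;

  int main() {
    const std::size_t m = 1024, n = 1024, k = 1024;

    sycl::queue q(sycl::default_selector_v);

    // USM device allocations for A (m x k), B (k x n), and C (m x n).
    T* a_d = sycl::malloc_device<T>(m * k, q);
    T* b_d = sycl::malloc_device<T>(k * n, q);
    T* c_d = sycl::malloc_device<T>(m * n, q);

    // Host-side inputs: random A and B, zero-initialized C.
    std::vector<T> a_l(m*k);
    std::vector<T> b_l(k*n);
    std::vector<T> c_l(m*n, 0);

    for (std::size_t i = 0; i < m*k; i++) {
      a_l[i] = drand48();
    }

    for (std::size_t i = 0; i < k*n; i++) {
      b_l[i] = drand48();
    }

    // Copy the host data to the device.
    q.memcpy(a_d, a_l.data(), m*k*sizeof(T)).wait();
    q.memcpy(b_d, b_l.data(), k*n*sizeof(T)).wait();
    q.memcpy(c_d, c_l.data(), m*n*sizeof(T)).wait();

    std::cout << "Running MKL gemm..." << std::endl;

    // C = 1 * A * B + 1 * C, row-major, no transposes (lda = k, ldb = n, ldc = n).
    auto event = oneapi::mkl::blas::row_major::gemm(q,
      oneapi::mkl::transpose::nontrans,
      oneapi::mkl::transpose::nontrans,
      m, n, k,
      T(1),
      a_d, k,
      b_d, n,
      T(1),
      c_d, n);
    event.wait();

    sycl::free(a_d, q);
    sycl::free(b_d, q);
    sycl::free(c_d, q);
    return 0;
  }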

This throws the following exception:

(base) bbrock@sdp4452:~/src/issues/oneMKL_gemm$ ./gemm
Running MKL gemm...
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Level-Zero error:700000041879048196
On device: 'Intel(R) Graphics [0x0bd5]'
in kernel: oneapi::mkl::blas::sgemm_itcopy
Aborted (core dumped)
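
The number in the Level-Zero error message can be read as one ze_result_t value printed twice with no separator: 70000004 in hexadecimal, followed by 1879048196, which is the same value in decimal. To my knowledge 0x70000004 is ZE_RESULT_ERROR_MODULE_BUILD_FAILURE in ze_api.h, but that mapping is recalled from the header and is not stated anywhere in this report, so it should be checked against the installed Level Zero headers. The sketch below is plain C++; split_ze_error and ze_result_name are hypothetical helper names, not part of any Intel library.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>

// Assumption: the runtime printed the result code as 8 hex digits followed
// directly by the same value in decimal. Returns true and stores the code in
// `out` only if both halves parse fully and agree with each other.
bool split_ze_error(const std::string& digits, std::uint32_t& out) {
  if (digits.size() <= 8) return false;
  const std::string hex_part = digits.substr(0, 8);
  const std::string dec_part = digits.substr(8);
  try {
    std::size_t used_hex = 0, used_dec = 0;
    const unsigned long long hex_value = std::stoull(hex_part, &used_hex, 16);
    const unsigned long long dec_value = std::stoull(dec_part, &used_dec, 10);
    if (used_hex != hex_part.size() || used_dec != dec_part.size()) return false;
    if (hex_value != dec_value) return false;
    out = static_cast<std::uint32_t>(hex_value);
    return true;
  } catch (const std::exception&) {
    return false;  // not numeric, or out of range
  }
}

// Assumption: a few ze_result_t values as recalled from ze_api.h.
// Verify against the header shipped with your driver.
const char* ze_result_name(std::uint32_t code) {
  switch (code) {
    case 0x70000001u: return "ZE_RESULT_ERROR_DEVICE_LOST";
    case 0x70000002u: return "ZE_RESULT_ERROR_OUT_OF_HOST_MEMORY";
    case 0x70000003u: return "ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY";
    case 0x70000004u: return "ZE_RESULT_ERROR_MODULE_BUILD_FAILURE";
    default:          return "unknown (not in this short table)";
  }
}
```

Calling split_ze_error("700000041879048196", code) succeeds with code == 0x70000004. If the mapping holds, the failure is in building the kernel module rather than in the memory arguments.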

As far as I can tell, I am allocating enough memory, and all of the pointers I’m passing in are USM device pointers, which should be accessible on the device associated with the queue passed to oneMKL.
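
The "enough memory" claim can be checked on the host before calling gemm. For a row-major, non-transposed call, A is m x k with lda >= k, B is k x n with ldb >= n, and C is m x n with ldc >= n, and each array should hold at least rows * ld elements. The following is a minimal sketch in plain C++ under those assumptions; GemmShape, matrix_fits, and gemm_buffers_fit are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>

// Shape and leading dimensions of a row-major, non-transposed GEMM call.
struct GemmShape {
  std::size_t m, n, k;        // C (m x n) = A (m x k) * B (k x n)
  std::size_t lda, ldb, ldc;  // leading dimensions of A, B, and C
};

// A rows x cols row-major matrix with leading dimension ld needs ld >= cols
// and an allocation of at least rows * ld elements.
bool matrix_fits(std::size_t rows, std::size_t cols, std::size_t ld,
                 std::size_t allocated_elems) {
  return ld >= cols && allocated_elems >= rows * ld;
}

// True if all three buffers are large enough for the given shape.
bool gemm_buffers_fit(const GemmShape& s, std::size_t a_elems,
                      std::size_t b_elems, std::size_t c_elems) {
  return matrix_fits(s.m, s.k, s.lda, a_elems) &&
         matrix_fits(s.k, s.n, s.ldb, b_elems) &&
         matrix_fits(s.m, s.n, s.ldc, c_elems);
}
```

With the reproducer's arguments (lda = k, ldb = n, ldc = n, and allocations of m*k, k*n, and m*n elements), gemm_buffers_fit returns true, which is consistent with the statement above that the allocations are large enough.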

Version

I am using production oneMKL 2023.1.0.

Environment

I am running this on a machine with four PVC GPUs.

(base) bbrock@sdp125071:~/src/distributed-ranges/examples/shp$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.15.3.0.20_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel (R) Xeon (R) CPU Max 9480 OpenCL 3.0 (Build 0) [2023.15.3.0.20_160000]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
[ext_oneapi_level_zero:gpu:3] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]

I am getting this error with both the most recent commit of intel/llvm and with production icpx.

(base) bbrock@sdp125071:~/src/distributed-ranges/examples/shp$ icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/../bin/icpx.cfg

Steps to reproduce

(base) bbrock@sdp125071:~/src/issues/oneMKL_gemm$ ./gemm
MESA: warning: Driver does not support the 0xbd5 PCI ID.
MESA: warning: Driver does not support the 0xbd5 PCI ID.
MESA: warning: Driver does not support the 0xbd5 PCI ID.
MESA: warning: Driver does not support the 0xbd5 PCI ID.
Running MKL gemm...
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Level-Zero error:700000041879048196
On device: 'Intel(R) Graphics [0x0bd5]'
in kernel: oneapi::mkl::blas::sgemm_itcopy
Aborted (core dumped)

Observed behavior

The program throws the exception shown above.

Expected behavior

I expect the kernel to execute successfully.

oneMKL_gemm.tar.gz

About this issue

State: open
Created a year ago
Comments: 16 (6 by maintainers)

Most upvoted comments

I would vote for correct functionality + warning. 😃 Is it ok to close this issue?

mmeterel on Apr 15, 2024

Hi @maleadt - thanks for your work on oneAPI.jl! The Intel oneMKL product currently requires the OpenCL GPU runtime even when the Level-Zero backend is used. Could you please install it and see if that resolves the issue?

sknepper on Apr 1, 2024
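
The sycl-ls listing in the Environment section shows opencl:acc, opencl:cpu, and ext_oneapi_level_zero:gpu entries but no opencl:gpu entry. That may mean the OpenCL GPU runtime mentioned in the comment above is not installed or not visible on that machine, although device filtering could also hide it. As a rough sketch, the check can be automated by scanning captured sycl-ls output for a line tag; has_device_tag is a hypothetical helper in plain C++, and the "[backend:type:" prefix format is taken from the listing in this report.

```cpp
#include <sstream>
#include <string>

// Returns true if any line of the captured sycl-ls output starts with `tag`,
// for example "[opencl:gpu:" or "[ext_oneapi_level_zero:gpu:".
bool has_device_tag(const std::string& sycl_ls_output, const std::string& tag) {
  std::istringstream in(sycl_ls_output);
  std::string line;
  while (std::getline(in, line)) {
    // compare() returns 0 only when the first tag.size() characters match.
    if (line.compare(0, tag.size(), tag) == 0) return true;
  }
  return false;
}
```

Applied to the listing in this report, has_device_tag(output, "[opencl:gpu:") is false while has_device_tag(output, "[ext_oneapi_level_zero:gpu:") is true.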