llama.cpp: Unable to use Intel UHD GPU acceleration with BLAS

Expected Behavior

The GPU should be used during inference.

Current Behavior

Here’s how I built the software:

I cloned the repo with git clone https://github.com/ggerganov/llama.cpp, extracted the w64devkit Fortran variant somewhere, copied the required OpenBLAS files into its folders, ran w64devkit.exe, cd'd into my llama.cpp folder, and built with make LLAMA_OPENBLAS=1.

Then I followed the “Intel MKL” section:

mkdir build
cd build
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release

Finally, I ran the app with:

.\build\bin\Release\main.exe -m ./models/7B/ggml-model-q4_0.bin -n 128 --interactive-first --color --threads 4 --mlock

But my iGPU sits at 2-3% while my CPU is at 70-80% during inference. Generation runs at a few words per second on the 7B model, which is not bad for a modest Intel laptop CPU.

Environment and Context

  • Physical hardware: Windows 11 laptop, Intel i7-8565U (4c/8t), 16 GB RAM, Intel UHD 620

  • Operating System: Windows 11 22H2; Python 3.11.3; CMake 3.26.4

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 18 (4 by maintainers)

Most upvoted comments

I believe that Intel oneMKL should actually run on an Intel GPU: https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-0/offloading-onemkl-computations-onto-the-gpu.html
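
To make that concrete, here is a minimal sketch of a GEMM executed on an Intel GPU through the oneMKL SYCL (DPC++) interface. This is not llama.cpp code: the sizes are made up, error handling is omitted, and it assumes a recent oneAPI Base Toolkit with something like icpx -fsycl -qmkl for compiling and linking.

// Minimal sketch: single-precision GEMM on an Intel GPU via oneMKL's
// SYCL (DPC++) interface. Illustrative only; not taken from llama.cpp.
#include <oneapi/mkl.hpp>
#include <sycl/sycl.hpp>
#include <algorithm>
#include <cstdint>
#include <vector>

int main() {
    // Ask for a GPU device explicitly; throws if no GPU is available.
    sycl::queue q{sycl::gpu_selector_v};

    const std::int64_t m = 512, n = 512, k = 512;   // arbitrary sizes
    std::vector<float> a(m * k, 1.0f), b(k * n, 1.0f), c(m * n, 0.0f);

    // Unified shared memory: visible to both the CPU and the iGPU.
    float *da = sycl::malloc_shared<float>(a.size(), q);
    float *db = sycl::malloc_shared<float>(b.size(), q);
    float *dc = sycl::malloc_shared<float>(c.size(), q);
    std::copy(a.begin(), a.end(), da);
    std::copy(b.begin(), b.end(), db);
    std::copy(c.begin(), c.end(), dc);

    // C = 1.0 * A * B + 0.0 * C, submitted to the GPU queue.
    oneapi::mkl::blas::column_major::gemm(
        q, oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
        m, n, k, 1.0f, da, m, db, k, 0.0f, dc, m);
    q.wait();

    sycl::free(da, q);
    sycl::free(db, q);
    sycl::free(dc, q);
    return 0;
}

Whether routing llama.cpp's prompt-time matrix multiplications through this path actually helps on a UHD 620 is a separate question; the sketch only shows that oneMKL BLAS calls can be dispatched to the GPU.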

I think we should bring this issue back; offloading at least the prompt evaluation to the iGPU would be very valuable.

If you had a dedicated GPU, bringing prompt evaluation below 60 s for 1000 tokens would be very much doable.

It has no VRAM; it’s just system RAM being used as VRAM. The BIOS does allocate some for it, but that’s more for legacy purposes AFAIK; it will just use whatever it needs.

The good thing is that you don’t need to copy data from VRAM to RAM to access it on the CPU; the memory is always shared by both (see the sketch after the quoted message below).

On Tue, Jun 13, 2023, 23:20, Sunija @.***> wrote:

The OpenCL code in llama.cpp can run 4-bit generation on the GPU now, too, but it requires the model to be loaded to VRAM, which integrated GPUs don’t have or have very little.

According to the task manager there’s 8 GB of Shared GPU memory/GPU memory. Does that count as VRAM in that context? Or does the Intel UHD 620 just have no VRAM?

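To illustrate the shared-memory point from the reply above, here is a rough OpenCL sketch (not llama.cpp code) of a zero-copy buffer on an integrated GPU: allocating with CL_MEM_ALLOC_HOST_PTR and mapping it gives the CPU a pointer into the same system RAM the GPU kernels read, so no VRAM-to-RAM transfer is needed. Error handling is omitted and the buffer size is arbitrary.

// Sketch: zero-copy buffer shared between the CPU and an integrated GPU.
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue =
        clCreateCommandQueueWithProperties(ctx, device, nullptr, nullptr);

    const size_t n = 1 << 20;
    // CL_MEM_ALLOC_HOST_PTR: the buffer lives in host-visible memory, which
    // on an iGPU is simply the shared system RAM.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                n * sizeof(float), nullptr, nullptr);

    // Map the buffer: the CPU writes through this pointer directly, and the
    // GPU sees the same bytes -- no separate upload step is required.
    float *host = static_cast<float *>(clEnqueueMapBuffer(
        queue, buf, CL_TRUE, CL_MAP_WRITE, 0, n * sizeof(float),
        0, nullptr, nullptr, nullptr));
    for (size_t i = 0; i < n; ++i) host[i] = 1.0f;
    clEnqueueUnmapMemObject(queue, buf, host, 0, nullptr, nullptr);
    clFinish(queue);

    printf("%zu floats shared between CPU and iGPU without a copy\n", n);
    clReleaseMemObject(buf);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}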

The provided Windows build with CLBlast using OpenCL should work but I wouldn’t expect any significant performance gains from integrated graphics.

Last I checked, Intel MKL is a CPU-only library. It will not use the iGPU.

Also, AFAIK the “BLAS” part is only used for prompt processing. The actual text generation uses custom code for CPUs and accelerators.

You could put load on the iGPU with CLBlast, but it might not actually speed things up because of the extra copies (see the sketch below). There isn’t really a backend specifically targeting iGPUs yet.
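
As a rough illustration of those extra copies, here is a hedged sketch (not llama.cpp's actual code path) of pushing an SGEMM to the iGPU through CLBlast. The explicit write/read buffer calls are the host-to-device round trips that, on an integrated GPU, shuffle data around inside the same physical RAM and can eat whatever the GPU gains. Sizes and names are made up.

// Sketch: SGEMM on the iGPU via CLBlast, with the explicit copies shown.
#include <CL/cl.h>
#include <clblast.h>
#include <vector>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue =
        clCreateCommandQueueWithProperties(ctx, device, nullptr, nullptr);

    const size_t m = 256, n = 256, k = 256;   // arbitrary sizes
    std::vector<float> a(m * k, 1.0f), b(k * n, 1.0f), c(m * n, 0.0f);

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  a.size() * sizeof(float), nullptr, nullptr);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  b.size() * sizeof(float), nullptr, nullptr);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_READ_WRITE, c.size() * sizeof(float), nullptr, nullptr);

    // Extra copy #1: host -> device, even though an iGPU has no separate VRAM.
    clEnqueueWriteBuffer(queue, da, CL_TRUE, 0, a.size() * sizeof(float), a.data(), 0, nullptr, nullptr);
    clEnqueueWriteBuffer(queue, db, CL_TRUE, 0, b.size() * sizeof(float), b.data(), 0, nullptr, nullptr);

    // C = A * B, computed on the GPU by CLBlast.
    clblast::Gemm<float>(clblast::Layout::kRowMajor,
                         clblast::Transpose::kNo, clblast::Transpose::kNo,
                         m, n, k, 1.0f, da, 0, k, db, 0, n, 0.0f, dc, 0, n,
                         &queue, nullptr);

    // Extra copy #2: device -> host, to hand the result back to the CPU code.
    clEnqueueReadBuffer(queue, dc, CL_TRUE, 0, c.size() * sizeof(float), c.data(), 0, nullptr, nullptr);

    clReleaseMemObject(da);
    clReleaseMemObject(db);
    clReleaseMemObject(dc);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}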

Yeah, the documentation is a bit lacking.