llama.cpp: Unable to use Intel UHD GPU acceleration with BLAS
Expected Behavior
The GPU should be used when inferring.
Current Behavior
Here’s how I built the software:
git clone https://github.com/ggerganov/llama.cpp .
extracted the w64devkit Fortran build somewhere and copied the required OpenBLAS files into the corresponding folders
ran w64devkit.exe
cd to my llama.cpp folder
make LLAMA_OPENBLAS=1
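As a quick sanity check (assuming typical llama.cpp behavior, not something from the original report): the system_info line that main.exe prints at startup should report BLAS = 1 when the OpenBLAS build worked, e.g. something like:

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | ... | BLAS = 1 | ...

If it shows BLAS = 0, the library was not linked and the BLAS path is never taken.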
then I followed the “Intel MKL” section below:
mkdir build
cd build
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release
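One caveat worth flagging (an assumption based on Intel’s oneAPI docs, not part of the original report): icx/icpx and MKL are only found once the oneAPI environment is loaded, so the cmake step should be run from a shell where setvars.bat has been called first, e.g.:

"C:\Program Files (x86)\Intel\oneAPI\setvars.bat"

(that is the default install location; adjust if oneAPI is installed elsewhere).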
Finally, I ran the app with:
.\build\bin\Release\main.exe -m ./models/7B/ggml-model-q4_0.bin -n 128 --interactive-first --color --threads 4 --mlock
But when inferring, my iGPU sits at 2-3% while my CPU is at 70-80%. Generation is a few words per second on 7B, which is not bad for a weak Intel laptop CPU.
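For what it’s worth, none of the flags in that command request GPU offload. With a GPU-capable build (e.g. CLBlast, sketched near the end of this thread), and assuming a llama.cpp version recent enough to support the flag, layers have to be offloaded explicitly:

.\build\bin\Release\main.exe -m ./models/7B/ggml-model-q4_0.bin -n 128 --n-gpu-layers 16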
Environment and Context
- Physical hardware: Windows 11 laptop, i7-8565U (4 cores / 8 threads), 16 GB RAM, Intel UHD 620
- Operating System: Windows 11 22H2, Python 3.11.3, CMake 3.26.4
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 18 (4 by maintainers)
I think we should bring this issue back; offloading at least the prompt eval to the iGPU is very valuable.
I believe that Intel oneMKL should actually run on an Intel GPU: https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-0/offloading-onemkl-computations-onto-the-gpu.html
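For context on how that works: per the guide linked above, oneMKL’s GPU offload goes through OpenMP target offload, i.e. the BLAS call inside the program is wrapped in an #pragma omp target variant dispatch region and the program is compiled with Intel’s offload flags; llama.cpp does not do this, so nothing offloads automatically. A sketch of the compile step from that guide (untested here, hypothetical file name):

icx -fiopenmp -fopenmp-targets=spir64 -qmkl gemm_offload.c -o gemm_offload

Plain MKL calls without those pragmas and flags stay on the CPU, which matches the observation below.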
If you had a dedicated GPU, bringing down the prompt evaluation below 60s @ 1000 tokens is very much doable.
It has no VRAM; it’s just RAM being used as VRAM. The BIOS does allocate some for it, but that’s more for legacy purposes AFAIK; it will just use whatever it needs.
The good thing is that you don’t need to copy data VRAM->RAM to access it on the CPU; it’s always shared by both.
The provided Windows build with CLBlast using OpenCL should work, but I wouldn’t expect any significant performance gains from integrated graphics.
Last I checked, Intel MKL is a CPU-only library. It will not use the IGP.
Also, AFAIK the “BLAS” part is only used for prompt processing. The actual text generation uses custom code for CPUs and accelerators.
You could load the IGP with CLBlast, but it might not actually speed things up because of the extra copies. There is not really a backend specifically targeting IGPs yet.
Yeah, the documentation is a bit lacking.
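For anyone who wants to try the CLBlast route anyway, a rough sketch, assuming CLBlast and OpenCL are installed where CMake can find them (flag and variable names per the llama.cpp build options of that era):

mkdir build-clblast
cd build-clblast
cmake .. -DLLAMA_CLBLAST=ON
cmake --build . --config Release

Then point the OpenCL backend at the Intel iGPU and offload some layers:

set GGML_OPENCL_PLATFORM=Intel
.\bin\Release\main.exe -m ..\models\7B\ggml-model-q4_0.bin -n 128 --n-gpu-layers 16

As noted above, the extra host<->device copies may cancel out any gains on an iGPU, so treat this as an experiment rather than a guaranteed speedup.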