llama-cpp-python: llama.cpp problem (GPU support)

Hello, I am a complete newbie when it comes to LLMs. I installed a GGML model in the oobabooga webui and tried to use it. It works, but only from RAM; it only uses about 0.5 GB of VRAM, and I can't find any way to change that (offload some layers to the GPU). Even passing "--n-gpu-layers 10" in the webui doesn't work. So I started searching, and one of the answers suggests these commands:

pip uninstall -y llama-cpp-python
set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir

But that doesn't work for me. After running it I get:

 [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

And it completely broke the llama folder… it uninstalled llama-cpp-python and did nothing more. I had to update the webui to fix it and download llama.cpp again, because I have no other way to reinstall it.

I also tried the compile-from-source method, but that didn't work either. When I paste CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python into CMD (or the oobabooga Windows CMD), I always get this message:

'CMAKE_ARGS' is not recognized as an internal or external command,
operable program or batch file.

or

'FORCE_CMAKE' is not recognized as an internal or external command,
operable program or batch file.

The same happens with the "make" command: it is not recognized, even though I have make and CMake installed.
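For what it's worth, the VAR=value prefix used in that pip command is Unix shell syntax, which Windows CMD does not understand; that is why CMAKE_ARGS and FORCE_CMAKE come back as "not recognized". One workaround sketch, assuming it is run with the same Python environment the webui uses, is to set the variables from Python itself and invoke pip programmatically:

import os, subprocess, sys

# Workaround sketch: export the build flags for this process only, then call pip,
# avoiding the Unix-style "VAR=value command" prefix that CMD rejects.
os.environ["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=on"
os.environ["FORCE_CMAKE"] = "1"
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "--force-reinstall", "--no-cache-dir", "llama-cpp-python",
])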

Also, when I launch the webui and select a GGML model, I get something like this in the console:

llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.14 MB
llama_model_load_internal: mem required = 19712.68 MB (+ 3124.00 MB per state)
llama_new_context_with_model: kv self size = 3120.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
2023-07-19 23:05:22 INFO:Loaded the model in 8.17 seconds.

I am using Windows and an Nvidia card.

Is there an easy way to enable offloading layers to the GPU that doesn't require installing a ton of stuff?


Most upvoted comments

To build libllama.so with GPU support you need to have the CUDA SDK installed, then:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
export CUDA_HOME=/your/cuda/home/path/here
export PATH=${CUDA_HOME}/bin:$PATH
export LLAMA_CUBLAS=on
make clean
make libllama.so

Note that the g++ compiler will now add the -DGGML_USE_CUBLAS compiler flag, and the build will create a file called libllama.so in the current directory. Check it with:

ls -l libllama.so

After that you can force llama-cpp-python to use that lib with:

export LLAMA_CPP_LIB=/path/to/your/libllama.so

After that, it worked with GPU support here. Of course, you have to initialize your model with something like:

llm = Llama(
    ...
    n_gpu_layers=20,
    ...
)

Hope it helps.
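For reference, here is a slightly fuller sketch of that last step, assuming LLAMA_CPP_LIB has been exported as above; the model path, layer count and prompt below are placeholders, not values from this thread:

# Minimal usage sketch (placeholder paths/values; adjust for your model and VRAM).
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/your/model-q4_0.bin",  # hypothetical GGML model file
    n_gpu_layers=20,   # how many layers to offload to VRAM
    n_ctx=2048,        # context window size
    verbose=True,      # the load log should report "offloaded X/Y layers to GPU"
)

# Quick smoke test: generate a few tokens.
out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])

With the cuBLAS library loaded, the verbose output should also show BLAS = 1 instead of the BLAS = 0 from the log in the question.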

Thanks @glaudiston, it's working perfectly.

I followed all these steps, but I am facing this issue. I am using llama-cpp-python from LangChain.

export LLAMA_CPP_LIB=/path/to/your/libllama.so

RuntimeError: Failed to load shared library '/home/vasant/pythonV/stream/final/final_bot/llama.cpp/libllama.so': /home/vasant/pythonV/stream/final/final_bot/llama.cpp/libllama.so: undefined symbol: ggml_cuda_assign_buffers_force_inplace

I am using Ubuntu 22.04. Is anyone else facing the same issue?

This is probably due to a dirty build. That symbol is generated only when building with GPU support. Try a

make clean

Also make sure nvcc is in your PATH, by adding ${CUDA_HOME}/bin to your PATH environment variable, and try again.

Thanks for your kind response. I followed your advice and got it working by reinstalling llama-cpp-python with these variables:

CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCUDA_PATH=/usr/local/cuda-12.2 -DCUDAToolkit_ROOT=/usr/local/cuda-12.2 -DCUDAToolkit_INCLUDE_DIR=/usr/local/cuda-12/include -DCUDAToolkit_LIBRARY_DIR=/usr/local/cuda-12.2/lib64 -DCMAKE_CUDA_COMPILER:PATH=/usr/local/cuda/bin/nvcc" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir --verbose
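As a quick sanity check that the reinstalled wheel really was built with cuBLAS (a sketch; llama_print_system_info is part of llama-cpp-python's low-level bindings and may vary between versions), the feature string it prints should include BLAS = 1:

import llama_cpp

# Print the compile-time feature string (AVX, BLAS, ...); a cuBLAS-enabled build
# should report BLAS = 1 rather than the BLAS = 0 shown in the question's log.
print(llama_cpp.llama_print_system_info().decode("utf-8"))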

Thanks @glaudiston !!!

Well, I just wanted to run llama-cpp-python from the miniconda3 env of https://github.com/oobabooga/text-generation-webui

In that case you only need to run

export LLAMA_CPP_LIB=/yourminicondapath/miniconda3/lib/python3.10/site-packages/llama_cpp_cuda/libllama.so

before running your jupyter-notebook, ipython, python or whatever. In my case I added it to my .bashrc.

Voilà!!!!
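If editing .bashrc is not convenient, a sketch of the same idea from inside Python (the miniconda path is the same placeholder as above) is to set the variable before the import:

import os

# Assumption: path to the prebuilt CUDA library that ships with text-generation-webui;
# adjust the miniconda path and Python version to match your install.
os.environ["LLAMA_CPP_LIB"] = (
    "/yourminicondapath/miniconda3/lib/python3.10/"
    "site-packages/llama_cpp_cuda/libllama.so"
)

# Set the variable before this import: llama_cpp picks the library up at import time.
from llama_cpp import Llama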

On importing (from llama_cpp import Llama) I get:

ggml_init_cublas: found 1 CUDA devices: Device 0: NVIDIA GeForce GTX 1060, compute capability 6.1

And on

llm = Llama(model_path="/mnt/LxData/llama.cpp/models/meta-llama2/llama-2-7b-chat/ggml-model-q4_0.bin", 
            n_gpu_layers=28, n_threads=6, n_ctx=3584, n_batch=521, verbose=True), 

…
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2381.32 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 28 repeating layers to GPU
llama_model_load_internal: offloaded 28/35 layers to GPU
llama_model_load_internal: total VRAM used: 3521 MB
…

Thanks @glaudiston. The llama.cpp lib works absolutely fine with my GPU, so it's odd that the Python binding is failing.

I was able to make it work using LLAMA_CPP_LIB pointing to a libllama.so file compiled with GGML_USE_CUBLAS.

This method worked for me.

First, install using:

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
!git clone https://github.com/ggerganov/llama.cpp.git

Then install the NVIDIA CUDA toolkit again if it shows errors related to CUDA:

!sudo apt install nvidia-cuda-toolkit
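To check that the toolkit is actually visible to the build (a small sketch; nothing here is specific to this thread), you can look for nvcc from Python:

import shutil, subprocess

# If this prints None, the cuBLAS build will not find the CUDA compiler.
nvcc = shutil.which("nvcc")
print("nvcc found at:", nvcc)

if nvcc:
    # Show which toolkit version the build will pick up.
    print(subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout)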

Sorry, if I am using Windows, what procedure should I follow to be able to use the GPU with llama.cpp? I would think the procedure varies. Thank you very much for any help.

You can use WSL2 on Windows, and it should work as if you were using Linux.
