private-gpt: Is privateGPT based on CPU or GPU? Why is it unbelievably slow in my case?

Does it have something to do with TensorFlow? A few things in the console messages below also seem odd:

  • It took privateGPT 51 seconds to answer a single question
  • Unable to register cuDNN/cuFFT/cuBLAS factory
  • This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.

Does that mean I'm NOT using tensorflow-gpu, but only the CPU build of TensorFlow?
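For reference, a quick way to check whether this TensorFlow install sees the GPU at all (a minimal diagnostic, not part of privateGPT):

import tensorflow as tf

# An empty list here means TensorFlow is running on CPU only.
print(tf.config.list_physical_devices("GPU"))
# False here means the installed wheel was not built with CUDA support.
print(tf.test.is_built_with_cuda())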

➜  privateGPT git:(main) ✗ python privateGPT.py
2023-08-03 15:30:51.990327: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-08-03 15:30:51.990368: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-08-03 15:30:51.990374: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-08-03 15:30:51.995080: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
Found model file at  ./models/ggml-gpt4all-j-v1.3-groovy.bin
gptj_model_load: loading model from './models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: kv self size  =  896.00 MB
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285

Enter a query: How are you man?
 I'm doing well, thank you for asking!

> Question:
How are you man?

> Answer (took 51.14 s.):
 I'm doing well, thank you for asking!

> source_documents/state_of_the_union.txt:
For more than two years, COVID-19 has impacted every decision in our lives and the life of the nation. 

And I know you’re tired, frustrated, and exhausted. 

But I also know this. 

Because of the progress we’ve made, because of your resilience and the tools we have, tonight I can say  
we are moving forward safely, back to more normal routines.  

We’ve reached a new moment in the fight against COVID-19, with severe cases down to a level not seen since last July.

> source_documents/state_of_the_union.txt:
For more than two years, COVID-19 has impacted every decision in our lives and the life of the nation. 

And I know you’re tired, frustrated, and exhausted. 

But I also know this. 

Because of the progress we’ve made, because of your resilience and the tools we have, tonight I can say  
we are moving forward safely, back to more normal routines.  

We’ve reached a new moment in the fight against COVID-19, with severe cases down to a level not seen since last July.

> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.

> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.

Enter a query: 

About this issue

  • State: closed
  • Created a year ago
  • Comments: 59

Most upvoted comments

You need to install llama-cpp-python with GPU support

https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

then add n_gpu_layers=X to https://github.com/imartinez/privateGPT/blob/main/privateGPT.py#L36

e.g.,

            llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)

I am surprised there is no env var in the Python script to dynamically set the GPU layer count (see the sketch after the output below), but these were the steps I took to get my GPU working with it. YMMV on how many layers you can get away with offloading, but I do the full 43 for llama 2 hermes 13b because I have a 3090 with 24 GB VRAM. Here is my output with all of the above applied:

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
llama.cpp: loading model from REDACTED/models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32032
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 18 (mostly Q6_K)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2136.07 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 360 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 12209 MB
llama_new_context_with_model: kv self size  =  400.00 MB

Enter a query: 
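On the missing env var point above: a rough sketch of what that could look like in privateGPT.py, reusing the model_path/model_n_ctx/model_n_batch/callbacks variables the script already defines. The MODEL_N_GPU_LAYERS name is purely illustrative, not an existing privateGPT setting:

import os
from langchain.llms import LlamaCpp

# Illustrative only: privateGPT does not define MODEL_N_GPU_LAYERS.
# Defaulting to 0 keeps everything on the CPU, as before.
model_n_gpu_layers = int(os.environ.get("MODEL_N_GPU_LAYERS", "0"))

llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, n_batch=model_n_batch,
               callbacks=callbacks, verbose=False, n_gpu_layers=model_n_gpu_layers)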

A couple of things:

  1. GPT4All is, I think, CPU only. At the top of their repo (https://github.com/nomic-ai/gpt4all) they say "Open-source assistant-style large language models that run locally on your CPU", which is great for enabling literally anyone to get in on it, but not for GPU people. I could be wrong, though; maybe there is some GPU support.
  2. If you do use a GPU, you can use ggml models with llama-cpp-python in the way I described above.

Also, if you are running into TensorFlow issues, or really any Python issues, imo start with a fresh venv (https://docs.python.org/3/library/venv.html):

# Init
cd privateGPT/
python3 -m venv venv
source venv/bin/activate
# ... this applies if you have CUDA hardware; see the llama-cpp-python README for the other ways to compile
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -r requirements.txt

# Run (notice `python`, not `python3`, now: the venv puts a new `python` command on PATH from `venv/bin`)
python privateGPT.py

# Exit venv when you are done
deactivate

# Re-activate as needed
cd privateGPT/
source venv/bin/activate
python privateGPT.py

Sorry if I created any confusion; hopefully the above is useful, at least for people on Linux. Let me know if this works or fails. Seriously though, if you have any Python issues, imo it's always better to start fresh than to try to fix things in place. venv ftw!

@jiapei100, it looks like you have n_ctx set to 512, which is way too small a context; try n_ctx=4096 in the LlamaCpp initialization step for that specific model, and set max_tokens to something like 512. Here is my line under model_type in privateGPT.py; I think I set my batch to 512 for that hermes model, but YMMV:

llm = LlamaCpp(model_path=model_path, n_ctx=4096, max_tokens=512, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)

So, how much did the speed improve after enabling the GPU? @bioshazard Can you show me the query result?

Trying https://github.com/PromtEngineer/localGPT now, as it seems to be pre-built for GPU use.

@JohnOstrowick, it looks like your llama-cpp-python was not compiled with GPU support (see the difference between my output and yours). Review my instructions above for how to force it to install with cuBLAS. Also, you might need to offload fewer layers than my 43/43 example, since you only have 4 GB VRAM; I have 24 GB, so I had room for all of them. You will need to find the sweet spot. Right now your completions are being done on the CPU.
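If you want to confirm the rebuild actually picked up cuBLAS before running privateGPT again, here is a quick hedged check (the model path is a placeholder; substitute your own):

from llama_cpp import Llama

# With a cuBLAS build, the verbose startup output should include
# "ggml_init_cublas: found N CUDA devices" and "BLAS = 1" in the system-info line;
# a CPU-only build prints neither.
llm = Llama(model_path="./models/your-model.ggmlv3.bin", n_gpu_layers=1, verbose=True)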

If you paste my exact text in, it may not do what you need. If you provide the resulting context to ChatGPT, I expect it could guide you through what is wrong with the syntax of your result. Or, if you paste the surrounding context here, I can try to take a look and figure out where the syntax error is; it might be a tab, a space, a missing colon, or something like that.

@bioshazard Thanks for your kind answer.

The problem is fixed: I changed the model to koala, and it works now.

@johndev8964 2.4 s after the Chroma DB warms up! And again, this is with nous-hermes-llama2-13b.ggmlv3.q6_K.bin, so YMMV based on the model/GPU you choose.

> Question:
what is capital

> Answer (took 2.64 s.):
 In economics, capital refers to any man-made resource used in production or investment to create further goods or services. It can include physical assets like machinery or buildings as well as financial assets such as stocks and bonds. In the context of this passage, it appears that the author is specifically discussing "capital goods," which are durable items used in production processes, such as machines, tools, and equipment.

> source_documents/Man_Economy_and_State_with_Power_and_Market_Rothbard.epub:
There is another consideration that reinforces our conclusion. Professor Lachmann has been diligently reminding us of what economists generally forget: that “capital” is not just a homogeneous blob that can be added to or subtracted from. Capital is an intricate, delicate, interweaving structure of capital goods. All of the delicate strands of this structure have to fit, and fit precisely, or else malinvestment occurs. The free market is almost an automatic mechanism for such fitting; and we