private-gpt: Is privateGPT based on CPU or GPU? Why is it so unbelievably slow in my case?
Does it have something to do with TensorFlow? These console messages seem odd:
- It took PrivateGPT 51 seconds to answer a single question
- Unable to register cuDNN/cuFFT/cuBLAS factory
- This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
Does that mean I'm NOT using tensorflow-gpu, but only the CPU build of TensorFlow?
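One quick way to answer the tensorflow-gpu part of the question directly (assuming TensorFlow is importable at all) is to ask TensorFlow which devices it can see; an empty list means this build/driver setup has no GPU available to it:

```python
# Hedged check: list the GPUs visible to TensorFlow, or report that
# TensorFlow itself is missing from this environment.
try:
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
    print("GPUs visible to TensorFlow:", gpus or "none")
except ImportError:
    gpus = None
    print("TensorFlow is not installed in this environment")
```

Note that even if TensorFlow sees no GPU, the slow answer itself comes from the GPT-J/llama model loaded by privateGPT, which is a separate question from the TensorFlow warnings.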
➜ privateGPT git:(main) ✗ python privateGPT.py
2023-08-03 15:30:51.990327: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-08-03 15:30:51.990368: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-08-03 15:30:51.990374: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-08-03 15:30:51.995080: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
Found model file at ./models/ggml-gpt4all-j-v1.3-groovy.bin
gptj_model_load: loading model from './models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: kv self size = 896.00 MB
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285
Enter a query: How are you man?
I'm doing well, thank you for asking!
> Question:
How are you man?
> Answer (took 51.14 s.):
I'm doing well, thank you for asking!
> source_documents/state_of_the_union.txt:
For more than two years, COVID-19 has impacted every decision in our lives and the life of the nation.
And I know you’re tired, frustrated, and exhausted.
But I also know this.
Because of the progress we’ve made, because of your resilience and the tools we have, tonight I can say
we are moving forward safely, back to more normal routines.
We’ve reached a new moment in the fight against COVID-19, with severe cases down to a level not seen since last July.
> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
Enter a query:
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 59
You need to install llama-cpp-python with GPU support (https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal), then add n_gpu_layers=X to https://github.com/imartinez/privateGPT/blob/main/privateGPT.py#L36.
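The two steps above can be sketched roughly as follows. This is a hypothetical illustration, not the exact privateGPT code: it assumes langchain's LlamaCpp wrapper and a llama-cpp-python wheel built with GPU support, and n_gpu_layers is the knob you tune:

```python
# Sketch of GPU-enabled model settings (assumed names, not the literal
# privateGPT.py line). Prerequisite: rebuild llama-cpp-python with cuBLAS,
# e.g. per the linked README:
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
llm_kwargs = {
    "model_path": "./models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin",  # example model from this thread
    "n_ctx": 2048,        # context window
    "n_batch": 512,       # prompt batch size
    "n_gpu_layers": 43,   # 43 = every layer of a 13B model; lower this on small-VRAM cards
}
# llm = LlamaCpp(**llm_kwargs)  # the actual call in privateGPT.py
print(llm_kwargs["n_gpu_layers"])
```

If the rebuild worked, llama.cpp's load log should report the layers being offloaded instead of running everything on CPU.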
I am surprised there is not an env var in the python script to dynamically set GPU layers, but these were the steps I took to get my GPU using it. YMMV on the GPU layer count you can get away with offloading but I do the full 43 of llama 2 hermes 13b cuz I have a 3090 with 24G vram. Here is my output with all the above applied:
Couple of things: make sure you install llama-cpp-python in the way I described above. Also, if you are running into TensorFlow issues, or really any Python issues… imo start with a fresh venv (https://docs.python.org/3/library/venv.html). Sorry if I created any confusion; hopefully the above is useful, at least for people on Linux. lmk if this works or fails. Seriously tho, if you have any Python issues, imo it's always best to start fresh rather than try to fix anything. venv ftw!

@jiapei100, looks like you have n_ctx set to 512, so that's way too small of a context. Try n_ctx=4096 in the LlamaCpp initialization step for that specific model, and set max_tokens to something like 512. Here is my line under model_type in privateGPT.py; I think I set my batch to 512 for that hermes model, but YMMV.

So, how much did the speed improve after enabling the GPU? @bioshazard Can you show me the query result?
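The n_ctx/max_tokens advice can be sketched as a small settings tweak (hypothetical values and names, following langchain's LlamaCpp keyword arguments; your actual privateGPT.py line may differ):

```python
# Hypothetical tuning per the advice above: widen the context window and
# cap the generated completion length.
ctx_settings = {
    "n_ctx": 4096,       # context window, up from the too-small value flagged above
    "max_tokens": 512,   # cap on tokens generated per answer
    "n_batch": 512,      # batch size mentioned for the hermes model
}
# llm = LlamaCpp(model_path=model_path, **ctx_settings)
print(ctx_settings["n_ctx"], ctx_settings["max_tokens"])
```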
Trying https://github.com/PromtEngineer/localGPT now, as it seems to be pre-built for GPU use.
@JohnOstrowick looks like your llama-cpp-python was not compiled with GPU support (see the difference between my output and yours). Review my instructions above for how to force it to install with cuBLAS. Also, you might need to offload fewer layers than my 43/43 example, since you only have 4G of VRAM; I have 24G, so I had room for all those layers. You will need to find the sweet spot. Right now your completions are being done on the CPU.
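That "sweet spot" hunt can be sketched as a rough rule of thumb (my own assumption, not from the thread): start by scaling the 43-layers-on-24G example linearly by your card's VRAM, then back off if the model fails to load or you hit out-of-memory errors.

```python
def suggest_gpu_layers(vram_gb: float, total_layers: int = 43, ref_vram_gb: float = 24.0) -> int:
    """Rough starting point for n_gpu_layers: scale the 43-layers-on-24GB
    example linearly. Purely a heuristic -- lower the result if you see
    CUDA out-of-memory errors at load time."""
    return max(0, min(total_layers, int(total_layers * vram_gb / ref_vram_gb)))

print(suggest_gpu_layers(24))  # -> 43, full offload on a 24 GB card
print(suggest_gpu_layers(4))   # -> 7, a 4 GB card as in the reply above
```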
It may be that pasting my exact text in will not do what you need. If you provide the resulting context to ChatGPT, I expect it could guide you through what is wrong with the syntax of your result. Or, if you paste the surrounding context here, I can try to take a look and determine where the syntax error is. It might be a tab, a space, a missing colon, or something similar.
@bioshazard Thanks for your kind answer.
The problem is fixed. I changed model to koala. It works now.
@johndev8964 2.4 s after the Chroma DB warms up! And again, this is with nous-hermes-llama2-13b.ggmlv3.q6_K.bin, so YMMV based on the model/GPU you choose.