localGPT: model inference is pretty slow

2023-08-20 14:20:27,502 - INFO - run_localGPT.py:180 - Running on: cuda
2023-08-20 14:20:27,502 - INFO - run_localGPT.py:181 - Display Source Documents set to: True
2023-08-20 14:20:27,690 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-08-20 14:20:30,007 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-20 14:20:30,011 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /home/shuaishao/ai/localgpt_llama2/DB
2023-08-20 14:20:30,014 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-20 14:20:30,019 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-08-20 14:20:30,046 - INFO - duckdb.py:460 - loaded in 144 embeddings
2023-08-20 14:20:30,047 - INFO - duckdb.py:472 - loaded in 1 collections
2023-08-20 14:20:30,048 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-08-20 14:20:30,048 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-08-20 14:20:30,048 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-08-20 14:20:30,048 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
llama.cpp: loading model from /home/shuaishao/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

Enter a query: please tell me the details of the second amendment 

llama_print_timings:        load time = 74359.92 ms
llama_print_timings:      sample time =    78.86 ms /   166 runs   (    0.48 ms per token,  2104.86 tokens per second)
llama_print_timings: prompt eval time = 74359.80 ms /  1109 tokens (   67.05 ms per token,    14.91 tokens per second)
llama_print_timings:        eval time = 41306.74 ms /   165 runs   (  250.34 ms per token,     3.99 tokens per second)
llama_print_timings:       total time = 116048.42 ms


> Question:
please tell me the details of the second amendment

> Answer:
 The Second Amendment to the United States Constitution states that "A well-regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed." This means that individuals have the right to own and carry firearms as part of a militia, which is a group of citizens who are trained and equipped to defend their state or country. The amendment does not explicitly prohibit the government from regulating or restricting the ownership of firearms in other contexts, such as for personal protection or hunting. However, the Supreme Court has interpreted this amendment to apply to all forms of gun ownership and use, and to limit any attempts by the government to restrict these rights.

GPU: Nvidia GeForce RTX 3060 Laptop GPU (6 GB VRAM), RAM: 16 GB

Is there any way to fix this? I thought llama.cpp was running on the GPU, but it seems it isn't. #390
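
In the load log above, BLAS = 0 and there is no "offloading ... layers to GPU" line, which usually means llama-cpp-python was built without cuBLAS, so everything runs on the CPU regardless of the device setting. The usual remedy at the time was to reinstall the binding with CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir llama-cpp-python, and then pass n_gpu_layers when the model is created. As a minimal sketch (the exact wiring inside run_localGPT.py may differ; the path and layer count here are illustrative), the "Using Llamacpp for GGML quantized models" line suggests the model is built through LangChain's LlamaCpp wrapper, which accepts these arguments:

# Sketch: loading a GGML model with partial GPU offload via LangChain's LlamaCpp wrapper.
# n_gpu_layers only takes effect if llama-cpp-python was compiled with cuBLAS (BLAS = 1).
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="llama-2-7b-chat.ggmlv3.q4_0.bin",  # illustrative; use the cached GGML file path
    n_ctx=2048,        # matches n_ctx in the log above
    max_tokens=512,
    n_gpu_layers=10,   # how many of the 32 transformer layers to offload; raise until VRAM is full
    n_batch=512,       # batch size for prompt processing; helps prompt-eval speed on GPU
    verbose=True,      # prints the llama.cpp load log, including the "offloaded N/35 layers" line
)
print(llm("What does the Second Amendment say?"))

If the load log still prints BLAS = 0 or "offloaded 0/35 layers to GPU" after this, the Python package itself needs rebuilding with cuBLAS; a 6 GB card can usually hold roughly 10-20 layers of a 7B Q4 model.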

About this issue

  • State: open
  • Created 10 months ago
  • Comments: 29 (15 by maintainers)

Most upvoted comments

So this means no layers were offloaded to the GPU, but at least it recognized the GPU now.

llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/35 layers to GPU

I’ve never worked with the webUI, and it’s really not a topic for this repo, but try:

echo "--n-gpu-layers 10" >> CMD_FLAGS.txt
bash start_linux.sh 

Let me know how that goes, and what the output of run_localGPT is.

Thanks, I saw 10 layers got offloaded to the GPU! But this shell script is for the webUI; it won't affect run_localGPT, right?

bash start_linux.sh 
2023-08-22 15:35:07 INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
2023-08-22 15:35:31 INFO:Loading TheBloke_Llama-2-7B-GGML...
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6
2023-08-22 15:35:33 INFO:llama.cpp weights detected: models/TheBloke_Llama-2-7B-GGML/llama-2-7b.ggmlv3.q4_K_M.bin
2023-08-22 15:35:33 INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models/TheBloke_Llama-2-7B-GGML/llama-2-7b.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 3112.13 MB (+ 1024.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1562 MB
llama_new_context_with_model: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
2023-08-22 15:35:35 INFO:Loaded the model in 3.78 seconds.

Output generated in 4.55 seconds (2.20 tokens/s, 10 tokens, context 66, seed 1608300543)
Llama.generate: prefix-match hit
Output generated in 17.95 seconds (5.46 tokens/s, 98 tokens, context 93, seed 975149063)
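
On the question above: the CMD_FLAGS.txt change only affects the webUI launcher; run_localGPT builds its own model object, so it needs n_gpu_layers set in its own loading code. A quick way to check the binding itself, independent of either front end, is to load the same GGML file directly with llama-cpp-python (a standalone sanity check, not part of localGPT; the path mirrors the log above):

# Standalone check: does this llama-cpp-python build actually offload layers?
# Watch the load log for "BLAS = 1" and "offloaded 10/35 layers to GPU".
from llama_cpp import Llama

llm = Llama(
    model_path="models/TheBloke_Llama-2-7B-GGML/llama-2-7b.ggmlv3.q4_K_M.bin",
    n_ctx=2048,
    n_gpu_layers=10,   # same value passed to the webUI via --n-gpu-layers
    verbose=True,      # prints the llama_model_load_internal lines shown in this thread
)
out = llm("Q: What does the Second Amendment say? A:", max_tokens=64)
print(out["choices"][0]["text"])

If this script offloads layers but run_localGPT still shows BLAS = 0, the two are likely using different Python environments, or the localGPT loader is not passing n_gpu_layers through.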