localGPT: model inference is pretty slow
```
2023-08-20 14:20:27,502 - INFO - run_localGPT.py:180 - Running on: cuda
2023-08-20 14:20:27,502 - INFO - run_localGPT.py:181 - Display Source Documents set to: True
2023-08-20 14:20:27,690 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length 512
2023-08-20 14:20:30,007 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-20 14:20:30,011 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /home/shuaishao/ai/localgpt_llama2/DB
2023-08-20 14:20:30,014 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-20 14:20:30,019 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-08-20 14:20:30,046 - INFO - duckdb.py:460 - loaded in 144 embeddings
2023-08-20 14:20:30,047 - INFO - duckdb.py:472 - loaded in 1 collections
2023-08-20 14:20:30,048 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-08-20 14:20:30,048 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-08-20 14:20:30,048 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-08-20 14:20:30,048 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
llama.cpp: loading model from /home/shuaishao/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
```
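Note the `BLAS = 0` flag in the capability banner above: it means this llama-cpp-python build was compiled without a GPU (cuBLAS) backend, so both prompt evaluation and generation run on the CPU regardless of the `cuda` device setting. The timings below (about 15 tokens/s prompt eval, about 4 tokens/s generation) are consistent with CPU-only inference for a 7B Q4_0 model. A sketch of the rebuild, assuming the 2023-era llama-cpp-python build flags (newer releases use different CMake options):

```shell
# Rebuild llama-cpp-python against cuBLAS (requires a working CUDA toolkit).
# Flag names follow the 2023-era llama-cpp-python docs and may have changed since.
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
  pip install --force-reinstall --no-cache-dir llama-cpp-python
```

After reinstalling, the banner printed at model load should report `BLAS = 1`.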
```
Enter a query: please tell me the details of the second amendment
llama_print_timings: load time = 74359.92 ms
llama_print_timings: sample time = 78.86 ms / 166 runs ( 0.48 ms per token, 2104.86 tokens per second)
llama_print_timings: prompt eval time = 74359.80 ms / 1109 tokens ( 67.05 ms per token, 14.91 tokens per second)
llama_print_timings: eval time = 41306.74 ms / 165 runs ( 250.34 ms per token, 3.99 tokens per second)
llama_print_timings: total time = 116048.42 ms
> Question:
please tell me the details of the second amendment
> Answer:
The Second Amendment to the United States Constitution states that "A well-regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed." This means that individuals have the right to own and carry firearms as part of a militia, which is a group of citizens who are trained and equipped to defend their state or country. The amendment does not explicitly prohibit the government from regulating or restricting the ownership of firearms in other contexts, such as for personal protection or hunting. However, the Supreme Court has interpreted this amendment to apply to all forms of gun ownership and use, and to limit any attempts by the government to restrict these rights.
```
GPU: Nvidia 3060 (6 GB), RAM: 16 GB
Is there any way to fix this? I thought llama.cpp was running on the GPU, but it seems it isn't? #390
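For context, the `Using Llamacpp for GGML quantized models` line above means localGPT loads this model through LangChain's `LlamaCpp` wrapper, which only offloads work to the GPU when `n_gpu_layers` is set (and the underlying build reports `BLAS = 1`). A minimal sketch of GPU offload with that wrapper; the exact kwargs in `run_localGPT.py` may differ:

```python
from langchain.llms import LlamaCpp

# Sketch only: adjust model_path to the locally downloaded GGML file.
llm = LlamaCpp(
    model_path="llama-2-7b-chat.ggmlv3.q4_0.bin",
    n_ctx=2048,        # matches n_ctx in the load log above
    n_batch=512,       # prompt-processing batch size
    n_gpu_layers=10,   # layers offloaded to VRAM; ~10 is a safe start for a 6 GB card
)
```

With offload active, llama.cpp reports how many layers were moved to the GPU during model load.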
Thanks, I saw 10 layers got offloaded to the GPU! But that shell script is for the webUI; it won't affect run_localGPT, right?
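A quick way to check whether the installed llama-cpp-python backend itself has GPU support, independent of any webUI launch script (a sketch, assuming the low-level bindings expose llama.cpp's `llama_print_system_info`):

```python
from llama_cpp import llama_cpp

# Prints the same capability banner llama.cpp shows at model load;
# BLAS = 1 means a cuBLAS-enabled build, BLAS = 0 means CPU-only.
print(llama_cpp.llama_print_system_info().decode())
```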