CASALIOY: Custom Model giving error - ValueError: Requested tokens exceed context window of 512
Error Stack Trace
llama.cpp: loading model from models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required = 5809.33 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size = 512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from models/ggml-vic-7b-uncensored.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 4 (mostly Q4_1, some F16)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 68.20 KB
llama_model_load_internal: mem required = 5809.34 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Enter a query: hi
llama_print_timings: load time = 2116.68 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 2109.54 ms / 2 tokens ( 1054.77 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 2118.39 ms
Traceback (most recent call last):
File "/home/user/CASALIOY/customLLM.py", line 54, in <module>
main()
File "/home/user/CASALIOY/customLLM.py", line 39, in main
res = qa(query)
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/chains/base.py", line 140, in __call__
raise e
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/chains/base.py", line 134, in __call__
self._call(inputs, run_manager=run_manager)
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/chains/retrieval_qa/base.py", line 120, in _call
answer = self.combine_documents_chain.run(
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/chains/base.py", line 239, in run
return self(kwargs, callbacks=callbacks)[self.output_keys[0]]
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/chains/base.py", line 140, in __call__
raise e
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/chains/base.py", line 134, in __call__
self._call(inputs, run_manager=run_manager)
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/chains/combine_documents/base.py", line 84, in _call
output, extra_return_dict = self.combine_docs(
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/chains/combine_documents/stuff.py", line 87, in combine_docs
return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/chains/llm.py", line 213, in predict
return self(kwargs, callbacks=callbacks)[self.output_key]
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/chains/base.py", line 140, in __call__
raise e
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/chains/base.py", line 134, in __call__
self._call(inputs, run_manager=run_manager)
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/chains/llm.py", line 69, in _call
response = self.generate([inputs], run_manager=run_manager)
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/chains/llm.py", line 79, in generate
return self.llm.generate_prompt(
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/llms/base.py", line 127, in generate_prompt
return self.generate(prompt_strings, stop=stop, callbacks=callbacks)
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/llms/base.py", line 176, in generate
raise e
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/llms/base.py", line 170, in generate
self._generate(prompts, stop=stop, run_manager=run_manager)
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/llms/base.py", line 377, in _generate
self._call(prompt, stop=stop, run_manager=run_manager)
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/llms/llamacpp.py", line 228, in _call
for token in self.stream(prompt=prompt, stop=stop, run_manager=run_manager):
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/langchain/llms/llamacpp.py", line 277, in stream
for chunk in result:
File "/root/miniconda3/envs/vicuna/lib/python3.9/site-packages/llama_cpp/llama.py", line 602, in _create_completion
raise ValueError(
ValueError: Requested tokens exceed context window of 512
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 20 (13 by maintainers)
Related: https://github.com/hwchase17/langchain/issues/2645
Quick fix: remove `n_ctx = 256, max_tokens = 256` and change `chain_type="stuff"` to `chain_type="refine"`.
Can you open a new issue and share more detail (env, prompt, document)?
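For reference, a minimal sketch of what that kind of configuration could look like in a customLLM.py-style script; the model path, `n_ctx` value, and the toy vector store below are assumptions, not the repo's actual code. The key constraint is that the prompt tokens plus `max_tokens` must fit inside `n_ctx`, and `refine` feeds retrieved documents through the model one at a time instead of stuffing them all into a single oversized prompt.

```python
# Illustrative sketch only -- the model path, n_ctx value and the toy vector
# store are assumptions, not the actual CASALIOY code.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import LlamaCpp
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

llm = LlamaCpp(
    model_path="models/ggml-vic-7b-uncensored.bin",
    n_ctx=1024,      # context window: prompt tokens + max_tokens must fit in here
    max_tokens=256,  # cap the completion so it cannot overflow the window
    temperature=0.2,
)

# Stand-in for the real document store, just so the example runs end to end.
db = Chroma.from_texts(["CASALIOY is a local LLM toolkit."], HuggingFaceEmbeddings())

# "refine" passes retrieved documents through the model one at a time instead
# of stuffing them all into one prompt, keeping each call under the limit.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="refine",
    retriever=db.as_retriever(),
)

print(qa("hi"))
```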
It seems like the error is fixed with the new release for now, but I cannot stop the model from talking on its own. How do I do that?
Btw, the original startLLM.py did not work for me; it was throwing a syntax error. So I'm using the self-modified version below.
My .env file -
I tried everything: lowered the temperature, changed `stuff` to `refine`, and so on. The model does not stop talking immediately; it outputs a self-thought chain for a large paragraph, then it stops.
This got me past that error, and then I got this error -
Also, it seems like there is no proper stop and start, so the agent is stuck in a continuous loop of Q&A until it encounters an error.
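For what it's worth, one knob that can help with the model not stopping on its own (a sketch under assumed Vicuna-style prompt markers, not a fix confirmed in this thread) is passing explicit stop sequences to the LlamaCpp wrapper, so generation halts as soon as the model starts writing its own follow-up question:

```python
# The stop strings here are guesses for a Vicuna-style prompt format; adjust
# them to whatever turn markers your prompt template actually uses.
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="models/ggml-vic-7b-uncensored.bin",
    n_ctx=1024,
    max_tokens=256,
    temperature=0.2,
    stop=["### Human:", "Question:", "\n\n\n"],  # cut off self-continuation
)

print(llm("### Human: hi\n### Assistant:"))
```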
customLLM.py might be deprecated; I won't include it in the production release. Instead, I'm adding custom-model support to the main startLLM.py with a supported version of LlamaCpp.
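A rough sketch of what wiring a custom model into startLLM.py via the .env file could look like; the variable names and defaults are illustrative, not the actual interface:

```python
# Variable names and defaults are assumptions, not the real startLLM.py interface.
import os

from dotenv import load_dotenv
from langchain.llms import LlamaCpp

load_dotenv()  # reads the project's .env file

model_path = os.environ.get("MODEL_PATH", "models/ggml-vic-7b-uncensored.bin")
model_n_ctx = int(os.environ.get("MODEL_N_CTX", "1024"))

llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, max_tokens=256)
```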
Keep me posted, and thanks for your insights. Maybe we should opt for a Docker release too.