llama.cpp: [Falcon] Attempting to run Falcon-180B Q5/6 gives "illegal character"

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I’m attempting to run llama.cpp, latest master, with TheBloke’s Falcon 180B Q5/Q6 quantized GGUF models, but it errors out with “invalid character”. I’m unable to find any existing reports of this anywhere. Another system of mine hits the same problem, and a buddy’s system does as well. llama.cpp works normally with other models, such as Llama 2, WizardLM, etc.

The downloaded GGUF file works with “text-generation-webui”, so the file itself is functional and has been verified as a good copy by others in the community.

Current Behavior

$ ./main -t 8 -m ../falcon-180b-chat.Q5_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "USER: Write a story about llamas. ASSISTANT:"
# ( OR any number of parameters, just -m <model> is enough )
...
< Many Tensors >
...
llama_model_loader: - tensor  640:          blk.79.attn_norm.weight f32      [ 14848,     1,     1,     1 ]
llama_model_loader: - tensor  641:           blk.79.ffn_down.weight q6_K     [ 59392, 14848,     1,     1 ]
llama_model_loader: - tensor  642:                 output_norm.bias f32      [ 14848,     1,     1,     1 ]                                                                                                                                   
llama_model_loader: - tensor  643:               output_norm.weight f32      [ 14848,     1,     1,     1 ]                                                                                                                                   
llama_model_loader: - kv   0:                       general.architecture str                                                                                                                                                                  
llama_model_loader: - kv   1:                               general.name str                               
llama_model_loader: - kv   2:                      falcon.context_length u32                                                                                                                                                                  
llama_model_loader: - kv   3:                  falcon.tensor_data_layout str                                           
llama_model_loader: - kv   4:                    falcon.embedding_length u32                                           
llama_model_loader: - kv   5:                 falcon.feed_forward_length u32                               
llama_model_loader: - kv   6:                         falcon.block_count u32     
llama_model_loader: - kv   7:                falcon.attention.head_count u32     
llama_model_loader: - kv   8:             falcon.attention.head_count_kv u32     
llama_model_loader: - kv   9:        falcon.attention.layer_norm_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr     
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  17:               general.quantization_version u32     
llama_model_loader: - type  f32:  322 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q5_K:  201 tensors
llama_model_loader: - type q6_K:  120 tensors
error loading model: invalid character
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../falcon-180b-chat.Q5_K_M.gguf'
main: error: unable to load model

Happy to provide longer output, but it was pretty standard model shapes/sizes ahead of the loader and error.
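For what it’s worth, the file’s GGUF header can be sanity-checked with a few lines of Python. This is just a sketch assuming the GGUF v2 fixed header layout (4-byte magic `GGUF`, then little-endian u32 version and u64 tensor/KV counts); `read_gguf_header` is a hypothetical helper, not part of llama.cpp:

```python
import struct

def read_gguf_header(path):
    """Read the fixed-size GGUF header: magic, version, tensor count, KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)  # should be b"GGUF"
        # GGUF v2: u32 version, u64 tensor_count, u64 metadata_kv_count (little-endian)
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return magic, version, n_tensors, n_kv
```

On my file the magic and counts look plausible, so the problem seems to be deeper in the metadata (the tokenizer, going by where the loader stops), not simple file corruption.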

Environment and Context

Dell R740xd, 640GB RAM, Skylake Xeon Silver 4112 CPUs @ 2.60GHz, Ubuntu 20.04 (Focal).

$ git log | head -1
commit 019ba1dcd0c7775a5ac0f7442634a330eb0173cc
$ shasum -a 256 ../falcon-180b-chat.Q5_K_M.gguf 
e49e65f34b807d7cdae33d91ce8bd7610f87cd534a2d17ef965c6cf6b03bf3d8  ../falcon-180b-chat.Q5_K_M.gguf
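(Equivalent to the `shasum -a 256` call above: a small Python helper, hypothetical name, that stream-hashes the file in chunks so a 100GB+ model never has to fit in memory:)

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 one chunk at a time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```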

Please let me know if this is already known (I can’t seem to find it) and/or if I can help reproduce it somehow. Thx

About this issue

  • State: closed
  • Created 9 months ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

I tried re-converting the model and it works. We have to put a notice in the README hot topics.

Falcon 40B is working for me; here is a script that should do the trick. Make sure you have Git LFS installed.

# From the root of llama.cpp
git clone https://huggingface.co/tiiuae/falcon-40b models/falcon-40b

pip3 install -r requirements.txt
pip3 install transformers torch

# convert to gguf
python3 convert-falcon-hf-to-gguf.py models/falcon-40b

# quantize
./quantize ./models/falcon-40b/ggml-model-f16.gguf ./models/falcon-40b/ggml-model-q4_0.gguf q4_0

# Profit