llama.cpp: Could not find tokenizer.model in llama2
When I ran this command:
python convert.py \
llama2-summarizer-id-2/final_merged_checkpoint \
--outtype f16 \
--outfile llama2-summarizer-id-2/final_merged_checkpoint/llama2-summarizer-id-2.gguf.fp16.bin
I encountered the following error:
Loading model file llama2-summarizer-id-2/final_merged_checkpoint/model-00001-of-00002.safetensors
Loading model file llama2-summarizer-id-2/final_merged_checkpoint/model-00001-of-00002.safetensors
Loading model file llama2-summarizer-id-2/final_merged_checkpoint/model-00002-of-00002.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=32, f_norm_eps=1e-05, f_rope_freq_base=None, f_rope_scale=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('llama2-summarizer-id-2/final_merged_checkpoint'))
Traceback (most recent call last):
File "llama.cpp/convert.py", line 1209, in <module>
main()
File "llama.cpp/convert.py", line 1191, in main
vocab = load_vocab(vocab_dir, args.vocabtype)
File "llama.cpp/convert.py", line 1092, in load_vocab
raise FileNotFoundError(
FileNotFoundError: Could not find tokenizer.model in llama2-summarizer-id-2/final_merged_checkpoint or its parent; if it's in another directory, pass the directory as --vocab-dir
After training the llama2 model, I do not have a “tokenizer.model” file. Instead, the model directory contains the following files:
$ ls llama2-summarizer-id-2/final_merged_checkpoint/
config.json model-00001-of-00002.safetensors model.safetensors.index.json tokenizer_config.json
generation_config.json model-00002-of-00002.safetensors special_tokens_map.json tokenizer.json
What should I do to resolve this issue?
*Note: I followed this tutorial for fine-tuning: https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/
About this issue
- State: closed
- Created 9 months ago
- Reactions: 1
- Comments: 17
Many (most) of the base models I’ve seen on Hugging Face do not have a file named tokenizer.model, so I am having the same issue.
I have found a solution to this problem. The default vocabtype is ‘spm’, which invokes a SentencePiece tokenizer, but some models use a Byte-Pair Encoding (BPE) tokenizer instead. To convert a BPE-based model, use this syntax:
python convert.py modelname_or_path --vocabtype bpe
Just use the original one. If the tokenizer.model is in a different directory, you can use the --vocab-dir argument.

I am here with the same problem, trying to convert Llama 3 70B. I don’t know what is meant by “go to Hugging Face, search for the model, and download the tokenizer separately” … there is no tokenizer.model on the Llama 3 70B page, and searching for it is turning up nothing. Where can I download the tokenizer for this?
He means from the base model you fine-tuned.
The Llama models (all of them) downloaded from Meta do not have the tokenizer either. I have the same issue.
--vocab-type
I think it would depend on whether you made changes to the vocabulary in addition to training (like adding tokens, etc). If it was just training, then I believe it would work. I’m not 100% sure about this though.
What do you mean by “the original one”? Can you explain, please?