llama.cpp: Could not find tokenizer.model in llama2

When I ran this command:

python convert.py \
    llama2-summarizer-id-2/final_merged_checkpoint \
    --outtype f16 \
    --outfile llama2-summarizer-id-2/final_merged_checkpoint/llama2-summarizer-id-2.gguf.fp16.bin

I encountered the following error:

Loading model file llama2-summarizer-id-2/final_merged_checkpoint/model-00001-of-00002.safetensors
Loading model file llama2-summarizer-id-2/final_merged_checkpoint/model-00001-of-00002.safetensors
Loading model file llama2-summarizer-id-2/final_merged_checkpoint/model-00002-of-00002.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=32, f_norm_eps=1e-05, f_rope_freq_base=None, f_rope_scale=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('llama2-summarizer-id-2/final_merged_checkpoint'))
Traceback (most recent call last):
  File "llama.cpp/convert.py", line 1209, in <module>
    main()
  File "llama.cpp/convert.py", line 1191, in main
    vocab = load_vocab(vocab_dir, args.vocabtype)
  File "llama.cpp/convert.py", line 1092, in load_vocab
    raise FileNotFoundError(
FileNotFoundError: Could not find tokenizer.model in llama2-summarizer-id-2/final_merged_checkpoint or its parent; if it's in another directory, pass the directory as --vocab-dir

After training the llama2 model, I do not have a “tokenizer.model” file. Instead, the model directory contains the following files:

$ ls llama2-summarizer-id-2/final_merged_checkpoint/
config.json             model-00001-of-00002.safetensors  model.safetensors.index.json  tokenizer_config.json
generation_config.json  model-00002-of-00002.safetensors  special_tokens_map.json       tokenizer.json

What should I do to resolve this issue?

*note: I followed this tutorial for fine-tuning: https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/
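For reference, the fix most answers below converge on is taking tokenizer.model from the base checkpoint that was fine-tuned (for this tutorial, a Llama 2 model) and placing it next to the merged weights. A minimal sketch, assuming the base checkpoint is available locally; both directory names in the commented call are hypothetical:

```python
import shutil
from pathlib import Path

def copy_tokenizer(base_dir: str, merged_dir: str) -> str:
    """Copy tokenizer.model from the base checkpoint into the merged one."""
    src = Path(base_dir) / "tokenizer.model"
    if not src.exists():
        raise FileNotFoundError(f"no tokenizer.model in {base_dir}")
    return shutil.copy(src, Path(merged_dir) / "tokenizer.model")

# Hypothetical local paths; adjust to wherever the base model was downloaded:
# copy_tokenizer("Llama-2-7b-hf", "llama2-summarizer-id-2/final_merged_checkpoint")
```

Alternatively, convert.py can be pointed at the base checkpoint's directory with --vocab-dir, without copying anything.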

About this issue

  • State: closed
  • Created 9 months ago
  • Reactions: 1
  • Comments: 17

Most upvoted comments

Many (most) of the base models I’ve seen on Hugging Face do not have a file named tokenizer.model. So I am also having the same issue.

I have found a solution to this problem. The default --vocabtype is ‘spm’, which uses a SentencePiece tokenizer. Some models use a byte-pair encoding (BPE) tokenizer instead. To convert a BPE-based model, use this syntax:

convert.py modelname_or_path --vocabtype bpe
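Which flag applies can usually be read off the files in the checkpoint directory. A rough heuristic as a sketch (this is not convert.py's actual detection logic, just an illustration of the idea):

```python
from pathlib import Path

def pick_vocab_flag(model_dir: str) -> list:
    """Suggest extra convert.py arguments based on which tokenizer files exist."""
    d = Path(model_dir)
    if (d / "tokenizer.model").exists():
        return []                      # SentencePiece vocab: the default 'spm' works
    if (d / "tokenizer.json").exists():
        return ["--vocabtype", "bpe"]  # HF fast-tokenizer layout: try the BPE loader
    raise FileNotFoundError(f"no tokenizer files found in {model_dir}")
```

For the directory listing in the question (tokenizer.json but no tokenizer.model), this heuristic would suggest trying --vocabtype bpe.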

Just use the original one. If tokenizer.model is in a different directory, you can point convert.py at it with the --vocab-dir argument.

I am here with the same problem, trying to convert Llama 3 70B. I don’t know what is meant by “go to huggingface and search the model, download the tokenizer separated”: there is no tokenizer.model on the Llama 3 70B page, and searching for it turns up nothing. Where can I download the tokenizer for this?

He means from the base model you fine-tuned.

The Llama models (all of them) I downloaded from Meta do not have the tokenizer either. I have the same issue.

I have found a solution to this problem. The default --vocabtype is ‘spm’, which uses a SentencePiece tokenizer. Some models use a byte-pair encoding (BPE) tokenizer instead. To convert a BPE-based model, use this syntax:

convert.py modelname_or_path --vocabtype bpe

(In newer versions of convert.py, the flag is spelled --vocab-type.)

I fine-tuned the model on a different language; will it still work?

I think it would depend on whether you made changes to the vocabulary in addition to training (like adding tokens, etc). If it was just training, then I believe it would work. I’m not 100% sure about this though.
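One quick sanity check along these lines: if the fine-tuned checkpoint’s config.json reports the same vocab_size as the base model’s (32000 for stock Llama 2, as in the params line above), no tokens were added and the base tokenizer.model should line up. A sketch, with hypothetical paths in the commented comparison:

```python
import json
from pathlib import Path

def vocab_size(model_dir: str) -> int:
    """Read vocab_size from a checkpoint's config.json."""
    with open(Path(model_dir) / "config.json") as f:
        return json.load(f)["vocab_size"]

# If these match, the base model's tokenizer.model should be compatible:
# vocab_size("Llama-2-7b-hf") == vocab_size("llama2-summarizer-id-2/final_merged_checkpoint")
```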

Just use the original one. If tokenizer.model is in a different directory, you can point convert.py at it with the --vocab-dir argument.

What do you mean by “the original one”? Can you explain, please?