autotrain-advanced: Finetuned Model does not have a config.json

When we finetune an LLM using autotrain-advanced, it does not store a config.json, which makes the resulting checkpoint difficult to load.

The checkpoint contains the following files:

README.md
adapter_config.json
adapter_model.bin
optimizer.pt
pytorch_model.bin
rng_state.pth
scheduler.pt
special_tokens_map.json
tokenizer.json
tokenizer_config.json
trainer_state.json
training_args.bin

So when I load it with pipeline, or with the default Auto classes, it fails.

For example:

# use pipeline to check

import torch
from transformers import pipeline


dolly_llm = pipeline(model="/content/dolly_proj/dolly_v2/checkpoint-150", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")

Error:


OSError: dolly_v2/checkpoint-225 does not appear to have a file named config.json. Checkout 'https://huggingface.co/dolly_v2/checkpoint-225/None' for available files.

Or

import torch
from instruct_pipeline import InstructionTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dolly_v2/checkpoint-150", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("dolly_v2/checkpoint-150", device_map="auto", torch_dtype=torch.bfloat16)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

Error:

OSError: dolly_v2/checkpoint-225 does not appear to have a file named config.json. Checkout 'https://huggingface.co/dolly_v2/checkpoint-225/None' for available files.

Training Command:

!autotrain llm --train --project_name dolly_v2 --model databricks/dolly-v2-3b --data_path . --use_peft --use_int4 --learning_rate 2e-4 --train_batch_size 6 --num_train_epochs 3 --trainer sft

It should include a config.json by default.
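One way to add it manually is to save the base model's config into the checkpoint directory. A minimal sketch, assuming the paths used earlier in this issue:

from transformers import AutoConfig

# Pull the config from the base model and write a config.json into the
# finetuned checkpoint folder so from_pretrained can find it.
config = AutoConfig.from_pretrained("databricks/dolly-v2-3b")
config.save_pretrained("/content/dolly_proj/dolly_v2/checkpoint-150")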

Manually adding config.json (as sketched above) and instruct_pipeline.py from the dolly-v2 repo and then loading gives the following warning:

Some weights of GPTNeoXForCausalLM were not initialized from the model checkpoint at /dolly_v2/checkpoint-225 and are newly initialized: ['layers.12.mlp.dense_4h_to_h.bias', 'layers.0.attention.query_key_value.weight', 'layers.5.post_attention_layernorm.bias', 'layers.16.attention.dense.bias', 'layers.16.mlp.dense_h_to_4h.weight', 'layers.9.attention.query_key_value.bias', 'layers.3.mlp.dense_h_to_4h.weight', 'layers.9.post_attention_layernorm.bias', 'layers.1.attention.dense.weight', 'layers.22.attention.dense.bias', 'layers.8.attention.query_key_value.bias', 'layers.5.mlp.dense_4h_to_h.weight', 'layers.13.input_layernorm.weight', 'layers.14.attention.query_key_value.bias', 'layers.30.mlp.dense_h_to_4h.bias', 'layers.20.mlp.dense_h_to_4h.bias', 'layers.22.mlp.dense_h_to_4h.bias', 'layers.3.attention.rotary_emb.inv_freq', 'layers.22.mlp.dense_4h_to_h.bias', 'layers.23.input_layernorm.bias', 'layers.9.attention.dense.bias', 'layers.7.input_layernorm.bias', 'layers.19.mlp.dense_4h_to_h.weight', 'layers.2.attention.dense.weight', 'layers.2.input_layernorm.weight', 'layers.0.input_layernorm.bias', 'layers.25.post_attention_layernorm.bias', 'layers.6.attention.query_key_value.weight', 'layers.27.post_attention_layernorm.weight', 'layers.24.mlp.dense_h_to_4h.weight', 'layers.13.mlp.dense_h_to_4h.bias', 'layers.17.mlp.dense_4h_to_h.weight', 'layers.16.attention.dense.weight', 'layers.6.attention.query_key_value.bias', 'layers.14.post_attention_layernorm.weight', 'layers.11.mlp.dense_h_to_4h.weight', 'layers.14.mlp.dense_4h_to_h.weight', 'layers.28.post_attention_layernorm.bias', 'layers.27.attention.dense.weight', 'layers.25.input_layernorm.bias', 'layers.11.mlp.dense_h_to_4h.bias', 'layers.6.mlp.dense_4h_to_h.bias', 'layers.5.attention.query_key_value.bias', 'layers.21.attention.dense.bias', 'layers.8.post_attention_layernorm.weight', 'layers.17.attention.dense.bias', 'layers.29.mlp.dense_4h_to_h.weight', 'layers.15.mlp.dense_h_to_4h.bias', 'layers.7.input_layernorm.weight', 'layers.4.input_layernorm.bias', 'layers.20.attention.dense.bias', 'layers.15.post_attention_layernorm.weight', 'layers.23.mlp.dense_h_to_4h.weight', 'layers.3.mlp.dense_4h_to_h.weight', 'layers.4.attention.dense.weight', 'layers.12.mlp.dense_4h_to_h.weight', 'layers.27.mlp.dense_h_to_4h.weight', 'layers.3.post_attention_layernorm.weight', 'layers.28.attention.dense.bias', 'layers.23.post_attention_layernorm.bias', 'layers.23.mlp.dense_h_to_4h.bias', 'layers.0.attention.dense.bias', 'layers.10.post_attention_layernorm.bias', 'layers.24.attention.query_key_value.bias', 'layers.26.post_attention_layernorm.bias', 'layers.18.attention.dense.weight', 'layers.31.input_layernorm.bias', 'layers.16.input_layernorm.weight', 'layers.12.mlp.dense_h_to_4h.weight', 'layers.13.attention.dense.bias', 'layers.9.attention.dense.weight', 'layers.22.mlp.dense_h_to_4h.weight', 'layers.8.attention.dense.bias', 'layers.25.attention.query_key_value.bias', 'layers.12.input_layernorm.weight', 'layers.16.post_attention_layernorm.bias', 'layers.19.attention.query_key_value.bias', 'layers.0.input_layernorm.weight', 'layers.26.input_layernorm.weight', 'layers.5.input_layernorm.weight', 'layers.24.post_attention_layernorm.bias', 'layers.17.post_attention_layernorm.bias', 'layers.3.attention.dense.weight', 'layers.6.attention.dense.weight', 'layers.8.input_layernorm.bias', 'layers.17.attention.rotary_emb.inv_freq', 'layers.25.mlp.dense_4h_to_h.weight', 'layers.22.input_layernorm.bias', 'layers.12.post_attention_layernorm.bias', 
'layers.21.mlp.dense_h_to_4h.weight', 'layers.10.attention.query_key_value.weight', 'layers.2.input_layernorm.bias', 'layers.28.input_layernorm.bias', 'layers.1.post_attention_layernorm.bias', 'layers.27.mlp.dense_4h_to_h.bias', 'layers.0.post_attention_layernorm.bias', 'layers.13.input_layernorm.bias', 'layers.28.attention.query_key_value.weight', 'layers.20.attention.dense.weight', 'layers.2.mlp.dense_h_to_4h.bias', 'layers.4.post_attention_layernorm.bias', 'layers.20.attention.query_key_value.weight', 'layers.23.attention.query_key_value.bias', 'layers.21.attention.dense.weight', 'layers.21.attention.query_key_value.bias', 'layers.1.attention.rotary_emb.inv_freq', 'layers.11.attention.rotary_emb.inv_freq', 'layers.18.mlp.dense_4h_to_h.bias', 'layers.6.attention.rotary_emb.inv_freq', 'layers.13.post_attention_layernorm.bias', 'layers.0.mlp.dense_h_to_4h.bias', 'layers.26.post_attention_layernorm.weight', 'layers.10.mlp.dense_h_to_4h.bias', 'layers.16.mlp.dense_4h_to_h.weight', 'layers.20.post_attention_layernorm.bias', 'layers.30.post_attention_layernorm.weight', 'layers.12.attention.dense.weight', 'layers.3.attention.dense.bias', 'layers.28.mlp.dense_h_to_4h.bias', 'layers.16.attention.query_key_value.weight', 'layers.26.mlp.dense_4h_to_h.weight', 'layers.19.post_attention_layernorm.weight', 'layers.12.mlp.dense_h_to_4h.bias', 'layers.1.input_layernorm.bias', 'layers.26.mlp.dense_4h_to_h.bias', 'layers.12.attention.query_key_value.weight', 'layers.24.mlp.dense_h_to_4h.bias', 'layers.30.attention.query_key_value.weight', 'layers.1.mlp.dense_h_to_4h.bias', 'layers.19.input_layernorm.bias', 'layers.31.input_layernorm.weight', 'layers.3.attention.query_key_value.weight', 'layers.23.attention.query_key_value.weight', 'layers.23.attention.rotary_emb.inv_freq', 'layers.3.mlp.dense_h_to_4h.bias', 'layers.24.post_attention_layernorm.weight', 'layers.28.mlp.dense_4h_to_h.weight', 'layers.17.attention.dense.weight', 'layers.0.attention.rotary_emb.inv_freq', 'layers.23.input_layernorm.weight', 'layers.24.attention.query_key_value.weight', 'layers.8.attention.rotary_emb.inv_freq', 'layers.22.input_layernorm.weight', 'layers.10.post_attention_layernorm.weight', 'layers.18.post_attention_layernorm.weight', 'layers.8.attention.query_key_value.weight', 'layers.31.mlp.dense_h_to_4h.bias', 'layers.5.attention.rotary_emb.inv_freq', 'layers.7.attention.query_key_value.bias', 'layers.5.input_layernorm.bias', 'layers.10.mlp.dense_h_to_4h.weight', 'layers.11.input_layernorm.bias', 'layers.7.mlp.dense_4h_to_h.bias', 'layers.19.mlp.dense_h_to_4h.weight', 'layers.4.input_layernorm.weight', 'layers.13.attention.rotary_emb.inv_freq', 'layers.20.post_attention_layernorm.weight', 'layers.17.mlp.dense_4h_to_h.bias', 'layers.4.mlp.dense_h_to_4h.weight', 'layers.15.post_attention_layernorm.bias', 'layers.0.attention.query_key_value.bias', 'layers.24.attention.dense.weight', 'layers.1.mlp.dense_h_to_4h.weight', 'layers.15.input_layernorm.weight', 'layers.28.mlp.dense_4h_to_h.bias', 'layers.30.attention.dense.bias', 'layers.7.attention.query_key_value.weight', 'layers.9.attention.rotary_emb.inv_freq', 'layers.9.attention.query_key_value.weight', 'layers.28.input_layernorm.weight', 'layers.1.attention.dense.bias', 'layers.3.post_attention_layernorm.bias', 'layers.25.mlp.dense_h_to_4h.bias', 'layers.6.mlp.dense_h_to_4h.weight', 'layers.26.input_layernorm.bias', 'layers.21.mlp.dense_4h_to_h.bias', 'layers.19.input_layernorm.weight', 'layers.14.input_layernorm.bias', 'layers.8.input_layernorm.weight', 
'layers.19.attention.dense.weight', 'layers.6.input_layernorm.bias', 'layers.31.attention.query_key_value.weight', 'layers.26.attention.rotary_emb.inv_freq', 'layers.13.mlp.dense_4h_to_h.bias', 'layers.13.mlp.dense_h_to_4h.weight', 'layers.22.mlp.dense_4h_to_h.weight', 'layers.30.mlp.dense_4h_to_h.bias', 'layers.24.mlp.dense_4h_to_h.weight', 'layers.27.mlp.dense_h_to_4h.bias', 'layers.24.mlp.dense_4h_to_h.bias', 'layers.9.mlp.dense_h_to_4h.weight', 'layers.7.attention.dense.bias', 'layers.2.post_attention_layernorm.weight', 'layers.5.attention.dense.bias', 'layers.20.mlp.dense_4h_to_h.weight', 'layers.15.attention.query_key_value.weight', 'layers.7.mlp.dense_h_to_4h.bias', 'layers.0.mlp.dense_4h_to_h.weight', 'layers.11.attention.dense.bias', 'layers.7.attention.dense.weight', 'layers.16.mlp.dense_h_to_4h.bias', 'layers.29.mlp.dense_h_to_4h.weight', 'layers.28.attention.query_key_value.bias', 'layers.9.input_layernorm.bias', 'layers.15.input_layernorm.bias', 'layers.11.input_layernorm.weight', 'layers.22.attention.query_key_value.bias', 'layers.22.attention.query_key_value.weight', 'layers.26.attention.dense.weight', 'layers.2.mlp.dense_4h_to_h.weight', 'layers.15.attention.dense.weight', 'layers.26.mlp.dense_h_to_4h.weight', 'layers.31.post_attention_layernorm.bias', 'layers.19.mlp.dense_h_to_4h.bias', 'layers.17.post_attention_layernorm.weight', 'layers.30.attention.query_key_value.bias', 'layers.29.input_layernorm.bias', 'layers.18.mlp.dense_h_to_4h.bias', 'layers.21.mlp.dense_h_to_4h.bias', 'layers.25.post_attention_layernorm.weight', 'layers.14.attention.dense.weight', 'layers.15.attention.rotary_emb.inv_freq', 'layers.21.input_layernorm.weight', 'layers.0.mlp.dense_h_to_4h.weight', 'layers.12.attention.dense.bias', 'layers.2.attention.rotary_emb.inv_freq', 'layers.11.attention.query_key_value.bias', 'layers.30.mlp.dense_h_to_4h.weight', 'layers.29.post_attention_layernorm.weight', 'layers.18.attention.query_key_value.weight', 'layers.14.attention.query_key_value.weight', 'layers.7.attention.rotary_emb.inv_freq', 'layers.12.attention.rotary_emb.inv_freq', 'layers.19.attention.rotary_emb.inv_freq', 'layers.21.mlp.dense_4h_to_h.weight', 'layers.8.mlp.dense_h_to_4h.weight', 'layers.27.attention.rotary_emb.inv_freq', 'layers.7.mlp.dense_4h_to_h.weight', 'layers.8.mlp.dense_4h_to_h.bias', 'layers.14.input_layernorm.weight', 'layers.2.mlp.dense_4h_to_h.bias', 'layers.17.mlp.dense_h_to_4h.bias', 'layers.4.mlp.dense_4h_to_h.weight', 'layers.15.mlp.dense_4h_to_h.bias', 'layers.31.attention.dense.bias', 'layers.19.mlp.dense_4h_to_h.bias', 'layers.22.attention.rotary_emb.inv_freq', 'layers.8.mlp.dense_4h_to_h.weight', 'layers.10.attention.dense.bias', 'layers.19.attention.query_key_value.weight', 'layers.3.input_layernorm.bias', 'layers.10.input_layernorm.bias', 'layers.25.input_layernorm.weight', 'layers.31.attention.dense.weight', 'layers.23.attention.dense.bias', 'layers.19.attention.dense.bias', 'layers.20.mlp.dense_h_to_4h.weight', 'layers.21.post_attention_layernorm.bias', 'layers.3.attention.query_key_value.bias', 'layers.0.attention.dense.weight', 'layers.28.attention.dense.weight', 'layers.30.attention.rotary_emb.inv_freq', 'layers.8.post_attention_layernorm.bias', 'layers.6.input_layernorm.weight', 'layers.16.attention.query_key_value.bias', 'layers.15.attention.dense.bias', 'layers.30.attention.dense.weight', 'layers.17.attention.query_key_value.bias', 'layers.31.mlp.dense_4h_to_h.bias', 'layers.25.mlp.dense_4h_to_h.bias', 'layers.24.input_layernorm.bias', 
'layers.26.attention.query_key_value.weight', 'layers.29.attention.rotary_emb.inv_freq', 'layers.7.post_attention_layernorm.weight', 'layers.24.attention.rotary_emb.inv_freq', 'layers.13.post_attention_layernorm.weight', 'layers.29.mlp.dense_4h_to_h.bias', 'layers.17.input_layernorm.bias', 'layers.13.mlp.dense_4h_to_h.weight', 'layers.20.input_layernorm.bias', 'layers.0.post_attention_layernorm.weight', 'layers.31.mlp.dense_h_to_4h.weight', 'layers.14.mlp.dense_4h_to_h.bias', 'layers.27.post_attention_layernorm.bias', 'layers.18.attention.rotary_emb.inv_freq', 'layers.29.input_layernorm.weight', 'layers.24.attention.dense.bias', 'layers.7.mlp.dense_h_to_4h.weight', 'layers.18.mlp.dense_4h_to_h.weight', 'layers.21.post_attention_layernorm.weight', 'layers.30.input_layernorm.bias', 'layers.25.attention.query_key_value.weight', 'layers.1.mlp.dense_4h_to_h.weight', 'layers.8.attention.dense.weight', 'layers.10.mlp.dense_4h_to_h.weight', 'layers.1.post_attention_layernorm.weight', 'layers.2.attention.dense.bias', 'layers.15.mlp.dense_h_to_4h.weight', 'layers.29.attention.query_key_value.weight', 'layers.9.mlp.dense_4h_to_h.bias', 'layers.14.mlp.dense_h_to_4h.weight', 'layers.2.attention.query_key_value.bias', 'layers.29.mlp.dense_h_to_4h.bias', 'layers.25.mlp.dense_h_to_4h.weight', 'layers.2.mlp.dense_h_to_4h.weight', 'layers.25.attention.dense.weight', 'layers.25.attention.dense.bias', 'layers.10.input_layernorm.weight', 'layers.28.mlp.dense_h_to_4h.weight', 'layers.5.attention.dense.weight', 'layers.4.mlp.dense_4h_to_h.bias', 'embed_in.weight', 'layers.5.attention.query_key_value.weight', 'layers.4.attention.query_key_value.bias', 'layers.4.attention.dense.bias', 'layers.1.attention.query_key_value.bias', 'layers.4.mlp.dense_h_to_4h.bias', 'layers.31.post_attention_layernorm.weight', 'layers.28.post_attention_layernorm.weight', 'layers.12.attention.query_key_value.bias', 'layers.29.attention.query_key_value.bias', 'layers.31.attention.rotary_emb.inv_freq', 'layers.31.attention.query_key_value.bias', 'layers.6.attention.dense.bias', 'layers.3.mlp.dense_4h_to_h.bias', 'layers.10.mlp.dense_4h_to_h.bias', 'layers.22.post_attention_layernorm.bias', 'layers.16.post_attention_layernorm.weight', 'layers.9.mlp.dense_4h_to_h.weight', 'layers.16.input_layernorm.bias', 'layers.17.attention.query_key_value.weight', 'layers.11.attention.dense.weight', 'layers.23.post_attention_layernorm.weight', 'layers.13.attention.query_key_value.weight', 'layers.10.attention.rotary_emb.inv_freq', 'layers.25.attention.rotary_emb.inv_freq', 'layers.16.mlp.dense_4h_to_h.bias', 'layers.11.mlp.dense_4h_to_h.weight', 'layers.13.attention.dense.weight', 'embed_out.weight', 'layers.12.post_attention_layernorm.weight', 'layers.20.attention.query_key_value.bias', 'layers.21.input_layernorm.bias', 'layers.18.attention.dense.bias', 'layers.30.input_layernorm.weight', 'layers.26.attention.query_key_value.bias', 'layers.1.attention.query_key_value.weight', 'layers.26.mlp.dense_h_to_4h.bias', 'layers.5.mlp.dense_h_to_4h.weight', 'layers.22.post_attention_layernorm.weight', 'layers.6.post_attention_layernorm.bias', 'layers.6.mlp.dense_h_to_4h.bias', 'layers.19.post_attention_layernorm.bias', 'layers.17.input_layernorm.weight', 'layers.23.mlp.dense_4h_to_h.weight', 'layers.9.input_layernorm.weight', 'layers.20.mlp.dense_4h_to_h.bias', 'layers.20.attention.rotary_emb.inv_freq', 'layers.27.attention.dense.bias', 'layers.12.input_layernorm.bias', 'layers.26.attention.dense.bias', 'layers.22.attention.dense.weight', 
'layers.0.mlp.dense_4h_to_h.bias', 'layers.18.input_layernorm.bias', 'layers.23.mlp.dense_4h_to_h.bias', 'layers.3.input_layernorm.weight', 'layers.5.post_attention_layernorm.weight', 'layers.14.post_attention_layernorm.bias', 'layers.27.attention.query_key_value.weight', 'layers.27.mlp.dense_4h_to_h.weight', 'layers.2.post_attention_layernorm.bias', 'layers.29.attention.dense.bias', 'layers.14.attention.dense.bias', 'layers.8.mlp.dense_h_to_4h.bias', 'layers.4.post_attention_layernorm.weight', 'layers.14.mlp.dense_h_to_4h.bias', 'layers.11.post_attention_layernorm.bias', 'layers.23.attention.dense.weight', 'final_layer_norm.weight', 'layers.1.mlp.dense_4h_to_h.bias', 'layers.18.mlp.dense_h_to_4h.weight', 'layers.14.attention.rotary_emb.inv_freq', 'layers.21.attention.rotary_emb.inv_freq', 'layers.13.attention.query_key_value.bias', 'layers.10.attention.query_key_value.bias', 'layers.2.attention.query_key_value.weight', 'layers.18.post_attention_layernorm.bias', 'layers.20.input_layernorm.weight', 'layers.28.attention.rotary_emb.inv_freq', 'layers.16.attention.rotary_emb.inv_freq', 'layers.6.mlp.dense_4h_to_h.weight', 'layers.4.attention.rotary_emb.inv_freq', 'layers.11.attention.query_key_value.weight', 'layers.6.post_attention_layernorm.weight', 'layers.29.attention.dense.weight', 'layers.27.attention.query_key_value.bias', 'layers.11.mlp.dense_4h_to_h.bias', 'layers.5.mlp.dense_4h_to_h.bias', 'layers.11.post_attention_layernorm.weight', 'layers.10.attention.dense.weight', 'layers.15.attention.query_key_value.bias', 'layers.29.post_attention_layernorm.bias', 'layers.17.mlp.dense_h_to_4h.weight', 'layers.4.attention.query_key_value.weight', 'layers.18.attention.query_key_value.bias', 'layers.7.post_attention_layernorm.bias', 'layers.18.input_layernorm.weight', 'layers.27.input_layernorm.weight', 'layers.30.mlp.dense_4h_to_h.weight', 'final_layer_norm.bias', 'layers.15.mlp.dense_4h_to_h.weight', 'layers.1.input_layernorm.weight', 'layers.21.attention.query_key_value.weight', 'layers.9.post_attention_layernorm.weight', 'layers.31.mlp.dense_4h_to_h.weight', 'layers.5.mlp.dense_h_to_4h.bias', 'layers.30.post_attention_layernorm.bias', 'layers.24.input_layernorm.weight', 'layers.9.mlp.dense_h_to_4h.bias', 'layers.27.input_layernorm.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

And it generates random text, which follows from the warning: the checkpoint stores only the trained adapter weights, so the base model weights are freshly initialized rather than loaded.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

If you merge the weights, the config is available. If not, you can use the config from the main model.
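For reference, a minimal sketch of the merge route, assuming the base model is databricks/dolly-v2-3b and the adapter checkpoint from this issue (the output directory name is illustrative):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model the adapter was trained on.
base = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-3b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the LoRA adapter saved by autotrain, then fold its weights
# into the base model. The merged save includes a config.json.
model = PeftModel.from_pretrained(base, "dolly_v2/checkpoint-150")
merged = model.merge_and_unload()
merged.save_pretrained("dolly_v2_merged")

# Save the tokenizer alongside so the directory is self-contained.
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")
tokenizer.save_pretrained("dolly_v2_merged")

After this, pipeline(model="dolly_v2_merged", ...) should load without the config.json error.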

@TapendraBaduwal I won't advise using autotrain with free Colab. Instead, go for finetuning using bitsandbytes and SFT with code. See this notebook: https://colab.research.google.com/drive/134o_cXcMe_lsvl15ZE_4Y75Kstepsntu?usp=sharing

What can be done to avoid it crashing at this stage on Google Colab? It appears that it is running out of memory while loading the checkpoint shards.

The issue turned out to be the limited capacity of the free Colab tier, where the script was getting killed. The script ran successfully on a high-RAM V100 with Colab Pro.
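If upgrading hardware is not an option, loading the base model quantized can reduce the memory needed while the checkpoint shards are read. A hedged sketch, not from this thread, requiring the bitsandbytes package:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the base model to 4-bit on load to cut peak memory usage
# while the checkpoint shards are materialized.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-3b",
    quantization_config=bnb_config,
    device_map="auto",
)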