autotrain-advanced: Finetuned Model does not have a config.json

When we finetune an LLM using autotrain-advanced, it does not store a config.json, which makes the resulting checkpoint difficult to load.

The checkpoint contains the following files:

README.md
adapter_config.json
adapter_model.bin
optimizer.pt
pytorch_model.bin
rng_state.pth
scheduler.pt
special_tokens_map.json
tokenizer.json
tokenizer_config.json
trainer_state.json
training_args.bin

So when I load it with pipeline, or with the default Auto classes, it fails.

For example:

# use pipeline to check

import torch
from transformers import pipeline


dolly_llm = pipeline(model="/content/dolly_proj/dolly_v2/checkpoint-150", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")

Error:


OSError: dolly_v2/checkpoint-225 does not appear to have a file named config.json. Checkout 'https://huggingface.co/dolly_v2/checkpoint-225/None' for available files.

Or

import torch
from instruct_pipeline import InstructionTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dolly_v2/checkpoint-150", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("dolly_v2/checkpoint-150", device_map="auto", torch_dtype=torch.bfloat16)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

Error:

OSError: dolly_v2/checkpoint-225 does not appear to have a file named config.json. Checkout 'https://huggingface.co/dolly_v2/checkpoint-225/None' for available files.

Training Command:

!autotrain llm --train --project_name dolly_v2 --model databricks/dolly-v2-3b --data_path . --use_peft --use_int4 --learning_rate 2e-4 --train_batch_size 6 --num_train_epochs 3 --trainer sft

It should include a config.json by default.
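One way to add it manually is to save the base model's config into the checkpoint directory. A minimal sketch, assuming the paths used earlier in this issue:

from transformers import AutoConfig

# Pull the config from the base model and write a config.json into the
# finetuned checkpoint folder so from_pretrained can find it.
config = AutoConfig.from_pretrained("databricks/dolly-v2-3b")
config.save_pretrained("/content/dolly_proj/dolly_v2/checkpoint-150")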

Manually adding config.json (as sketched above) and instruct_pipeline.py from the dolly-v2 repo and then loading gives the following warning:

Some weights of GPTNeoXForCausalLM were not initialized from the model checkpoint at /dolly_v2/checkpoint-225 and are newly initialized: ['layers.12.mlp.dense_4h_to_h.bias', 'layers.0.attention.query_key_value.weight', 'layers.5.post_attention_layernorm.bias', 'layers.16.attention.dense.bias', 'layers.16.mlp.dense_h_to_4h.weight', 'layers.9.attention.query_key_value.bias', 'layers.3.mlp.dense_h_to_4h.weight', 'layers.9.post_attention_layernorm.bias', 'layers.1.attention.dense.weight', 'layers.22.attention.dense.bias', 'layers.8.attention.query_key_value.bias', 'layers.5.mlp.dense_4h_to_h.weight', 'layers.13.input_layernorm.weight', 'layers.14.attention.query_key_value.bias', 'layers.30.mlp.dense_h_to_4h.bias', 'layers.20.mlp.dense_h_to_4h.bias', 'layers.22.mlp.dense_h_to_4h.bias', 'layers.3.attention.rotary_emb.inv_freq', 'layers.22.mlp.dense_4h_to_h.bias', 'layers.23.input_layernorm.bias', 'layers.9.attention.dense.bias', 'layers.7.input_layernorm.bias', 'layers.19.mlp.dense_4h_to_h.weight', 'layers.2.attention.dense.weight', 'layers.2.input_layernorm.weight', 'layers.0.input_layernorm.bias', 'layers.25.post_attention_layernorm.bias', 'layers.6.attention.query_key_value.weight', 'layers.27.post_attention_layernorm.weight', 'layers.24.mlp.dense_h_to_4h.weight', 'layers.13.mlp.dense_h_to_4h.bias', 'layers.17.mlp.dense_4h_to_h.weight', 'layers.16.attention.dense.weight', 'layers.6.attention.query_key_value.bias', 'layers.14.post_attention_layernorm.weight', 'layers.11.mlp.dense_h_to_4h.weight', 'layers.14.mlp.dense_4h_to_h.weight', 'layers.28.post_attention_layernorm.bias', 'layers.27.attention.dense.weight', 'layers.25.input_layernorm.bias', 'layers.11.mlp.dense_h_to_4h.bias', 'layers.6.mlp.dense_4h_to_h.bias', 'layers.5.attention.query_key_value.bias', 'layers.21.attention.dense.bias', 'layers.8.post_attention_layernorm.weight', 'layers.17.attention.dense.bias', 'layers.29.mlp.dense_4h_to_h.weight', 'layers.15.mlp.dense_h_to_4h.bias', 'layers.7.input_layernorm.weight', 'layers.4.input_layernorm.bias', 'layers.20.attention.dense.bias', 'layers.15.post_attention_layernorm.weight', 'layers.23.mlp.dense_h_to_4h.weight', 'layers.3.mlp.dense_4h_to_h.weight', 'layers.4.attention.dense.weight', 'layers.12.mlp.dense_4h_to_h.weight', 'layers.27.mlp.dense_h_to_4h.weight', 'layers.3.post_attention_layernorm.weight', 'layers.28.attention.dense.bias', 'layers.23.post_attention_layernorm.bias', 'layers.23.mlp.dense_h_to_4h.bias', 'layers.0.attention.dense.bias', 'layers.10.post_attention_layernorm.bias', 'layers.24.attention.query_key_value.bias', 'layers.26.post_attention_layernorm.bias', 'layers.18.attention.dense.weight', 'layers.31.input_layernorm.bias', 'layers.16.input_layernorm.weight', 'layers.12.mlp.dense_h_to_4h.weight', 'layers.13.attention.dense.bias', 'layers.9.attention.dense.weight', 'layers.22.mlp.dense_h_to_4h.weight', 'layers.8.attention.dense.bias', 'layers.25.attention.query_key_value.bias', 'layers.12.input_layernorm.weight', 'layers.16.post_attention_layernorm.bias', 'layers.19.attention.query_key_value.bias', 'layers.0.input_layernorm.weight', 'layers.26.input_layernorm.weight', 'layers.5.input_layernorm.weight', 'layers.24.post_attention_layernorm.bias', 'layers.17.post_attention_layernorm.bias', 'layers.3.attention.dense.weight', 'layers.6.attention.dense.weight', 'layers.8.input_layernorm.bias', 'layers.17.attention.rotary_emb.inv_freq', 'layers.25.mlp.dense_4h_to_h.weight', 'layers.22.input_layernorm.bias', 'layers.12.post_attention_layernorm.bias', 
'layers.21.mlp.dense_h_to_4h.weight', 'layers.10.attention.query_key_value.weight', 'layers.2.input_layernorm.bias', 'layers.28.input_layernorm.bias', 'layers.1.post_attention_layernorm.bias', 'layers.27.mlp.dense_4h_to_h.bias', 'layers.0.post_attention_layernorm.bias', 'layers.13.input_layernorm.bias', 'layers.28.attention.query_key_value.weight', 'layers.20.attention.dense.weight', 'layers.2.mlp.dense_h_to_4h.bias', 'layers.4.post_attention_layernorm.bias', 'layers.20.attention.query_key_value.weight', 'layers.23.attention.query_key_value.bias', 'layers.21.attention.dense.weight', 'layers.21.attention.query_key_value.bias', 'layers.1.attention.rotary_emb.inv_freq', 'layers.11.attention.rotary_emb.inv_freq', 'layers.18.mlp.dense_4h_to_h.bias', 'layers.6.attention.rotary_emb.inv_freq', 'layers.13.post_attention_layernorm.bias', 'layers.0.mlp.dense_h_to_4h.bias', 'layers.26.post_attention_layernorm.weight', 'layers.10.mlp.dense_h_to_4h.bias', 'layers.16.mlp.dense_4h_to_h.weight', 'layers.20.post_attention_layernorm.bias', 'layers.30.post_attention_layernorm.weight', 'layers.12.attention.dense.weight', 'layers.3.attention.dense.bias', 'layers.28.mlp.dense_h_to_4h.bias', 'layers.16.attention.query_key_value.weight', 'layers.26.mlp.dense_4h_to_h.weight', 'layers.19.post_attention_layernorm.weight', 'layers.12.mlp.dense_h_to_4h.bias', 'layers.1.input_layernorm.bias', 'layers.26.mlp.dense_4h_to_h.bias', 'layers.12.attention.query_key_value.weight', 'layers.24.mlp.dense_h_to_4h.bias', 'layers.30.attention.query_key_value.weight', 'layers.1.mlp.dense_h_to_4h.bias', 'layers.19.input_layernorm.bias', 'layers.31.input_layernorm.weight', 'layers.3.attention.query_key_value.weight', 'layers.23.attention.query_key_value.weight', 'layers.23.attention.rotary_emb.inv_freq', 'layers.3.mlp.dense_h_to_4h.bias', 'layers.24.post_attention_layernorm.weight', 'layers.28.mlp.dense_4h_to_h.weight', 'layers.17.attention.dense.weight', 'layers.0.attention.rotary_emb.inv_freq', 'layers.23.input_layernorm.weight', 'layers.24.attention.query_key_value.weight', 'layers.8.attention.rotary_emb.inv_freq', 'layers.22.input_layernorm.weight', 'layers.10.post_attention_layernorm.weight', 'layers.18.post_attention_layernorm.weight', 'layers.8.attention.query_key_value.weight', 'layers.31.mlp.dense_h_to_4h.bias', 'layers.5.attention.rotary_emb.inv_freq', 'layers.7.attention.query_key_value.bias', 'layers.5.input_layernorm.bias', 'layers.10.mlp.dense_h_to_4h.weight', 'layers.11.input_layernorm.bias', 'layers.7.mlp.dense_4h_to_h.bias', 'layers.19.mlp.dense_h_to_4h.weight', 'layers.4.input_layernorm.weight', 'layers.13.attention.rotary_emb.inv_freq', 'layers.20.post_attention_layernorm.weight', 'layers.17.mlp.dense_4h_to_h.bias', 'layers.4.mlp.dense_h_to_4h.weight', 'layers.15.post_attention_layernorm.bias', 'layers.0.attention.query_key_value.bias', 'layers.24.attention.dense.weight', 'layers.1.mlp.dense_h_to_4h.weight', 'layers.15.input_layernorm.weight', 'layers.28.mlp.dense_4h_to_h.bias', 'layers.30.attention.dense.bias', 'layers.7.attention.query_key_value.weight', 'layers.9.attention.rotary_emb.inv_freq', 'layers.9.attention.query_key_value.weight', 'layers.28.input_layernorm.weight', 'layers.1.attention.dense.bias', 'layers.3.post_attention_layernorm.bias', 'layers.25.mlp.dense_h_to_4h.bias', 'layers.6.mlp.dense_h_to_4h.weight', 'layers.26.input_layernorm.bias', 'layers.21.mlp.dense_4h_to_h.bias', 'layers.19.input_layernorm.weight', 'layers.14.input_layernorm.bias', 'layers.8.input_layernorm.weight', 
'layers.19.attention.dense.weight', 'layers.6.input_layernorm.bias', 'layers.31.attention.query_key_value.weight', 'layers.26.attention.rotary_emb.inv_freq', 'layers.13.mlp.dense_4h_to_h.bias', 'layers.13.mlp.dense_h_to_4h.weight', 'layers.22.mlp.dense_4h_to_h.weight', 'layers.30.mlp.dense_4h_to_h.bias', 'layers.24.mlp.dense_4h_to_h.weight', 'layers.27.mlp.dense_h_to_4h.bias', 'layers.24.mlp.dense_4h_to_h.bias', 'layers.9.mlp.dense_h_to_4h.weight', 'layers.7.attention.dense.bias', 'layers.2.post_attention_layernorm.weight', 'layers.5.attention.dense.bias', 'layers.20.mlp.dense_4h_to_h.weight', 'layers.15.attention.query_key_value.weight', 'layers.7.mlp.dense_h_to_4h.bias', 'layers.0.mlp.dense_4h_to_h.weight', 'layers.11.attention.dense.bias', 'layers.7.attention.dense.weight', 'layers.16.mlp.dense_h_to_4h.bias', 'layers.29.mlp.dense_h_to_4h.weight', 'layers.28.attention.query_key_value.bias', 'layers.9.input_layernorm.bias', 'layers.15.input_layernorm.bias', 'layers.11.input_layernorm.weight', 'layers.22.attention.query_key_value.bias', 'layers.22.attention.query_key_value.weight', 'layers.26.attention.dense.weight', 'layers.2.mlp.dense_4h_to_h.weight', 'layers.15.attention.dense.weight', 'layers.26.mlp.dense_h_to_4h.weight', 'layers.31.post_attention_layernorm.bias', 'layers.19.mlp.dense_h_to_4h.bias', 'layers.17.post_attention_layernorm.weight', 'layers.30.attention.query_key_value.bias', 'layers.29.input_layernorm.bias', 'layers.18.mlp.dense_h_to_4h.bias', 'layers.21.mlp.dense_h_to_4h.bias', 'layers.25.post_attention_layernorm.weight', 'layers.14.attention.dense.weight', 'layers.15.attention.rotary_emb.inv_freq', 'layers.21.input_layernorm.weight', 'layers.0.mlp.dense_h_to_4h.weight', 'layers.12.attention.dense.bias', 'layers.2.attention.rotary_emb.inv_freq', 'layers.11.attention.query_key_value.bias', 'layers.30.mlp.dense_h_to_4h.weight', 'layers.29.post_attention_layernorm.weight', 'layers.18.attention.query_key_value.weight', 'layers.14.attention.query_key_value.weight', 'layers.7.attention.rotary_emb.inv_freq', 'layers.12.attention.rotary_emb.inv_freq', 'layers.19.attention.rotary_emb.inv_freq', 'layers.21.mlp.dense_4h_to_h.weight', 'layers.8.mlp.dense_h_to_4h.weight', 'layers.27.attention.rotary_emb.inv_freq', 'layers.7.mlp.dense_4h_to_h.weight', 'layers.8.mlp.dense_4h_to_h.bias', 'layers.14.input_layernorm.weight', 'layers.2.mlp.dense_4h_to_h.bias', 'layers.17.mlp.dense_h_to_4h.bias', 'layers.4.mlp.dense_4h_to_h.weight', 'layers.15.mlp.dense_4h_to_h.bias', 'layers.31.attention.dense.bias', 'layers.19.mlp.dense_4h_to_h.bias', 'layers.22.attention.rotary_emb.inv_freq', 'layers.8.mlp.dense_4h_to_h.weight', 'layers.10.attention.dense.bias', 'layers.19.attention.query_key_value.weight', 'layers.3.input_layernorm.bias', 'layers.10.input_layernorm.bias', 'layers.25.input_layernorm.weight', 'layers.31.attention.dense.weight', 'layers.23.attention.dense.bias', 'layers.19.attention.dense.bias', 'layers.20.mlp.dense_h_to_4h.weight', 'layers.21.post_attention_layernorm.bias', 'layers.3.attention.query_key_value.bias', 'layers.0.attention.dense.weight', 'layers.28.attention.dense.weight', 'layers.30.attention.rotary_emb.inv_freq', 'layers.8.post_attention_layernorm.bias', 'layers.6.input_layernorm.weight', 'layers.16.attention.query_key_value.bias', 'layers.15.attention.dense.bias', 'layers.30.attention.dense.weight', 'layers.17.attention.query_key_value.bias', 'layers.31.mlp.dense_4h_to_h.bias', 'layers.25.mlp.dense_4h_to_h.bias', 'layers.24.input_layernorm.bias', 
'layers.26.attention.query_key_value.weight', 'layers.29.attention.rotary_emb.inv_freq', 'layers.7.post_attention_layernorm.weight', 'layers.24.attention.rotary_emb.inv_freq', 'layers.13.post_attention_layernorm.weight', 'layers.29.mlp.dense_4h_to_h.bias', 'layers.17.input_layernorm.bias', 'layers.13.mlp.dense_4h_to_h.weight', 'layers.20.input_layernorm.bias', 'layers.0.post_attention_layernorm.weight', 'layers.31.mlp.dense_h_to_4h.weight', 'layers.14.mlp.dense_4h_to_h.bias', 'layers.27.post_attention_layernorm.bias', 'layers.18.attention.rotary_emb.inv_freq', 'layers.29.input_layernorm.weight', 'layers.24.attention.dense.bias', 'layers.7.mlp.dense_h_to_4h.weight', 'layers.18.mlp.dense_4h_to_h.weight', 'layers.21.post_attention_layernorm.weight', 'layers.30.input_layernorm.bias', 'layers.25.attention.query_key_value.weight', 'layers.1.mlp.dense_4h_to_h.weight', 'layers.8.attention.dense.weight', 'layers.10.mlp.dense_4h_to_h.weight', 'layers.1.post_attention_layernorm.weight', 'layers.2.attention.dense.bias', 'layers.15.mlp.dense_h_to_4h.weight', 'layers.29.attention.query_key_value.weight', 'layers.9.mlp.dense_4h_to_h.bias', 'layers.14.mlp.dense_h_to_4h.weight', 'layers.2.attention.query_key_value.bias', 'layers.29.mlp.dense_h_to_4h.bias', 'layers.25.mlp.dense_h_to_4h.weight', 'layers.2.mlp.dense_h_to_4h.weight', 'layers.25.attention.dense.weight', 'layers.25.attention.dense.bias', 'layers.10.input_layernorm.weight', 'layers.28.mlp.dense_h_to_4h.weight', 'layers.5.attention.dense.weight', 'layers.4.mlp.dense_4h_to_h.bias', 'embed_in.weight', 'layers.5.attention.query_key_value.weight', 'layers.4.attention.query_key_value.bias', 'layers.4.attention.dense.bias', 'layers.1.attention.query_key_value.bias', 'layers.4.mlp.dense_h_to_4h.bias', 'layers.31.post_attention_layernorm.weight', 'layers.28.post_attention_layernorm.weight', 'layers.12.attention.query_key_value.bias', 'layers.29.attention.query_key_value.bias', 'layers.31.attention.rotary_emb.inv_freq', 'layers.31.attention.query_key_value.bias', 'layers.6.attention.dense.bias', 'layers.3.mlp.dense_4h_to_h.bias', 'layers.10.mlp.dense_4h_to_h.bias', 'layers.22.post_attention_layernorm.bias', 'layers.16.post_attention_layernorm.weight', 'layers.9.mlp.dense_4h_to_h.weight', 'layers.16.input_layernorm.bias', 'layers.17.attention.query_key_value.weight', 'layers.11.attention.dense.weight', 'layers.23.post_attention_layernorm.weight', 'layers.13.attention.query_key_value.weight', 'layers.10.attention.rotary_emb.inv_freq', 'layers.25.attention.rotary_emb.inv_freq', 'layers.16.mlp.dense_4h_to_h.bias', 'layers.11.mlp.dense_4h_to_h.weight', 'layers.13.attention.dense.weight', 'embed_out.weight', 'layers.12.post_attention_layernorm.weight', 'layers.20.attention.query_key_value.bias', 'layers.21.input_layernorm.bias', 'layers.18.attention.dense.bias', 'layers.30.input_layernorm.weight', 'layers.26.attention.query_key_value.bias', 'layers.1.attention.query_key_value.weight', 'layers.26.mlp.dense_h_to_4h.bias', 'layers.5.mlp.dense_h_to_4h.weight', 'layers.22.post_attention_layernorm.weight', 'layers.6.post_attention_layernorm.bias', 'layers.6.mlp.dense_h_to_4h.bias', 'layers.19.post_attention_layernorm.bias', 'layers.17.input_layernorm.weight', 'layers.23.mlp.dense_4h_to_h.weight', 'layers.9.input_layernorm.weight', 'layers.20.mlp.dense_4h_to_h.bias', 'layers.20.attention.rotary_emb.inv_freq', 'layers.27.attention.dense.bias', 'layers.12.input_layernorm.bias', 'layers.26.attention.dense.bias', 'layers.22.attention.dense.weight', 
'layers.0.mlp.dense_4h_to_h.bias', 'layers.18.input_layernorm.bias', 'layers.23.mlp.dense_4h_to_h.bias', 'layers.3.input_layernorm.weight', 'layers.5.post_attention_layernorm.weight', 'layers.14.post_attention_layernorm.bias', 'layers.27.attention.query_key_value.weight', 'layers.27.mlp.dense_4h_to_h.weight', 'layers.2.post_attention_layernorm.bias', 'layers.29.attention.dense.bias', 'layers.14.attention.dense.bias', 'layers.8.mlp.dense_h_to_4h.bias', 'layers.4.post_attention_layernorm.weight', 'layers.14.mlp.dense_h_to_4h.bias', 'layers.11.post_attention_layernorm.bias', 'layers.23.attention.dense.weight', 'final_layer_norm.weight', 'layers.1.mlp.dense_4h_to_h.bias', 'layers.18.mlp.dense_h_to_4h.weight', 'layers.14.attention.rotary_emb.inv_freq', 'layers.21.attention.rotary_emb.inv_freq', 'layers.13.attention.query_key_value.bias', 'layers.10.attention.query_key_value.bias', 'layers.2.attention.query_key_value.weight', 'layers.18.post_attention_layernorm.bias', 'layers.20.input_layernorm.weight', 'layers.28.attention.rotary_emb.inv_freq', 'layers.16.attention.rotary_emb.inv_freq', 'layers.6.mlp.dense_4h_to_h.weight', 'layers.4.attention.rotary_emb.inv_freq', 'layers.11.attention.query_key_value.weight', 'layers.6.post_attention_layernorm.weight', 'layers.29.attention.dense.weight', 'layers.27.attention.query_key_value.bias', 'layers.11.mlp.dense_4h_to_h.bias', 'layers.5.mlp.dense_4h_to_h.bias', 'layers.11.post_attention_layernorm.weight', 'layers.10.attention.dense.weight', 'layers.15.attention.query_key_value.bias', 'layers.29.post_attention_layernorm.bias', 'layers.17.mlp.dense_h_to_4h.weight', 'layers.4.attention.query_key_value.weight', 'layers.18.attention.query_key_value.bias', 'layers.7.post_attention_layernorm.bias', 'layers.18.input_layernorm.weight', 'layers.27.input_layernorm.weight', 'layers.30.mlp.dense_4h_to_h.weight', 'final_layer_norm.bias', 'layers.15.mlp.dense_4h_to_h.weight', 'layers.1.input_layernorm.weight', 'layers.21.attention.query_key_value.weight', 'layers.9.post_attention_layernorm.weight', 'layers.31.mlp.dense_4h_to_h.weight', 'layers.5.mlp.dense_h_to_4h.bias', 'layers.30.post_attention_layernorm.bias', 'layers.24.input_layernorm.weight', 'layers.9.mlp.dense_h_to_4h.bias', 'layers.27.input_layernorm.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

And it generates random text, which follows from the warning: the checkpoint stores only the trained adapter weights, so the base model weights are freshly initialized rather than loaded.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

If you merge the weights, the config is available. If not, you can use the config from the main model.
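For reference, a minimal sketch of the merge route, assuming the base model is databricks/dolly-v2-3b and the adapter checkpoint from this issue (the output directory name is illustrative):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model the adapter was trained on.
base = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-3b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the LoRA adapter saved by autotrain, then fold its weights
# into the base model. The merged save includes a config.json.
model = PeftModel.from_pretrained(base, "dolly_v2/checkpoint-150")
merged = model.merge_and_unload()
merged.save_pretrained("dolly_v2_merged")

# Save the tokenizer alongside so the directory is self-contained.
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")
tokenizer.save_pretrained("dolly_v2_merged")

After this, pipeline(model="dolly_v2_merged", ...) should load without the config.json error.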

@TapendraBaduwal I won't advise using autotrain with free Colab. Instead, go for finetuning using bitsandbytes and SFT with code. See this notebook: https://colab.research.google.com/drive/134o_cXcMe_lsvl15ZE_4Y75Kstepsntu?usp=sharing

What can be done to avoid it crashing at this stage on Google Colab? It appears that it is running out of memory while loading the checkpoint shards.

The issue turned out to be the limited capacity of the free Colab tier, where the script was getting killed. The script ran successfully on a high-RAM V100 with Colab Pro.
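If upgrading hardware is not an option, loading the base model quantized can reduce the memory needed while the checkpoint shards are read. A hedged sketch, not from this thread, requiring the bitsandbytes package:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the base model to 4-bit on load to cut peak memory usage
# while the checkpoint shards are materialized.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-3b",
    quantization_config=bnb_config,
    device_map="auto",
)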