transformers: Trainer of AutoModelForSequenceClassification is saving the wrong score module (or trained parameters are in the wrong module)
System Info
- transformers version: 4.33.1
- Platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.13
- Huggingface_hub version: 0.17.1
- Safetensors version: 0.3.3
- Accelerate version: 0.22.0
- Accelerate config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: MULTI_GPU
  - mixed_precision: bf16
  - use_cpu: False
  - debug: False
  - num_processes: 2
  - machine_rank: 0
  - num_machines: 1
  - gpu_ids: 1,2
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
@younesbelkada @muellerzr @pacman100
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
The trainer setup is as follows. It saves a checkpoint every 100 steps and runs an evaluation that reports accuracy and F1.
import torch
from transformers import (
    AutoModelForSequenceClassification,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

q_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForSequenceClassification.from_pretrained(
"meta-llama/Llama-2-13b-hf",
quantization_config=q_config,
device_map="auto",
num_labels=n_labels,
)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False
peft_config = LoraConfig(
r=16,
lora_alpha=64,
lora_dropout=0.1,
bias="none",
task_type=TaskType.SEQ_CLS,
target_modules=['v_proj', 'down_proj', 'up_proj', 'q_proj', 'gate_proj', 'k_proj', 'o_proj']
)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
training_args = TrainingArguments(...)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=ds_train,
eval_dataset=ds_test,
compute_metrics=compute_metrics,
tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("final-checkpoint")
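The compute_metrics function passed to the Trainer above is not shown in the issue; a minimal sketch consistent with the described accuracy/F1 evaluation might look like this (the evaluate library and macro averaging are assumptions, not taken from the report).
# Sketch of a compute_metrics consistent with the described accuracy/F1 evaluation
# (the evaluate library and macro averaging are assumptions).
import numpy as np
import evaluate

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_metric.compute(predictions=preds, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=preds, references=labels, average="macro")["f1"]
    return {"accuracy": acc, "f1": f1}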
I load the final checkpoint as follows. Note that I have tried all the other possible ways of loading the model as well; the problem is not in the loading step.
model = AutoModelForSequenceClassification.from_pretrained(
"final-checkpoint",
device_map="auto",
num_labels=n_labels,
quantization_config=q_config
)
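For reference, one of the usual alternative loading paths is to attach the saved adapter to a freshly quantized base model; a sketch of that is below (it assumes final-checkpoint contains PEFT adapter files, which may not match the exact setup here).
# Sketch of an alternative loading path (assumption: "final-checkpoint" holds PEFT adapter files).
from peft import PeftModel

base = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=q_config,
    device_map="auto",
    num_labels=n_labels,
)
base.config.pad_token_id = tokenizer.pad_token_id
model = PeftModel.from_pretrained(base, "final-checkpoint")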
When I do inference with the reloaded model on the same test dataset used during training, the loss, F1, and accuracy are far worse than the numbers reported by the evaluation loop during training.
The modules in the trained model look like this:
(score): ModulesToSaveWrapper(
  (original_module): Linear(in_features=5120, out_features=647, bias=False)
  (modules_to_save): ModuleDict(
    (default): Linear(in_features=5120, out_features=647, bias=False)
  )
)
What is being saved to the checkpoints is score.modules_to_save.default.
If I dump score.original_module.weight to a file and load it into a model instantiated from a checkpoint, I get back the original loss and metrics.
For example (some steps skipped):
# on trained model
orig_weights = trained_model.model.score.original_module.weight.cpu().detach()
# on checkpoint model:
checkpoint_model.score.load_state_dict({"weight": orig_weights})
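Put together, the workaround looks roughly like this. It is only a sketch: trained_model is the in-memory PEFT model right after trainer.train(), checkpoint_model is the model reloaded from final-checkpoint, and the torch.save round trip is an assumption for moving the tensor between runs.
# Sketch of the full workaround (assumptions noted above).
import torch

# Training side: the usable classification head appears to live in score.original_module.
orig_weights = trained_model.model.score.original_module.weight.cpu().detach()
saved_copy = trained_model.model.score.modules_to_save["default"].weight.cpu().detach()
print(torch.allclose(orig_weights, saved_copy))  # shows whether the two copies of the head diverged
torch.save({"weight": orig_weights}, "score_original.pt")

# Inference side: overwrite the reloaded head with the dumped weights.
checkpoint_model.score.load_state_dict(torch.load("score_original.pt"))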
Expected behavior
Metrics and loss during inference from a saved checkpoint should be comparable to the in-training evaluation on the same test dataset.
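For an apples-to-apples comparison, the post-load evaluation can be run through a second Trainer with the same test split and compute_metrics; the eval-only arguments below are assumptions.
# Sketch of the comparison (the eval-only TrainingArguments values are assumptions).
eval_args = TrainingArguments(output_dir="eval-tmp", per_device_eval_batch_size=8, bf16=True)
eval_trainer = Trainer(
    model=model,                       # the model reloaded from "final-checkpoint"
    args=eval_args,
    eval_dataset=ds_test,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
print(eval_trainer.evaluate())         # should roughly match the in-training eval numbers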
About this issue
- State: open
- Created 10 months ago
- Reactions: 2
- Comments: 28 (3 by maintainers)
Here is a minimal notebook example in which I finetune TinyLlama on the MRPC sequence classification task using QLoRA, targeting all linear layers. When I load the model for inference via AutoPeftModelForSequenceClassification, everything works as expected. Please let us know whether the recent releases have fixed this issue. Attachment: tinyllama_qlora_seqcls.ipynb.zip
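For reference, the loading path mentioned above looks roughly like this (a sketch; it assumes a recent peft release and that final-checkpoint contains the adapter files).
# Sketch of loading via AutoPeftModelForSequenceClassification (recent peft assumed).
from peft import AutoPeftModelForSequenceClassification

model = AutoPeftModelForSequenceClassification.from_pretrained(
    "final-checkpoint",
    quantization_config=q_config,
    device_map="auto",
    num_labels=n_labels,  # assumption: may be redundant if the label count is already stored in the config
)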
Perhaps this post can be helpful: https://natan-katz.medium.com/codellama-classification-finetuning-28fa5546f64f
@stephenhbarlow My next task is to use the Llama2 model together with QLoRA for multi-class sequence classification. I’ll let you know about the result by tomorrow.
@pacman100 this still seems to be an issue when doing multi-label classification. I just tried with the latest versions of transformers, peft, and accelerate. I notice your notebook does binary classification, not multi-label, which could account for the discrepancy.
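For context, a multi-label head differs from the binary/multi-class setup mainly in the problem_type and in feeding float multi-hot labels; a sketch (not taken from either notebook):
# Sketch of a multi-label head (not from the notebooks above). With this problem_type the
# model uses BCEWithLogitsLoss, so labels must be float multi-hot vectors of length n_labels.
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=q_config,
    device_map="auto",
    num_labels=n_labels,
    problem_type="multi_label_classification",
)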
pinging @muellerzr and @pacman100 as it seems this issue still exists