transformers: Trainer of AutoModelForSequenceClassification is saving the wrong score module (or trained parameters are in the wrong module)
System Info
- transformers version: 4.33.1
- Platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.13
- Huggingface_hub version: 0.17.1
- Safetensors version: 0.3.3
- Accelerate version: 0.22.0
- Accelerate config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: MULTI_GPU
  - mixed_precision: bf16
  - use_cpu: False
  - debug: False
  - num_processes: 2
  - machine_rank: 0
  - num_machines: 1
  - gpu_ids: 1,2
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
@younesbelkada @muellerzr @pacman100
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
The trainer setup is as follows. It saves a checkpoint every 100 steps and runs an evaluation that reports accuracy and F1.
import torch
from transformers import (
    AutoModelForSequenceClassification,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

q_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForSequenceClassification.from_pretrained(
"meta-llama/Llama-2-13b-hf",
quantization_config=q_config,
device_map="auto",
num_labels=n_labels,
)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False
peft_config = LoraConfig(
r=16,
lora_alpha=64,
lora_dropout=0.1,
bias="none",
task_type=TaskType.SEQ_CLS,
target_modules=['v_proj', 'down_proj', 'up_proj', 'q_proj', 'gate_proj', 'k_proj', 'o_proj']
)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
training_args = TrainingArguments(...)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=ds_train,
eval_dataset=ds_test,
compute_metrics=compute_metrics,
tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("final-checkpoint")
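The compute_metrics function passed to the Trainer above is not shown in the issue; a minimal sketch consistent with the described accuracy/F1 evaluation might look like this (the evaluate library and macro averaging are assumptions, not taken from the report).
# Sketch of a compute_metrics consistent with the described accuracy/F1 evaluation
# (the evaluate library and macro averaging are assumptions).
import numpy as np
import evaluate

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_metric.compute(predictions=preds, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=preds, references=labels, average="macro")["f1"]
    return {"accuracy": acc, "f1": f1}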
I load the final checkpoint as follows. Note that I have tried all the other possible ways of loading the model as well; the problem is not in the loading step.
model = AutoModelForSequenceClassification.from_pretrained(
"final-checkpoint",
device_map="auto",
num_labels=n_labels,
quantization_config=q_config
)
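For reference, one of the usual alternative loading paths is to attach the saved adapter to a freshly quantized base model; a sketch of that is below (it assumes final-checkpoint contains PEFT adapter files, which may not match the exact setup here).
# Sketch of an alternative loading path (assumption: "final-checkpoint" holds PEFT adapter files).
from peft import PeftModel

base = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=q_config,
    device_map="auto",
    num_labels=n_labels,
)
base.config.pad_token_id = tokenizer.pad_token_id
model = PeftModel.from_pretrained(base, "final-checkpoint")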
When I do inference with the reloaded model on the same test dataset used during training, the loss, F1, and accuracy are far worse than the numbers reported by the evaluation loop during training.
The modules in the trained model look like this:
(score): ModulesToSaveWrapper(
  (original_module): Linear(in_features=5120, out_features=647, bias=False)
  (modules_to_save): ModuleDict(
    (default): Linear(in_features=5120, out_features=647, bias=False)
  )
)
What is being saved to the checkpoints is score.modules_to_save.default.
If I dump score.original_module.weight to a file and load it into a model instantiated from a checkpoint, I get back the original loss and metrics.
For example (some steps skipped):
# on trained model
orig_weights = trained_model.model.score.original_module.weight.cpu().detach()
# on checkpoint model:
checkpoint_model.score.load_state_dict({"weight": orig_weights})
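Put together, the workaround looks roughly like this. It is only a sketch: trained_model is the in-memory PEFT model right after trainer.train(), checkpoint_model is the model reloaded from final-checkpoint, and the torch.save round trip is an assumption for moving the tensor between runs.
# Sketch of the full workaround (assumptions noted above).
import torch

# Training side: the usable classification head appears to live in score.original_module.
orig_weights = trained_model.model.score.original_module.weight.cpu().detach()
saved_copy = trained_model.model.score.modules_to_save["default"].weight.cpu().detach()
print(torch.allclose(orig_weights, saved_copy))  # shows whether the two copies of the head diverged
torch.save({"weight": orig_weights}, "score_original.pt")

# Inference side: overwrite the reloaded head with the dumped weights.
checkpoint_model.score.load_state_dict(torch.load("score_original.pt"))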
Expected behavior
Metrics and loss during inference from a saved checkpoint should be comparable to the in-training evaluation on the same test dataset.
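For an apples-to-apples comparison, the post-load evaluation can be run through a second Trainer with the same test split and compute_metrics; the eval-only arguments below are assumptions.
# Sketch of the comparison (the eval-only TrainingArguments values are assumptions).
eval_args = TrainingArguments(output_dir="eval-tmp", per_device_eval_batch_size=8, bf16=True)
eval_trainer = Trainer(
    model=model,                       # the model reloaded from "final-checkpoint"
    args=eval_args,
    eval_dataset=ds_test,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
print(eval_trainer.evaluate())         # should roughly match the in-training eval numbers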
About this issue
- State: open
- Created 10 months ago
- Reactions: 2
- Comments: 28 (3 by maintainers)
Here is a minimal notebook example in which I finetune TinyLlama on the MRPC sequence classification task using QLoRA, targeting all linear layers. When I load the model for inference via AutoPeftModelForSequenceClassification, everything works as expected. Please let us know whether the recent releases have fixed this issue. Attachment: tinyllama_qlora_seqcls.ipynb.zip
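For reference, the loading path mentioned above looks roughly like this (a sketch; it assumes a recent peft release and that final-checkpoint contains the adapter files).
# Sketch of loading via AutoPeftModelForSequenceClassification (recent peft assumed).
from peft import AutoPeftModelForSequenceClassification

model = AutoPeftModelForSequenceClassification.from_pretrained(
    "final-checkpoint",
    quantization_config=q_config,
    device_map="auto",
    num_labels=n_labels,  # assumption: may be redundant if the label count is already stored in the config
)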
Perhaps this post can be helpful: https://natan-katz.medium.com/codellama-classification-finetuning-28fa5546f64f
@stephenhbarlow My next task is to use the Llama2 model together with QLoRA for multi-class sequence classification. I’ll let you know about the result by tomorrow.
@pacman100 this still seems to be an issue when doing multi-label classification. I just tried with the latest versions of transformers, peft, and accelerate. I notice your notebook does binary classification, not multi-label, which could account for the discrepancy.
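For context, a multi-label head differs from the binary/multi-class setup mainly in the problem_type and in feeding float multi-hot labels; a sketch (not taken from either notebook):
# Sketch of a multi-label head (not from the notebooks above). With this problem_type the
# model uses BCEWithLogitsLoss, so labels must be float multi-hot vectors of length n_labels.
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=q_config,
    device_map="auto",
    num_labels=n_labels,
    problem_type="multi_label_classification",
)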
pinging @muellerzr and @pacman100 as it seems this issue still exists