peft: model.save_pretrained() produced a corrupted adapter_model.bin (only 443 B) with alpaca-lora

I recently found that when fine-tuning with alpaca-lora, model.save_pretrained() saves an adapter_model.bin that is only 443 B.

This seems to have started after peft@75808eb2a6e7b4c3ed8aec003b6eeb30a2db1495.

Normally adapter_model.bin should be > 16 MB. When the 443 B adapter_model.bin is loaded, the model behaves as if it had not been fine-tuned at all. In contrast, loading other checkpoints from the same training run works as expected.

drwxrwxr-x 2 ubuntu ubuntu 4.0K Apr  9 12:55 .
drwxrwxr-x 7 ubuntu ubuntu 4.0K Apr  9 12:54 ..
-rw-rw-r-- 1 ubuntu ubuntu  350 Apr  9 12:55 adapter_config.json
-rw-rw-r-- 1 ubuntu ubuntu  443 Apr  9 12:55 adapter_model.bin
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr  9 12:06 checkpoint-400
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr  9 12:06 checkpoint-600
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr  9 12:07 checkpoint-800
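
For reference, a quick way to inspect what actually got saved (a minimal sketch, assuming the output path from the reproduction below):

    import torch

    # load the saved adapter weights on CPU and look at which keys survived
    sd = torch.load("lora-alpaca/adapter_model.bin", map_location="cpu")
    print(len(sd), list(sd)[:5])
    # the broken 443 B file comes out (nearly) empty, whereas a healthy adapter
    # contains the lora_A / lora_B tensors for every targeted projection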

I’m not sure whether this is an issue with peft or whether it duplicates other issues, but I’m leaving it here for reference.

I’ve been testing with multiple versions of peft:

  • 072da6d9d62 works
  • 382b178911edff38c1ff619bbac2ba556bd2276b works
  • 75808eb2a6e7b4c3ed8aec003b6eeb30a2db1495 not working
  • 445940fb7b5d38390ffb6707e2a989e89fff03b5 not working
  • 1a6151b91fcdcc25326b9807d7dbf54e091d506c not working
  • 1117d4772109a098787ce7fc297cb6cd641de6eb not working

Steps to reproduce:

conda create python=3.8 -n test
conda activate test
git clone https://github.com/tloen/alpaca-lora.git
cd alpaca-lora
pip install -r requirements.txt

# to work around AttributeError: bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats
cd /home/ubuntu/miniconda3/envs/test/lib/python3.8/site-packages/bitsandbytes/
mv libbitsandbytes_cpu.so libbitsandbytes_cpu.so.bak
cp libbitsandbytes_cuda121.so libbitsandbytes_cpu.so
cd -
conda install cudatoolkit

# alpaca_data_cleaned_first_100.json is alpaca_data_cleaned.json truncated to its first 100 items
# (a sketch for producing it follows the listing below); --val_set_size 0 is set because there is not enough data to build a validation set
python finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path '/data/datasets/alpaca_data_cleaned_first_100.json' --output_dir './lora-alpaca' --val_set_size 0
$ ls -alh lora-alpaca
total 16K
drwxrwxr-x 2 ubuntu ubuntu 4.0K Apr  9 12:55 .
drwxrwxr-x 7 ubuntu ubuntu 4.0K Apr  9 12:54 ..
-rw-rw-r-- 1 ubuntu ubuntu  350 Apr  9 12:55 adapter_config.json
-rw-rw-r-- 1 ubuntu ubuntu  443 Apr  9 12:55 adapter_model.bin

(adapter_model.bin should normally be around 16 MB)
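
For completeness, here is roughly how the truncated dataset file can be produced (a sketch; it assumes alpaca_data_cleaned.json from the alpaca-lora repo and the /data/datasets path used in the finetune command):

    import json

    # keep only the first 100 examples of the cleaned Alpaca dataset
    with open("alpaca_data_cleaned.json") as f:
        data = json.load(f)

    with open("/data/datasets/alpaca_data_cleaned_first_100.json", "w") as f:
        json.dump(data[:100], f, indent=2)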

Running on Lambda Cloud A10 instance.

Most upvoted comments

The issue is with these lines of code. They mess with the model's state_dict, so the second time it is called from the save_pretrained() method it returns None. As I understand it, you no longer have to touch state_dict outside of the library internals. Try removing them and see whether the model is saved normally:

    # alpaca-lora's finetune.py patch: replace model.state_dict with a wrapper
    # that returns only the PEFT (LoRA) weights instead of the full state dict
    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(
            self, old_state_dict()
        )
    ).__get__(model, type(model))
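
A toy illustration of why filtering twice can end up saving next to nothing (a sketch only; filter_lora is a made-up stand-in for get_peft_model_state_dict, and the key names are simplified):

    # Made-up stand-in for get_peft_model_state_dict: keep the keys belonging to
    # the "default" adapter, then strip the adapter name from them.
    def filter_lora(sd, adapter_name="default"):
        kept = {k: v for k, v in sd.items() if "lora_" in k and adapter_name in k}
        return {k.replace(f".{adapter_name}", ""): v for k, v in kept.items()}

    full = {
        "q_proj.weight": "frozen base weight",
        "q_proj.lora_A.default.weight": "trained LoRA weight",
    }
    once = filter_lora(full)   # {'q_proj.lora_A.weight': ...} -- what should be saved
    twice = filter_lora(once)  # {} -- nothing left, hence the tiny adapter_model.bin

With the patch above still in place, save_pretrained() ends up filtering a state dict that the patched state_dict has already filtered, which is how an (almost) empty ~443 B file can come about.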

I confirmed that removing those lines fixes the issue in alpaca-lora. It is probably safe to close this issue, as the cause seems to be in alpaca-lora, not here?

Hi, I commented them out and model.save_pretrained() successfully saved adapter_model.bin. However, at each eval the code now saves the complete model (including the frozen part, ~6.58 GB). Before commenting them out, the code only saved the LoRA part.

Same issue as this comment: https://github.com/tloen/alpaca-lora/issues/319#issuecomment-1505313341

Hello, the correct way to save the intermediate checkpoints for PEFT when using Trainer would be to use Callbacks. An example is shown here: https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb

import os

from transformers import Seq2SeqTrainer, TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        # save only the PEFT adapter weights into the checkpoint folder
        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        # remove the full model weights that Trainer already wrote for this checkpoint
        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control


trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    # compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
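
With this callback each checkpoint folder keeps only the adapter weights (under adapter_model/), and the full pytorch_model.bin that Trainer writes is deleted, so intermediate checkpoints stay at LoRA size rather than full-model size.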

@justinphan3110 No, I just gave up; it took me too much time to debug. At first I thought something was wrong with get_peft_model_state_dict, but that cannot explain why saving the llama-7b LoRA succeeds. I printed the first elements of the state_dict for both the base model and the LoRA weights in my training script, and found the keys were there but the values were missing (i.e., only [ ]). I guess there is some incompatibility between PEFT and ZeRO-3. I’ll just wait.

Thanks @s4rduk4r for suggesting removing the lines related to model.state_dict. I haven’t confirmed it myself, but given @richardklafter’s confirmation, and since I found that the author of alpaca-lora has also suggested removing those lines of code to fix another issue, I agree we can close this and move the discussion to why those lines were added in the first place.

Thanks man @s4rduk4r !!! You saved my day.