peft: RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM: size mismatch for

Hi, thanks for this awesome library.

I posted an initial query here: https://huggingface.co/databricks/dolly-v2-3b/discussions/19

Reposting below.

I’m fine-tuning dolly-v2-3b. Training is invoked with DeepSpeed, which uses this HF module.

Going by this example, my changes are:

    from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

    model = prepare_model_for_int8_training(model, use_gradient_checkpointing=gradient_checkpointing)

    # The dimension used by the LoRA update matrices
    LORA_R = 4
    # Scaling factor
    LORA_ALPHA = 16
    LORA_DROPOUT = 0.05

    # r and alpha together control the total number of final trainable parameters when using LoRA, giving you the flexibility to balance a trade-off between end performance and compute efficiency.
    config = LoraConfig(
        r=LORA_R,
        lora_alpha=LORA_ALPHA,
        lora_dropout=LORA_DROPOUT,
        bias="none",  # Specifies if the bias parameters should be trained
        task_type="CAUSAL_LM",
        # target_modules=["q", "v"],  # I tried with/without this line
    )
    model = get_peft_model(model, config)
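
A quick sanity check at this point is to print the trainable-parameter count (print_trainable_parameters is PEFT’s built-in helper):

    # With r=4, only a small fraction of the parameters should be trainable
    model.print_trainable_parameters()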

It trains successfully, and I end up with a 677 kB adapter.

Config looks OK:

from peft import PeftConfig
config = PeftConfig.from_pretrained(repo_name)

Out[19]: PeftConfig(peft_type='LORA', base_model_name_or_path='databricks/dolly-v2-3b', task_type='CAUSAL_LM', inference_mode=True)

But, when I try to use the adapter with the base model, I get an error:

from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# Load the LoRA model
inference_model = PeftModel.from_pretrained(model, repo_name)  # <-- error here
inference_model.eval()
inference_model
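
For context, if loading succeeded, inference would then look roughly like this (the prompt and generation settings are illustrative assumptions, not from my actual run):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = inference_model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))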

Error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File <command-3660940350576262>:12
      5 model = AutoModelForCausalLM.from_pretrained(
      6     config.base_model_name_or_path,
      7     device_map="auto",
      8     torch_dtype=torch.bfloat16,
      9     trust_remote_code=True,
     10 )
     11 # Load the LoRA model
---> 12 inference_model = PeftModel.from_pretrained(model, repo_name)
     13 inference_model.eval()
     14 inference_model

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/peft/peft_model.py:181, in PeftModel.from_pretrained(cls, model, model_id, adapter_name, is_trainable, **kwargs)
    179 else:
    180     model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](model, config, adapter_name)
--> 181 model.load_adapter(model_id, adapter_name, **kwargs)
    182 return model

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/peft/peft_model.py:384, in PeftModel.load_adapter(self, model_id, adapter_name, is_trainable, **kwargs)
    380 adapters_weights = torch.load(
    381     filename, map_location=torch.device("cuda" if torch.cuda.is_available() else "cpu")
    382 )
    383 # load the weights into the model
--> 384 set_peft_model_state_dict(self, adapters_weights, adapter_name=adapter_name)
    385 if (
    386     (getattr(self, "hf_device_map", None) is not None)
    387     and (len(set(self.hf_device_map.values()).intersection({"cpu", "disk"})) > 0)
    388     and len(self.peft_config) == 1
    389 ):
    390     device_map = kwargs.get("device_map", "auto")

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/peft/utils/save_and_load.py:123, in set_peft_model_state_dict(model, peft_model_state_dict, adapter_name)
    120 else:
    121     raise NotImplementedError
--> 123 model.load_state_dict(peft_model_state_dict, strict=False)
    124 if isinstance(config, PromptLearningConfig):
    125     model.prompt_encoder[adapter_name].embedding.load_state_dict(
    126         {"weight": peft_model_state_dict["prompt_embeddings"]}, strict=True
    127     )

File /databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py:1671, in Module.load_state_dict(self, state_dict, strict)
   1666         error_msgs.insert(
   1667             0, 'Missing key(s) in state_dict: {}. '.format(
   1668                 ', '.join('"{}"'.format(k) for k in missing_keys)))
   1670 if len(error_msgs) > 0:
-> 1671     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
   1672                        self.__class__.__name__, "\n\t".join(error_msgs)))
   1673 return _IncompatibleKeys(missing_keys, unexpected_keys)

The same size-mismatch message is printed for each of layers 0 to 31:

RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM: size mismatch for base_model.model.gpt_neox.layers.0.attention.query_key_value.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([7680, 4]).
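
One way to confirm that the saved adapter itself is broken, rather than the loading code, is to inspect the checkpoint’s tensor shapes directly (a minimal sketch, assuming a local copy of the adapter file):

import torch

adapter_weights = torch.load("adapter_model.bin", map_location="cpu")
for name, tensor in adapter_weights.items():
    print(name, tuple(tensor.shape))  # the lora_B entries show up here as shape (0,)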

Any pointers would be appreciated!


Most upvoted comments

Try this; it works for me:

trainer.save_model("output")
model.save_pretrained("output_lora")
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
from peft import get_peft_model_state_dict
state_dict = get_fp32_state_dict_from_zero_checkpoint("output") # already on cpu
d = get_peft_model_state_dict(model, state_dict=state_dict)
torch.save(d, "output_lora/adapter_model.bin")

It uses the trainer to save the full state dict in DeepSpeed’s own format, and then DeepSpeed’s utils to gather the weights back on the CPU. Then we can store them using torch and pretend all of this never happened 😄

The problem with using just the peft save_pretrained method is that it only stores the weights of a single CUDA device, probably the one corresponding to the first Python process. All the other tensors have an empty shape.

Edit: The above solution works when "stage3_gather_16bit_weights_on_model_save": false. Alternatively, set "stage3_gather_16bit_weights_on_model_save": true in your DeepSpeed config, and that should make the Hugging Face trainer output a usable state_dict.
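
For reference, that flag lives under zero_optimization in the DeepSpeed JSON config (a minimal fragment; all other settings omitted):

{
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}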

I’m also facing the same issue. Removing DeepSpeed solves it, but I need DeepSpeed to fine-tune larger models. If you find any solution, please share.

The adapter_model.bin isn’t available when the DeepSpeed processes exit. I erroneously pushed to the Hub from within the trainer after training, but as it’s parallelised, I only got a quarter of the full adapter.

I’ll have to dig into DeepSpeed and see if I can tell it to collate the adapters and make them available in the output dir after a run. (Perhaps I can push by rank, so I end up with 4x bin files which I can somehow merge after the fact.)

But meanwhile, I’ll try the vanilla approach you proposed.

I agree, it’s not elegant and everything should be handled in the save_pretrained method…

My current code is:

    import os
    import torch
    from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
    from peft import get_peft_model_state_dict

    trainer.train()

    trainer.save_model("output")  # will only save from the main process
    if training_args.local_rank == 0:
        print(model.state_dict().keys())
        if trainer.deepspeed and not os.path.exists("output/pytorch_model.bin"):
            print("CONVERT Deepspeed Checkpoint to FP32")
            state_dict = get_fp32_state_dict_from_zero_checkpoint("output")  # already on cpu
        else:
            print("TRY to use the model directly")
            state_dict = model.cpu().state_dict()
        print("Number of elements in the state dict", sum(p.numel() for p in state_dict.values()))
        d = get_peft_model_state_dict(model, state_dict=state_dict)

        model.save_pretrained("output_lora")
        torch.save(d, "output_lora/adapter_model.bin")

It works elegantly. Thanks a lot!