peft: modules_to_save: "ValueError: Attempting to unscale FP16 gradients"

I’m trying to fine-tune LLaMA with some added tokens, using resize_token_embeddings() and passing modules_to_save=['embed_tokens', 'lm_head'], but it seems there is some misconfiguration:

Traceback (most recent call last):
  File "/home/jonathanasdf/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/jonathanasdf/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1962, in _inner_training_loop
    self.scaler.unscale_(self.optimizer)
  File "/home/jonathanasdf/.local/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/home/jonathanasdf/.local/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
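For context, here is a minimal sketch of the kind of setup that produces this error (the model name, added tokens, and LoRA hyperparameters below are placeholders, not the exact configuration from this report):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_tokens(["<tok_a>", "<tok_b>"])  # hypothetical new tokens

# Loading in fp16 means the resized embedding/lm_head weights are fp16 as well.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.resize_token_embeddings(len(tokenizer))

# modules_to_save trains full copies of embed_tokens/lm_head alongside the LoRA layers.
peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, peft_config)

# fp16 mixed precision on top of fp16 trainable weights is what trips the unscale error in Trainer.
training_args = TrainingArguments(output_dir="out", fp16=True)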

About this issue

  • State: closed
  • Created a year ago
  • Comments: 22 (5 by maintainers)

Most upvoted comments

New idea: now the training finally works. Setting fp16=False would make training very slow and not memory-friendly.

To avoid “ValueError: Attempting to unscale FP16 gradients”, just make sure every trainable parameter is of type torch.float32. In my case this was enough:

model.base_model.model.model.embed_tokens.weight.data = model.base_model.model.model.embed_tokens.weight.data.float()
model.base_model.model.lm_head.weight.data = model.base_model.model.lm_head.weight.data.float()

It seems like a bug on the PyTorch side.

So clever, dude. Thanks for your idea.

Any updates on this issue? Still seeing this bug

Which one exactly do you mean? Note that for the case of loading the model in float16, you have to follow the advice given above.

A snippet that should work a little bit more generally:

import torch

# Cast every trainable parameter to fp32 so the GradScaler can unscale its gradients.
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.to(torch.float32)
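For what it’s worth, here is a sketch of where that cast fits in the overall flow; `model`, `train_dataset`, and the TrainingArguments values are placeholders. The cast should happen after get_peft_model() (so requires_grad is already set on the LoRA layers and the modules_to_save copies) and before the Trainer is constructed:

import torch
from transformers import Trainer, TrainingArguments

# Cast only the trainable parameters to fp32; frozen base weights stay in fp16.
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.to(torch.float32)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", fp16=True),
    train_dataset=train_dataset,
)
trainer.train()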

This is my use case test. It breaks with raise ValueError("Attempting to unscale FP16 gradients.") under the config below:

    model = AutoModelForCausalLM.from_pretrained(
        ...
        torch_dtype=torch.float16,
    )
    training_args = TrainingArguments(
        fp16=True,
        ...
    )
    peft_config = LoraConfig(
        ...
        modules_to_save=["embed_tokens", "lm_head"],
    )

There is no error for the cases below.

Case 1: modules_to_save=None

    model = AutoModelForCausalLM.from_pretrained(
        ...
        torch_dtype=torch.float16,
    )
    training_args = TrainingArguments(
        fp16=True,
        ...
    )
    peft_config = LoraConfig(
        ...
        modules_to_save=None,
    )

Case 2: load in float32

    model = AutoModelForCausalLM.from_pretrained(
        ...
        torch_dtype=torch.float32,
    )
    training_args = TrainingArguments(
        fp16=True,
        ...
    )
    peft_config = LoraConfig(
        ...
        modules_to_save=["embed_tokens", "lm_head"],
    )

Case 3: fp16=False

    model = AutoModelForCausalLM.from_pretrained(
        ...
        torch_dtype=torch.float16,
    )
    training_args = TrainingArguments(
        fp16=False,
        ...
    )
    peft_config = LoraConfig(
        ...
        modules_to_save=["embed_tokens", "lm_head"],
    )

I am confused: how should I understand the relation between torch_dtype, fp16, and modules_to_save?

Thanks for providing an example. I tried it (using OPT) and it crashed even with modules_to_save=None. Checking the dtypes of the learnable parameters, they are fp16, so the crash is expected. I’m not sure what the source of the difference is, but either way, I think it’s safe to say that when loading in fp16, it’s best to cast the trainable weights to fp32. PR #1318 will introduce a convenience function, cast_non_trainable_to_dtype, to do this quickly.
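If you want to check this on your own model, a quick diagnostic along these lines (a sketch; `model` is assumed to be the PEFT-wrapped model) will list any trainable parameters that are not in float32:

import torch

# Print every trainable parameter whose gradients the GradScaler would refuse to unscale.
for name, param in model.named_parameters():
    if param.requires_grad and param.dtype != torch.float32:
        print(f"{name}: {param.dtype}")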
