peft: modules_to_save: "ValueError: Attempting to unscale FP16 gradients"

I’m trying to fine-tune LLaMA with some added tokens, using resize_token_embeddings() and passing modules_to_save=['embed_tokens', 'lm_head'], but it seems there is some misconfiguration:

Traceback (most recent call last):
  File "/home/jonathanasdf/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/jonathanasdf/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1962, in _inner_training_loop
    self.scaler.unscale_(self.optimizer)
  File "/home/jonathanasdf/.local/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/home/jonathanasdf/.local/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
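For context, here is a minimal sketch of the kind of setup that produces this error (the model name, added tokens, and LoRA hyperparameters below are placeholders, not the exact configuration from this report):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_tokens(["<tok_a>", "<tok_b>"])  # hypothetical new tokens

# Loading in fp16 means the resized embedding/lm_head weights are fp16 as well.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.resize_token_embeddings(len(tokenizer))

# modules_to_save trains full copies of embed_tokens/lm_head alongside the LoRA layers.
peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, peft_config)

# fp16 mixed precision on top of fp16 trainable weights is what trips the unscale error in Trainer.
training_args = TrainingArguments(output_dir="out", fp16=True)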

About this issue

  • State: closed
  • Created a year ago
  • Comments: 22 (5 by maintainers)

Most upvoted comments

New idea: now the training finally works. Setting fp16=False would make training very slow and not memory-friendly.

To avoid “ValueError: Attempting to unscale FP16 gradients”, just make sure every trainable parameter is of type torch.float32. In my case this was enough:

model.base_model.model.model.embed_tokens.weight.data = model.base_model.model.model.embed_tokens.weight.data.float()
model.base_model.model.lm_head.weight.data = model.base_model.model.lm_head.weight.data.float()

It seems like a bug on the PyTorch side.

So clever, dude. Thanks for your idea.

Any updates on this issue? Still seeing this bug

Which one exactly do you mean? Note that for the case of loading the model in float16, you have to follow the advice given above.

A snippet that should work a little bit more generally:

import torch

# Cast every trainable parameter to fp32 so the GradScaler can unscale its gradients.
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.to(torch.float32)
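For what it’s worth, here is a sketch of where that cast fits in the overall flow; `model`, `train_dataset`, and the TrainingArguments values are placeholders. The cast should happen after get_peft_model() (so requires_grad is already set on the LoRA layers and the modules_to_save copies) and before the Trainer is constructed:

import torch
from transformers import Trainer, TrainingArguments

# Cast only the trainable parameters to fp32; frozen base weights stay in fp16.
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.to(torch.float32)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", fp16=True),
    train_dataset=train_dataset,
)
trainer.train()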

This is my use case test. It breaks with raise ValueError("Attempting to unscale FP16 gradients.") under the config below:

    model = AutoModelForCausalLM.from_pretrained(
        ...
        torch_dtype=torch.float16,
    )
    training_args = TrainingArguments(
        fp16=True,
        ...
    )
    peft_config = LoraConfig(
        ...
        modules_to_save=["embed_tokens", "lm_head"],
    )

There is no error for the cases below.

Case 1: modules_to_save=None

    model = AutoModelForCausalLM.from_pretrained(
        ...
        torch_dtype=torch.float16,
    )
    training_args = TrainingArguments(
        fp16=True,
        ...
    )
    peft_config = LoraConfig(
        ...
        modules_to_save=None,
    )

Case 2: load in float32

    model = AutoModelForCausalLM.from_pretrained(
        ...
        torch_dtype=torch.float32,
    )
    training_args = TrainingArguments(
        fp16=True,
        ...
    )
    peft_config = LoraConfig(
        ...
        modules_to_save=["embed_tokens", "lm_head"],
    )

Case 3: fp16=False

    model = AutoModelForCausalLM.from_pretrained(
        ...
        torch_dtype=torch.float16,
    )
    training_args = TrainingArguments(
        fp16=False,
        ...
    )
    peft_config = LoraConfig(
        ...
        modules_to_save=["embed_tokens", "lm_head"],
    )

I am confused: how should I understand the relation between torch_dtype, fp16, and modules_to_save?

Thanks for providing an example. I tried it (using OPT) and it crashed even with modules_to_save=None. Checking the dtypes of the learnable parameters, they are fp16, so the crash is expected. I’m not sure what the source of the difference is, but either way, I think it’s safe to say that when loading in fp16, it’s best to cast the trainable weights to fp32. PR #1318 will introduce a convenience function, cast_non_trainable_to_dtype, to do this quickly.
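If you want to check this on your own model, a quick diagnostic along these lines (a sketch; `model` is assumed to be the PEFT-wrapped model) will list any trainable parameters that are not in float32:

import torch

# Print every trainable parameter whose gradients the GradScaler would refuse to unscale.
for name, param in model.named_parameters():
    if param.requires_grad and param.dtype != torch.float32:
        print(f"{name}: {param.dtype}")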
