transformers: RuntimeError: unscale_() has already been called on this optimizer since the last update().

System Info

  • transformers version: 4.30.0.dev0
  • Platform: Linux-5.4.0-122-generic-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • Huggingface_hub version: 0.14.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): 2.9.2 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.6.3 (gpu)
  • Jax version: 0.4.1
  • JaxLib version: 0.4.1
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: not sure, see colab below

Who can help?

@younesbelkada @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Run the colab: https://colab.research.google.com/drive/1ARmlaZZaKyAg6HTi57psFLPeh0hDRcPX?usp=sharing#scrollTo=Duak7T_B3VpJ

@younesbelkada @pacman100 I have reinstalled from source as suggested after the fix in https://github.com/huggingface/transformers/pull/23914/files, but I still get the error. I'm on the latest commit, transformers @ git+https://github.com/huggingface/transformers.git@fabe17a726bbf6081cfbcc975d8ac451a81f3e2d, and you can tell from the stack trace that the line numbers are different (due to the changes made to fix the problem when using QLoRA). The script does not use QLoRA (as far as I know).

Am I missing something?

Expected behavior

The call to trainer.train() should succeed, but instead it raises the following exception:

│ in <cell line: 17>:17                                                                            │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1661 in train                    │
│                                                                                                  │
│   1658 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1659 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1660 │   │   )                                                                                 │
│ ❱ 1661 │   │   return inner_training_loop(                                                       │
│   1662 │   │   │   args=args,                                                                    │
│   1663 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1664 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1995 in _inner_training_loop     │
│                                                                                                  │
│   1992 │   │   │   │   │   │   │   │   args.max_grad_norm,                                       │
│   1993 │   │   │   │   │   │   │   )                                                             │
│   1994 │   │   │   │   │   │   else:                                                             │
│ ❱ 1995 │   │   │   │   │   │   │   self.accelerator.clip_grad_norm_(                             │
│   1996 │   │   │   │   │   │   │   │   model.parameters(),                                       │
│   1997 │   │   │   │   │   │   │   │   args.max_grad_norm,                                       │
│   1998 │   │   │   │   │   │   │   )                                                             │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:1817 in clip_grad_norm_        │
│                                                                                                  │
│   1814 │   │   │   # `accelerator.backward(loss)` is doing that automatically. Therefore, its i  │
│   1815 │   │   │   # We cannot return the gradient norm because DeepSpeed does it.               │
│   1816 │   │   │   return None                                                                   │
│ ❱ 1817 │   │   self.unscale_gradients()                                                          │
│   1818 │   │   return torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)  │
│   1819 │                                                                                         │
│   1820 │   def clip_grad_value_(self, parameters, clip_value):                                   │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:1780 in unscale_gradients      │
│                                                                                                  │
│   1777 │   │   │   for opt in optimizer:                                                         │
│   1778 │   │   │   │   while isinstance(opt, AcceleratedOptimizer):                              │
│   1779 │   │   │   │   │   opt = opt.optimizer                                                   │
│ ❱ 1780 │   │   │   │   self.scaler.unscale_(opt)                                                 │
│   1781 │                                                                                         │
│   1782 │   def clip_grad_norm_(self, parameters, max_norm, norm_type=2):                         │
│   1783 │   │   """                                                                               │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py:275 in unscale_            │
│                                                                                                  │
│   272 │   │   optimizer_state = self._per_optimizer_states[id(optimizer)]                        │
│   273 │   │                                                                                      │
│   274 │   │   if optimizer_state["stage"] is OptState.UNSCALED:                                  │
│ ❱ 275 │   │   │   raise RuntimeError("unscale_() has already been called on this optimizer sin   │
│   276 │   │   elif optimizer_state["stage"] is OptState.STEPPED:                                 │
│   277 │   │   │   raise RuntimeError("unscale_() is being called after step().")                 │
│   278                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: unscale_() has already been called on this optimizer since the last update().
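
For reference, the PyTorch-level check behind this message can be reproduced in isolation (a minimal sketch, independent of Trainer/Accelerate, assuming a CUDA device): calling GradScaler.unscale_() a second time on the same optimizer without an intervening scaler.step()/scaler.update() raises exactly this RuntimeError.

import torch

# Minimal sketch of the scaler state machine behind this error (assumes a CUDA device).
model = torch.nn.Linear(4, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    loss = model(torch.randn(2, 4, device="cuda")).sum()
scaler.scale(loss).backward()

scaler.unscale_(optimizer)  # first unscale_, e.g. triggered by gradient clipping
# If scaler.step(optimizer) / scaler.update() never run (the optimizer step was skipped),
# the per-optimizer state stays UNSCALED and the next unscale_ fails:
scaler.unscale_(optimizer)  # RuntimeError: unscale_() has already been called ...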

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 33 (8 by maintainers)

Most upvoted comments

Had the same issue yesterday. I kept getting the error even though I had updated the transformers library. I believe what helped was to pip uninstall transformers, restart the kernel, and then pip install -U git+https://github.com/huggingface/transformers@de9255de27abfcae4a1f816b904915f0b1e23cd9.

I am still able to reproduce this double unscale_() issue with the original stack trace.

Using Falcon-Guanaco.ipynb and making the following modifications:

  1. Add dataset = dataset.shard(num_shards=80, index=0) before constructing SFTTrainer.
  2. Change max_seq_length = 512 to max_seq_length = 1024.

After these modifications the trainer.train() call reliably fails.
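
For concreteness, the two changes look roughly like this (a sketch; the dataset and max_seq_length names follow the notebook and may differ):

# 1. Shrink the dataset before constructing SFTTrainer (notebook variable names assumed):
dataset = dataset.shard(num_shards=80, index=0)

# 2. Increase the maximum sequence length passed to SFTTrainer:
max_seq_length = 1024  # was 512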

While debugging, I see the following steps happen before the error:

  1. The step() call in accelerate/optimizer.py returns immediately because of the self.gradient_state.sync_gradients condition. As a result, the optimizer_state["stage"] is never transitioned to OptState.READY.
  2. optimizer_was_run in the call from transformers/trainer.py is (incorrectly?) set to True here.
  3. On the next iteration, the double unscale_() error is raised since we call clip_grad_norm_ every time.

Edit: the fix on my fork is working.
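
For context, in a plain Accelerate training loop the standard way to avoid step 3 above is to guard clipping on accelerator.sync_gradients, so unscale_() is only reached on iterations where the optimizer will actually step. A minimal sketch of that pattern (not the Trainer-internal fix; assumes a GPU for fp16):

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(32, 8), torch.randn(32, 1)), batch_size=2)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)
        if accelerator.sync_gradients:  # False on pure accumulation steps
            accelerator.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()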

Hello, I'm able to reproduce this. cc @muellerzr

Reason: gradient accumulation in the Trainer is counted across epochs because of total_batched_samples. However, Accelerate resets its step at the end of an epoch, which leads to sync_gradients being False and the optimizer step being skipped; the next time clip_grad_norm_ is called, it raises unscale_() has already been called on this optimizer since the last update().

def _do_sync(self):
    "Sets the right `sync_gradients` context and either resets or increases `self.step`"
    if self.gradient_state.end_of_dataloader:
        self.step = 0
        self.gradient_state._set_sync_gradients(True)
    else:
        self.step += 1
        self.gradient_state._set_sync_gradients((self.step % self.gradient_state.num_steps) == 0)
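
A toy walk-through of that counter mismatch (hypothetical numbers, mirroring the _do_sync logic above, with a simplified stand-in for the Trainer's update condition): with 4 accumulation steps and 10 batches per epoch, the never-reset Trainer counter and the per-epoch Accelerate counter disagree right after the epoch boundary, which is where the Trainer clips (and unscales) while the optimizer step is skipped.

# Hypothetical numbers: 4 gradient accumulation steps, 10 batches per epoch, 2 epochs.
num_steps = 4
batches_per_epoch = 10

total_batched_samples = 0  # Trainer-side counter: never reset across epochs
accel_step = 0             # Accelerate-side counter: reset at the end of each dataloader

for epoch in range(2):
    for i in range(batches_per_epoch):
        total_batched_samples += 1
        end_of_dataloader = i == batches_per_epoch - 1

        # Mirrors _do_sync() above.
        if end_of_dataloader:
            accel_step = 0
            sync_gradients = True
        else:
            accel_step += 1
            sync_gradients = (accel_step % num_steps) == 0

        # Simplified stand-in for the Trainer's "is this an update step?" check.
        trainer_wants_update = (total_batched_samples % num_steps) == 0
        if trainer_wants_update and not sync_gradients:
            # The Trainer clips (and unscales) here, but the optimizer step is skipped,
            # so the scaler stays UNSCALED and the next clip raises the RuntimeError.
            print(f"epoch {epoch}, batch {i}: clip/unscale_ without an optimizer step")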

Hello everyone, the above PR #24415 should resolve the issues with grad_acc around epoch boundaries.

@younesbelkada I can confirm that using the regular import works.

Thanks for all the help!

@younesbelkada I actually still receive this error when running with 4-bit quantization. Using the same installations and running this notebook, I run into the same error. Note that without QLoRA I can run just fine (i.e., the notebook linked here does work, but this notebook does not).

Any idea why this could be?

The output of pip freeze for the relevant libraries is below:

accelerate @ git+https://github.com/huggingface/accelerate.git@eba6eb79dc2ab652cd8b44b37165a4852768a8ac
bitsandbytes==0.39.0
einops==0.6.1
loralib==0.1.1
peft @ git+https://github.com/huggingface/peft.git@7fb5f90a38cb39a31396de7e638ead9ecea692af
transformers @ git+https://github.com/huggingface/transformers.git@460b844360131c99d3dd4dbd9c08545ea2e6ac9e

I confirm it works, although I have no idea why, since the code looks the same and I've used the same environment twice. Anyway, thanks for the help!