transformers: RuntimeError: unscale_() has already been called on this optimizer since the last update().
System Info
- transformers version: 4.30.0.dev0
- Platform: Linux-5.4.0-122-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.14.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): 2.9.2 (True)
- Flax version (CPU?/GPU?/TPU?): 0.6.3 (gpu)
- Jax version: 0.4.1
- JaxLib version: 0.4.1
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: not sure, see colab below
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Run the colab: https://colab.research.google.com/drive/1ARmlaZZaKyAg6HTi57psFLPeh0hDRcPX?usp=sharing#scrollTo=Duak7T_B3VpJ
@younesbelkada @pacman100 I have reinstalled from source as suggested after the fix in https://github.com/huggingface/transformers/pull/23914/files, but I still get the error. I'm on the latest commit (transformers @ git+https://github.com/huggingface/transformers.git@fabe17a726bbf6081cfbcc975d8ac451a81f3e2d), and you can tell from the stack trace that the line numbers are different (due to the changes made to fix the problem when using QLoRA). The script does not use QLoRA (as far as I know).
Am I missing something?
Expected behavior
The call to trainer.train() should work, but instead it produces the following exception:
│ in <cell line: 17>:17 │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1661 in train │
│ │
│ 1658 │ │ inner_training_loop = find_executable_batch_size( │
│ 1659 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1660 │ │ ) │
│ ❱ 1661 │ │ return inner_training_loop( │
│ 1662 │ │ │ args=args, │
│ 1663 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1664 │ │ │ trial=trial, │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1995 in _inner_training_loop │
│ │
│ 1992 │ │ │ │ │ │ │ │ args.max_grad_norm, │
│ 1993 │ │ │ │ │ │ │ ) │
│ 1994 │ │ │ │ │ │ else: │
│ ❱ 1995 │ │ │ │ │ │ │ self.accelerator.clip_grad_norm_( │
│ 1996 │ │ │ │ │ │ │ │ model.parameters(), │
│ 1997 │ │ │ │ │ │ │ │ args.max_grad_norm, │
│ 1998 │ │ │ │ │ │ │ ) │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:1817 in clip_grad_norm_ │
│ │
│ 1814 │ │ │ # `accelerator.backward(loss)` is doing that automatically. Therefore, its i │
│ 1815 │ │ │ # We cannot return the gradient norm because DeepSpeed does it. │
│ 1816 │ │ │ return None │
│ ❱ 1817 │ │ self.unscale_gradients() │
│ 1818 │ │ return torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type) │
│ 1819 │ │
│ 1820 │ def clip_grad_value_(self, parameters, clip_value): │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:1780 in unscale_gradients │
│ │
│ 1777 │ │ │ for opt in optimizer: │
│ 1778 │ │ │ │ while isinstance(opt, AcceleratedOptimizer): │
│ 1779 │ │ │ │ │ opt = opt.optimizer │
│ ❱ 1780 │ │ │ │ self.scaler.unscale_(opt) │
│ 1781 │ │
│ 1782 │ def clip_grad_norm_(self, parameters, max_norm, norm_type=2): │
│ 1783 │ │ """ │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py:275 in unscale_ │
│ │
│ 272 │ │ optimizer_state = self._per_optimizer_states[id(optimizer)] │
│ 273 │ │ │
│ 274 │ │ if optimizer_state["stage"] is OptState.UNSCALED: │
│ ❱ 275 │ │ │ raise RuntimeError("unscale_() has already been called on this optimizer sin │
│ 276 │ │ elif optimizer_state["stage"] is OptState.STEPPED: │
│ 277 │ │ │ raise RuntimeError("unscale_() is being called after step().") │
│ 278 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: unscale_() has already been called on this optimizer since the last update().
About this issue
- State: closed
- Created a year ago
- Comments: 33 (8 by maintainers)
Had the same issue yesterday. Kept getting the error even though I updated the transformers library. I believe what helped was to pip uninstall transformers, restart the kernel, and then pip install -U git+https://github.com/huggingface/transformers@de9255de27abfcae4a1f816b904915f0b1e23cd9.

Hello, able to reproduce this. cc @muellerzr
Reason: gradient accumulation in the Trainer happens across epochs because of total_batched_samples. However, Accelerate resets step at the end of an epoch, which leaves sync_gradients set to False, so the optimizer is not run; the next time clip_grad_norm_ is called, it raises unscale_() has already been called on this optimizer since the last update().

Hello everyone, the above PR #24415 should resolve the issues with gradient accumulation around epoch boundaries.
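For context, here is a minimal sketch of the AMP contract being violated (my own illustration, assuming a CUDA device; this is not the Trainer's code): torch's GradScaler allows unscale_() to be called at most once per optimizer between updates, and it is scaler.step() plus scaler.update() that reset the per-optimizer state. If the optimizer step is skipped while gradient clipping still runs on every iteration, the second clip triggers exactly this RuntimeError.

```python
import torch

# Minimal illustration of the GradScaler contract (requires a CUDA device).
model = torch.nn.Linear(4, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

with torch.autocast("cuda", dtype=torch.float16):
    loss = model(torch.randn(8, 4, device="cuda")).sum()
scaler.scale(loss).backward()

scaler.unscale_(optimizer)  # fine: first unscale_ since the last update
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

# If scaler.step()/scaler.update() never run (e.g. because sync_gradients was
# False and the wrapped optimizer returned early), the per-optimizer state stays
# UNSCALED, and the next clipping pass fails here:
# scaler.unscale_(optimizer)  # RuntimeError: unscale_() has already been called ...

scaler.step(optimizer)  # internally skips the step if inf/nan gradients were found
scaler.update()         # resets the per-optimizer state for the next iteration
```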
I am still able to reproduce this double unscale_() issue with the original stack trace. Using Falcon-Guanaco.ipynb and making the following modifications:
- dataset = dataset.shard(num_shards=80, index=0) before constructing SFTTrainer.
- max_seq_length = 512 changed to max_seq_length = 1024.

After these modifications the trainer.train() call reliably fails. Debugging, I see the following steps happen before the error:
- The step() call in accelerate/optimizer.py returns immediately because of the self.gradient_state.sync_gradients condition. As a result, optimizer_state["stage"] is never transitioned to OptState.READY.
- optimizer_was_run in the call from transformers/trainer.py is (incorrectly?) set to True here.
- The unscale_() error is raised since we call clip_grad_norm_ every time.

Edit: Fix on my fork is working. (A sketch of a sync_gradients guard that avoids the double unscale_() in a custom loop follows below.)
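For anyone writing a custom training loop on top of Accelerate, here is a hedged sketch of that guard (placeholder model and synthetic data, fp16 assumed to run on a CUDA GPU; this is not the notebook's or the Trainer's code): clip only on iterations where accelerator.sync_gradients is True, i.e. where the optimizer will actually step, so unscale_() is never called twice between updates.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Sketch only: tiny placeholder model and synthetic data, not the notebook's setup.
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)

model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
loader = DataLoader(data, batch_size=8)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for inputs, labels in loader:
    with accelerator.accumulate(model):
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        accelerator.backward(loss)
        # Clip (and therefore unscale_) only when gradients are synced, i.e. on
        # iterations where the optimizer actually steps; clipping on
        # accumulation-only iterations is what reproduces the RuntimeError.
        if accelerator.sync_gradients:
            accelerator.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
```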
@younesbelkada I can confirm that using the regular import works. Thanks for all!
@younesbelkada I actually still receive this error when running with 4-bit quantization. Using the same installations and running this notebook, I run into the same error. Note that without QLoRA I can run just fine (i.e. the notebook linked here does work, but this notebook does not).
Any idea why this could be?
Output of pip freeze is below for the relevant libraries.

I confirm it works, although I have no idea why, because the code looks the same and I've used the same env twice. Anyway, thanks for the help!