transformers: Error when setting a high batch-size: `AttributeError: 'NoneType' object has no attribute 'backward'`
System Info
- Transformers version: latest@github
- Accelerate version: latest@github
- Deepspeed version: latest@github
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Script: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py
Use a high per_device_train_batch_size with auto_find_batch_size enabled, and let Trainer drop the batch size. Launched with torchrun and DeepSpeed ZeRO-2.
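For reference, a rough Python equivalent of the failing configuration (not the exact command that was run; the output directory and the DeepSpeed config path are placeholders):

```python
# Sketch of the arguments that trigger the failure: an oversized per-device
# batch size plus auto_find_batch_size, so Trainer retries with a smaller
# batch after the OOM, all under a DeepSpeed ZeRO-2 config.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,    # deliberately too large for one A100-80GB
    auto_find_batch_size=True,         # lets Trainer halve the batch size on OOM
    gradient_accumulation_steps=1,
    num_train_epochs=4,
    deepspeed="ds_config_zero2.json",  # must point to an existing ZeRO-2 config JSON
)
```

run_clm.py exposes the same options as CLI flags (e.g. --per_device_train_batch_size, --auto_find_batch_size, --deepspeed) when launched with torchrun.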
[INFO|trainer.py:1786] 2023-06-28 09:03:54,973 >> ***** Running training *****
[INFO|trainer.py:1787] 2023-06-28 09:03:54,973 >> Num examples = 338
[INFO|trainer.py:1788] 2023-06-28 09:03:54,973 >> Num Epochs = 4
[INFO|trainer.py:1789] 2023-06-28 09:03:54,973 >> Instantaneous batch size per device = 32
[INFO|trainer.py:1790] 2023-06-28 09:03:54,973 >> Total train batch size (w. parallel, distributed & accumulation) = 256
[INFO|trainer.py:1791] 2023-06-28 09:03:54,973 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1792] 2023-06-28 09:03:54,973 >> Total optimization steps = 8
[INFO|trainer.py:1793] 2023-06-28 09:03:54,974 >> Number of trainable parameters = 8,388,608
0%| | 0/8 [00:00<?, ?it/s][INFO|trainer.py:1786] 2023-06-28 09:04:12,933 >> ***** Running training *****
[INFO|trainer.py:1787] 2023-06-28 09:04:12,933 >> Num examples = 338
[INFO|trainer.py:1788] 2023-06-28 09:04:12,934 >> Num Epochs = 4
[INFO|trainer.py:1789] 2023-06-28 09:04:12,934 >> Instantaneous batch size per device = 16
[INFO|trainer.py:1790] 2023-06-28 09:04:12,934 >> Total train batch size (w. parallel, distributed & accumulation) = 256
[INFO|trainer.py:1791] 2023-06-28 09:04:12,934 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1792] 2023-06-28 09:04:12,934 >> Total optimization steps = 12
[INFO|trainer.py:1793] 2023-06-28 09:04:12,936 >> Number of trainable parameters = 8,388,608
0%| | 0/8 [00:16<?, ?it/s]
Traceback (most recent call last):
File "/app/finetune.py", line 796, in <module>
main()
File "/app/finetune.py", line 732, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/memory.py", line 132, in decorator
return function(batch_size, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1938, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2770, in training_step
self.accelerator.backward(loss)
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1849, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
AttributeError: 'NoneType' object has no attribute 'backward'
In this case, I knowingly set per_device_train_batch_size to 32, which is too large for an A100-80GB. Trainer drops the batch size from 32 to 16 when it overflows (which is expected behavior) but then fails in self.accelerator.backward(loss). I don't see this issue when I set a batch size that fits on the GPU, only when it overflows. I suspect accelerator.prepare needs to be called again with the corrected batch size.
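For context on why _inner_training_loop gets re-entered with half the batch size, here is a rough, self-contained illustration of the accelerate decorator that appears in the traceback (accelerate/utils/memory.py). The training function and the simulated OOM below are stand-ins, not Trainer's real code:

```python
# find_executable_batch_size retries the wrapped function with a halved
# batch_size whenever the raised error looks like a CUDA out-of-memory
# failure. This mirrors the 32 -> 16 drop in the logs above.
from accelerate.utils import find_executable_batch_size

@find_executable_batch_size(starting_batch_size=32)
def inner_training_loop(batch_size):
    if batch_size > 16:
        raise RuntimeError("CUDA out of memory.")  # pretend 32 does not fit
    print(f"training runs with batch_size={batch_size}")

inner_training_loop()  # attempt with 32 fails, retry with 16 succeeds
```

The retry simply calls the decorated function again; nothing outside it is re-prepared between attempts, which is consistent with the suspicion above that the DeepSpeed engine wrapper ends up as None on the second pass.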
Expected behavior
Trainer drops the batch size from 32 to 16 and training continues without failure.
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 24 (7 by maintainers)
Thanks for the ping, I’ll take a look at this today or tomorrow!
@mhillebrand it’ll be closed after we merge #28088 which adds the support in for auto batch size finder 😃
I use DeepSpeed (ZeRO-2) with both LoRA and QLoRA, and it works great until I enable auto_find_batch_size. Is this really a niche issue? I feel like most people would rather make use of auto_find_batch_size and avoid OOM errors with ease. BTW, I was wrong: this problem does occur when finetuning Llama 2 models.

Thanks @ekkkkki for the context. I must have missed this. @muellerzr is this enough to go on, or would you like more details?