transformers: Error when setting a high batch-size: `AttributeError: 'NoneType' object has no attribute 'backward'`

System Info

Transformers version: latest@github
Accelerate version: latest@github
DeepSpeed version: latest@github

Who can help?

@pacman100 @sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Script: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py

Use a high per_device_train_batch_size with auto_find_batch_size enabled and let the Trainer drop the batch size when it runs out of memory. Launched with torchrun and DeepSpeed ZeRO-2.
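
The equivalent Trainer settings look roughly like the sketch below (model, dataset, and DeepSpeed config paths are placeholders; the actual run used run_clm.py launched with torchrun):

```python
# Rough equivalent of the run_clm.py arguments used for the repro.
# All paths and names here are placeholders, not the actual ones.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=4,
    per_device_train_batch_size=32,    # deliberately too large for a single A100 80GB
    auto_find_batch_size=True,         # lets the Trainer halve the batch size on OOM
    gradient_accumulation_steps=1,
    deepspeed="ds_config_zero2.json",  # DeepSpeed ZeRO-2 config (placeholder path)
)
# Trainer(model=..., args=args, train_dataset=...).train() then hits the crash below.
```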

[INFO|trainer.py:1786] 2023-06-28 09:03:54,973 >> ***** Running training *****
[INFO|trainer.py:1787] 2023-06-28 09:03:54,973 >>   Num examples = 338
[INFO|trainer.py:1788] 2023-06-28 09:03:54,973 >>   Num Epochs = 4
[INFO|trainer.py:1789] 2023-06-28 09:03:54,973 >>   Instantaneous batch size per device = 32
[INFO|trainer.py:1790] 2023-06-28 09:03:54,973 >>   Total train batch size (w. parallel, distributed & accumulation) = 256
[INFO|trainer.py:1791] 2023-06-28 09:03:54,973 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1792] 2023-06-28 09:03:54,973 >>   Total optimization steps = 8
[INFO|trainer.py:1793] 2023-06-28 09:03:54,974 >>   Number of trainable parameters = 8,388,608
  0%|          | 0/8 [00:00<?, ?it/s]
[INFO|trainer.py:1786] 2023-06-28 09:04:12,933 >> ***** Running training *****
[INFO|trainer.py:1787] 2023-06-28 09:04:12,933 >>   Num examples = 338
[INFO|trainer.py:1788] 2023-06-28 09:04:12,934 >>   Num Epochs = 4
[INFO|trainer.py:1789] 2023-06-28 09:04:12,934 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1790] 2023-06-28 09:04:12,934 >>   Total train batch size (w. parallel, distributed & accumulation) = 256
[INFO|trainer.py:1791] 2023-06-28 09:04:12,934 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1792] 2023-06-28 09:04:12,934 >>   Total optimization steps = 12
[INFO|trainer.py:1793] 2023-06-28 09:04:12,936 >>   Number of trainable parameters = 8,388,608
  0%|          | 0/8 [00:16<?, ?it/s]
Traceback (most recent call last):
  File "/app/finetune.py", line 796, in <module>
    main()
  File "/app/finetune.py", line 732, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/memory.py", line 132, in decorator
    return function(batch_size, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2770, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1849, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
AttributeError: 'NoneType' object has no attribute 'backward'

In this case, I knowingly set per_device_train_batch_size to 32, which is too large for an A100 80GB. The Trainer drops the batch size from 32 to 16 when it runs out of memory (which is the expected behavior), but then fails in self.accelerator.backward(loss).

I don’t see this issue when I set a batch size that fits on the GPU, only when it overflows memory. I suspect accelerator.prepare needs to be called again with the corrected batch size.
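
For reference, the retry path in the traceback is accelerate’s find_executable_batch_size decorator; a minimal sketch of how it re-enters the training function is below (the body is a stand-in for Trainer._inner_training_loop, not the real code):

```python
# Minimal sketch of the retry mechanism from accelerate/utils/memory.py
# seen in the traceback. The body is a placeholder, not Trainer code.
from accelerate.utils import find_executable_batch_size

@find_executable_batch_size(starting_batch_size=32)
def inner_training_loop(batch_size):
    # On a CUDA OOM the decorator halves batch_size and calls this function
    # again from the top. In Trainer, that re-entry is where the Accelerator
    # state (e.g. the wrapped DeepSpeed engine) would need to be rebuilt
    # before self.accelerator.backward(loss) is reached again.
    print(f"Trying batch_size={batch_size}")
    ...  # build dataloaders, run forward/backward

inner_training_loop()
```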

Expected behavior

Trainer drops the batch size from 32 to 16 and training continues without failure.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 24 (7 by maintainers)

Most upvoted comments

Thanks for the ping, I’ll take a look at this today or tomorrow!

@mhillebrand it’ll be closed after we merge #28088, which adds support for the auto batch size finder 😃

I don’t think QLoRA is supported with DeepSpeed.

I use DeepSpeed (ZeRO-2) with both LoRA and QLoRA, and it works great—until I enable auto_find_batch_size.

Thanks for providing more details. This is a niche issue, so we will prioritize it based on the available bandwidth.

This is a niche issue? I feel like most people would rather use auto_find_batch_size and avoid OOM errors with ease. BTW, I was wrong: this problem does occur when fine-tuning Llama 2 models.

Thanks @ekkkkki for the context. I must have missed this. @muellerzr is this enough to go on or would you like more details?