transformers: Seq2seq now has larger memory requirements, OOM w/ DeepSpeed on previously runnable models

(A continuation of #10149, since it looks like it's a broader issue.)

It looks like seq2seq has changed in the past week and now gives out-of-memory errors with @stas00's impressive recent DeepSpeed work, which made it possible to train/predict e.g. T5-11B on a single 40GB card.

Here's a simple, repeatable example using the newer scripts:

Run script:

export OUTPUTDIR=tst-summarization
export BS=1; rm -rf $OUTPUTDIR; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./run_seq2seq.py \
    --model_name_or_path allenai/unifiedqa-t5-11b \
    --do_train \
    --do_eval \
    --do_predict \
    --task summarization \
    --dataset_name xsum \
    --output_dir $OUTPUTDIR \
    --per_device_train_batch_size=$BS \
    --per_device_eval_batch_size=$BS \
    --overwrite_output_dir \
    --predict_with_generate \
    --max_train_samples 500 \
    --max_val_samples 100 \
    --max_test_samples 100

(One note: should I be adding a --deepspeed option, as with the old finetune_trainer.py? I am not seeing it in the list of options. And if so, should it point to the new location of the config file ( …/tests/deepspeed/ds_config.json ), or does it use that location by default?)
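
For what it's worth, if run_seq2seq.py exposes the shared Trainer's --deepspeed argument (as finetune_trainer.py did), I would expect the invocation to look roughly like the sketch below; the config path is an assumption carried over from the old layout, not a verified default:

# Sketch: pass the DeepSpeed config explicitly (path is an assumption)
export DSCONFIG=../../tests/deepspeed/ds_config.json
deepspeed --num_gpus=4 ./run_seq2seq.py \
    --deepspeed $DSCONFIG \
    --model_name_or_path allenai/unifiedqa-t5-11b \
    ...   # plus the remaining arguments from the run script above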

Conda Environment:

# Make new environment
conda create --name transformers-feb12-2021 python=3.8
conda activate transformers-feb12-2021

# Clone transformers
git clone https://github.com/huggingface/transformers.git
cd transformers

# Install nightly build of PyTorch
pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html -U

# Install seq2seq transformers requirements
pip install -r examples/seq2seq/requirements.txt

# Install transformers
pip install -e .

# Install DeepSpeed from source for the A100 support
cd ..
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed/
# Check out the release for DeepSpeed 0.3.10 (to avoid an AMD bug in the latest)
git checkout c14b839d9
./install.sh
pip install .
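
As a quick sanity check of the resulting environment (ds_report is DeepSpeed's own diagnostic and shows which ops were compiled and which torch/CUDA versions it sees):

# Confirm torch sees all four A100s and the expected CUDA version
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.device_count())"
# DeepSpeed environment report
ds_report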

Error:

...
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 2; 39.59 GiB total capacity; 37.87 GiB already allocated; 40.69 MiB free; 37.88 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "./run_seq2seq.py", line 629, in <module>
    main()
  File "./run_seq2seq.py", line 543, in main
    trainer = Seq2SeqTrainer(
  File "/home/pajansen/github/transformers-feb12-2021/transformers/src/transformers/trainer.py", line 276, in __init__
    model = model.to(args.device)
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 3; 39.59 GiB total capacity; 37.87 GiB already allocated; 40.69 MiB free; 37.88 GiB reserved in total by PyTorch)
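
Note that the traceback shows the OOM inside Seq2SeqTrainer.__init__, at model.to(args.device) - i.e. each process is trying to place a full copy of the model on its own GPU before DeepSpeed gets a chance to partition anything. One way to watch this from the outside while the script starts up (a diagnostic sketch, not a fix):

# In a second terminal; with ZeRO partitioning working, no single GPU
# should ever hold the full model during startup
watch -n 1 'nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv'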

Most upvoted comments

Another update: DS currently locks one in, if one wants access to the fp32 model - see https://github.com/microsoft/DeepSpeed/issues/797. Once they add a method to extract the fp32 model (https://github.com/microsoft/DeepSpeed/issues/800), we can sort this out.
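
(For later readers: the resolution of #800 ended up being a zero_to_fp32.py helper that DeepSpeed saves alongside each ZeRO checkpoint, which reconstructs a single fp32 state dict from the partitioned shards. The invocation below is from memory and should be treated as an assumption - check the script's --help:)

# Hypothetical usage; checkpoint-500 is a placeholder directory name
cd $OUTPUTDIR/checkpoint-500
python zero_to_fp32.py . pytorch_model.bin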

Thank you for the details, @PeterAJansen - hoping to validate later in the day, but meanwhile this PR should solve it: https://github.com/huggingface/transformers/pull/10243 (i.e. use that instead of the patch I sent last night).

Edit: the PR has been merged, so master should be OK.
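
(If you installed from a clone with pip install -e . as above, picking up the fix is just a matter of updating the checkout - the editable install tracks the source tree:)

cd transformers
git pull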

Thank you for elucidating your particular situation, @PeterAJansen.

I'm going to run some experiments comparing fp16 eval against fp32 for T5 with WMT, and we shall see. If it works well, then we can make fp16 eval available in the Trainer for those who want to try it.
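
(For reference, the shape this eventually took is a dedicated Trainer flag; --fp16_full_eval is the name that landed on master later, so treat its availability as an assumption on older checkouts:)

# Hypothetical invocation once fp16 eval is exposed: evaluation/prediction
# runs with the model cast to fp16, roughly halving eval-time model memory
deepspeed --num_gpus=4 ./run_seq2seq.py \
    --fp16_full_eval \
    --do_eval --do_predict --predict_with_generate \
    ...   # plus the remaining arguments from the run script above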