transformers: Seq2seq now has larger memory requirements, OOM w/ DeepSpeed on previously runnable models
(A continuation of #10149, since it looks like it’s a broader issue.)
It looks like seq2seq changed in the past week and now gives out-of-memory errors with @stas00’s impressive recent DeepSpeed work, which allowed training/predicting e.g. T5-11B on a single 40GB card.
Here’s a simple reproducible example using the newer scripts:
Run script:
export OUTPUTDIR=tst-summarization
export BS=1; rm -rf $OUTPUTDIR; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./run_seq2seq.py \
--model_name_or_path allenai/unifiedqa-t5-11b \
--do_train \
--do_eval \
--do_predict \
--task summarization \
--dataset_name xsum \
--output_dir $OUTPUTDIR \
--per_device_train_batch_size=$BS \
--per_device_eval_batch_size=$BS \
--overwrite_output_dir \
--predict_with_generate \
--max_train_samples 500 \
--max_val_samples 100 \
--max_test_samples 100
(One note: should I be adding a --deepspeed option, as with the old finetune_trainer.py? I am not seeing it in the list of options. And if so, should it point to the new location of the config file (…/tests/deepspeed/ds_config.json), or is that location used by default?)
Conda Environment:
# Make new environment
conda create --name transformers-feb12-2021 python=3.8
conda activate transformers-feb12-2021
# Clone transformers
git clone https://github.com/huggingface/transformers.git
cd transformers
# Install nightly build of PyTorch
pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html -U
# Install seq2seq transformers requirements
pip install -r examples/seq2seq/requirements.txt
# Install transformers
pip install -e .
# Install DeepSpeed from source for the A100 support
cd ..
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed/
# Check out the release for DeepSpeed 0.3.10 (to avoid the AMP bug in the latest master)
git checkout c14b839d9
./install.sh
pip install .
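(Optional, and not part of the recipe above: a quick sanity check that the installs resolved and that PyTorch sees all four GPUs before launching.)

# Sanity check: report installed versions and GPU visibility
import torch, transformers, deepspeed
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("deepspeed:", deepspeed.__version__)
print("CUDA available:", torch.cuda.is_available(), "| GPUs:", torch.cuda.device_count())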
Error:
...
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 2; 39.59 GiB total capacity; 37.87 GiB already allocated; 40.69 MiB free; 37.88 GiB reserved in total by PyTorch)
Traceback (most recent call last):
File "./run_seq2seq.py", line 629, in <module>
main()
File "./run_seq2seq.py", line 543, in main
trainer = Seq2SeqTrainer(
File "/home/pajansen/github/transformers-feb12-2021/transformers/src/transformers/trainer.py", line 276, in __init__
model = model.to(args.device)
File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
return self._apply(convert)
File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
param_applied = fn(param)
File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 3; 39.59 GiB total capacity; 37.87 GiB already allocated; 40.69 MiB free; 37.88 GiB reserved in total by PyTorch)
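For what it’s worth, the traceback shows the OOM fires inside Trainer.__init__ at model.to(args.device), i.e. while each process is placing a full fp32 copy of the model on its own GPU, before DeepSpeed has a chance to shard anything. A back-of-envelope estimate (assuming ~11B parameters at 4 bytes each) shows why that cannot fit on a 40GB card:

# Rough estimate of a full fp32 copy of an ~11B-parameter model on one GPU
params = 11e9          # ~11 billion parameters
bytes_per_param = 4    # fp32
print(f"{params * bytes_per_param / 2**30:.1f} GiB")  # ~41.0 GiB vs. 39.59 GiB capacity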
Another update: DeepSpeed currently locks one in if one wants to be able to access the fp32 model, see https://github.com/microsoft/DeepSpeed/issues/797. Once they add a method to extract the fp32 model (https://github.com/microsoft/DeepSpeed/issues/800), we can sort this out.
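(For context on why a dedicated extraction method is needed: naively upcasting the saved fp16 weights only widens the storage dtype; it cannot recover the fp32 master weights that ZeRO keeps in the optimizer state. A sketch, with hypothetical filenames:)

import torch

# Naive upcast (hypothetical filenames): this does NOT restore the precision
# held in DeepSpeed's fp32 master weights, which live in the optimizer state
fp16_sd = torch.load("pytorch_model.bin", map_location="cpu")
fp32_sd = {k: (v.float() if v.is_floating_point() else v) for k, v in fp16_sd.items()}
torch.save(fp32_sd, "pytorch_model_fp32.bin")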
Thank you for the details, @PeterAJansen. Hoping to validate later in the day, but meanwhile this PR should solve it: https://github.com/huggingface/transformers/pull/10243 (i.e. instead of the patch I sent last night).
Edit: the PR has been merged, so master should be OK.
Thank you for elucidating your particular situation, @PeterAJansen
I’m going to run some experiments comparing fp16 eval against fp32 for T5 with WMT, and we shall see. If it works well, we can make fp16 eval available in the Trainer for those who want to try it.
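In the meantime, for anyone who wants to try it by hand, here is a minimal sketch of fp16 evaluation outside the Trainer using PyTorch’s autocast (t5-small is just a stand-in checkpoint):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")  # stand-in; any seq2seq checkpoint works
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").cuda().eval()

inputs = tok("summarize: The quick brown fox jumped over the lazy dog.", return_tensors="pt").to("cuda")
with torch.no_grad(), torch.cuda.amp.autocast():  # generation runs under fp16 autocast
    out = model.generate(**inputs, max_length=64)
print(tok.decode(out[0], skip_special_tokens=True))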