transformers: OOM when trying to fine-tune patrickvonplaten/led-large-16384-pubmed
I’m currently following this notebook, but instead of allenai/led-large-16384 I’m using patrickvonplaten/led-large-16384-pubmed as the base model and tokenizer:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("patrickvonplaten/led-large-16384-pubmed")
led = AutoModelForSeq2SeqLM.from_pretrained(
    "patrickvonplaten/led-large-16384-pubmed",
    gradient_checkpointing=True,  # trade extra compute for lower activation memory
    use_cache=False,  # the generation cache is useless during training
)
I’m also using my own train/test data. Apart from that, I kept everything in the notebook’s fine-tuning setup the same. However, I’m running into OOM errors
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.78 GiB total capacity; 13.96 GiB already allocated; 20.00 MiB free; 14.56 GiB reserved in total by PyTorch)
0%| | 0/3 [00:10<?, ?it/s]
on a couple of Tesla V100-SXM2-16GB GPUs, and I’m not sure why that might be. batch_size=2 seems pretty small, and I also set gradient_checkpointing=True. @patrickvonplaten and/or the surrounding community, I’d greatly appreciate any help with this.
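For reference, here is a minimal sketch of my training setup, following the notebook’s structure; the exact values (output_dir, logging/save steps, accumulation steps) are placeholders rather than copies of my run:

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="./led-pubmed-finetuned",  # placeholder path
    predict_with_generate=True,
    fp16=True,  # half precision, as in the notebook
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,  # illustrative; effective batch = 2 * 4 per GPU
    logging_steps=10,   # placeholder
    save_steps=500,     # placeholder
)

trainer = Seq2SeqTrainer(
    model=led,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # my own train split (variable name assumed)
    eval_dataset=eval_dataset,    # my own test split (variable name assumed)
)
trainer.train()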
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 29 (18 by maintainers)
The model is actually quite big, so I would expect it to OOM. If you are doing multi-GPU training, you could try the fairscale/DeepSpeed integration to save memory and speed up training; check out this blog post: https://huggingface.co/blog/zero-deepspeed-fairscale

OK, figured it out - I suggested you try disabling gradient checkpointing in the context of being unable to use DeepSpeed, but I didn’t think of asking you to restore this config…
So enable it again:

from_pretrained(MODEL_NAME, gradient_checkpointing=True, ...)

And voilà, this config works just fine. You can go for an even larger length; it should have a very small impact. And I think your batch size can now be even larger, so you can remove gradient_accumulation_steps if wanted - or reduce it. I updated the notebook so you can see it working: https://colab.research.google.com/drive/1rEspdkR839xZzh561OwSYLtFnnKhQdEl?usp=sharing
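To make that concrete, a minimal sketch of the restored configuration (MODEL_NAME stands in for patrickvonplaten/led-large-16384-pubmed, matching the snippet in the issue):

from transformers import AutoModelForSeq2SeqLM

MODEL_NAME = "patrickvonplaten/led-large-16384-pubmed"

# Re-enable gradient checkpointing and keep the generation cache off;
# this is the combination that lets training fit into 16 GB.
led = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_NAME,
    gradient_checkpointing=True,
    use_cache=False,
)

# On recent transformers versions the same switch can also be flipped after loading:
# led.gradient_checkpointing_enable()

On the batch-size point: the effective batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs, so once the per-device batch size can grow, gradient_accumulation_steps can shrink by the same factor.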
Glad to hear you were able to make progress, @mmoya01
What was the command line you used to launch this program? You have to launch it via deepspeed, as the docs instruct.

edit: actually, I just learned that it doesn’t have to be the case - will update the docs shortly, but I still need to know how you started the program. Thank you.
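For anyone reading along, a hedged sketch of the DeepSpeed route: launch with the deepspeed launcher, e.g. deepspeed --num_gpus=2 run_fine_tune.py (the script name is a placeholder), and point the trainer at a ZeRO config file. The config path below is also a placeholder:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./led-pubmed-finetuned",  # placeholder path
    per_device_train_batch_size=2,
    fp16=True,
    deepspeed="ds_config.json",  # placeholder path to a DeepSpeed ZeRO config
)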
It’s odd that you had to do it manually; DeepSpeed’s pip installer should have installed all the dependencies automatically.
I will see if I can reproduce that.
Have you tried it without gradient checkpointing?
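If it helps isolate things, on reasonably recent transformers versions the switch can be toggled without reloading the model (a small sketch, assuming the led model object from earlier):

# Turn gradient checkpointing off to test whether it is involved in the failure...
led.gradient_checkpointing_disable()
# ...and re-enable it afterwards:
led.gradient_checkpointing_enable()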
The failure is not in transformers land, so it’s a bit hard to guess what happened.
I’d recommend filing an Issue with DeepSpeed: https://github.com/microsoft/DeepSpeed/issues