transformers: UL2 Training with HF Trainer + DeepSpeed Zero3 Results in CUDA Illegal Memory Exception
System Info
- transformers==4.26.0
- torch==1.13.1
- deepspeed==0.8
- hardware: 8x A100-80GB
Fine-tuning UL2 with the Hugging Face Trainer and DeepSpeed ZeRO-2 or ZeRO-3 results in a CUDA illegal memory access. This happens with every Hugging Face Trainer script, PyTorch version (1.12 and 1.13), DeepSpeed version (0.6.7, 0.7.7, 0.8), and CUDA version (11.3 and 11.8) that I have tried. The same scripts work just fine with flan-t5-xxl.
[W CUDAGuardImpl.h:124] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
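As the warning says, the reported stack trace is unreliable because CUDA errors surface asynchronously; rerunning with `CUDA_LAUNCH_BLOCKING=1` makes the failing kernel report at its real call site. A minimal sketch of setting it from inside the script (exporting the variable in the shell before launching works just as well; it must take effect before any CUDA work happens):

```python
import os

# Must be set before PyTorch initializes CUDA, otherwise it has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the env var is set, on purpose
```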
Any thoughts @stas00? Your help would be appreciated.
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Try fine-tuning UL2 on any task/dataset using DeepSpeed ZeRO-2 or ZeRO-3. You should encounter the error.
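For context, here is a minimal reproduction sketch of this setup (not the exact script from the report; the dataset and most hyperparameters are placeholders, and the ZeRO-3 config is passed inline as a dict, which the Trainer accepts in place of a JSON file):

```python
# Sketch: fine-tune google/ul2 with the HF Trainer and DeepSpeed ZeRO-3.
# Launch with the DeepSpeed launcher, e.g.: deepspeed --num_gpus=8 repro.py
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "google/ul2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = Seq2SeqTrainingArguments(
    output_dir="ul2-out",
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed=ds_config,  # or a path to a ds_config_zero3.json file
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: any tokenized seq2seq dataset
    tokenizer=tokenizer,
)
trainer.train()  # crashes with the illegal memory access shown above
```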
Expected behavior
Training proceeds normally.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 35 (19 by maintainers)
Thank you so much, Stas. You're right that `sub_group_size` is 1e9 in the HF DeepSpeed integration docs, but there's a sample config with 1e12 on the DeepSpeed ZeRO doc page (https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training) and I think that's where I got it from. I'll open up an issue in DeepSpeed. Thanks again for going above and beyond.

I'm requesting to make this recurring experience of an embedding-lookup explosion on CUDA less painful for users here: https://github.com/pytorch/pytorch/issues/93880
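For reference, the knob in question lives in the `zero_optimization` block of the ZeRO-3 config; a fragment, written here as a Python dict as it would be passed inline to the Trainer:

```python
# Fragment of a ZeRO-3 config showing the setting discussed above.
zero3_fragment = {
    "zero_optimization": {
        "stage": 3,
        # The HF DeepSpeed-integration docs use 1e9 here; the DeepSpeed
        # config-json page shows a sample with 1e12, which is where the
        # larger value came from.
        "sub_group_size": 1e9,
    }
}
```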
but what did you change to fix the smaller one? I hope you didn't use my `%` hack - it was only there to show you what the problem was, it of course wasn't meant to be a solution - apologies if that wasn't obvious.

The larger model most likely has a different vocab size, so you really need to figure out your setup so that it reads the config correctly and sets up the tokenizer right - usually this is mostly done for you, but this is where you'd check, since you wrote your own custom code.

First make this small model work correctly without hardcoding any numbers - then move on to the large one and most likely it'll just work.
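For anyone landing here later, a small diagnostic along the lines of what's being described (the `%` trick above only masks out-of-range ids to prove the point; the real check is that the tokenizer and the model config agree, since out-of-range indices in the embedding lookup are a classic source of these opaque CUDA failures). The `check_batch` helper is made up for illustration:

```python
from transformers import AutoConfig, AutoTokenizer

name = "google/ul2"  # or the small test checkpoint
config = AutoConfig.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

print("config.vocab_size:", config.vocab_size)
print("len(tokenizer)   :", len(tokenizer))

def check_batch(batch):
    """Hypothetical helper: call right before model(**batch)."""
    max_id = int(batch["input_ids"].max())
    assert max_id < config.vocab_size, (
        f"token id {max_id} >= vocab_size {config.vocab_size}; "
        "a tokenizer/config mismatch will blow up the embedding lookup on CUDA"
    )
```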
Thank you so much, Stas!
yes, they are ok at the `outputs = model(**inputs)` frame and then are borked at the point of dropout, but this happens much sooner. I will have a look.

It breaks somewhere inside `T5Stack.forward`.

Running with `--model Finnish-NLP/ul2-small-nl24-finnish` works for me as well with any number of GPUs (from 1 to 8). But I don't think it's representative because it uses a different activation function than google/ul2. Unfortunately there are no "real" smaller UL2 models, unlike the flan-t5 series where everything is the same except for scale.
UPDATE: I take that back. yhavinga/ul2-base-en-nl also uses gated-silu. Running that experiment now.
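A quick way to compare the FFN/activation variant across candidate checkpoints is to read it off the config; a sketch, assuming all three checkpoints use a T5-family config where `feed_forward_proj` encodes the gated activation (the names are the ones mentioned in this thread):

```python
from transformers import AutoConfig

for name in ("google/ul2",
             "Finnish-NLP/ul2-small-nl24-finnish",
             "yhavinga/ul2-base-en-nl"):
    cfg = AutoConfig.from_pretrained(name)
    print(f"{name}: feed_forward_proj={getattr(cfg, 'feed_forward_proj', 'n/a')}")
```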