transformers: UL2 Training with HF Trainer + DeepSpeed Zero3 Results in CUDA Illegal Memory Exception

System Info

  • transformers==4.26.0
  • torch==1.13.1
  • deepspeed==0.8
  • Hardware: 8x A100-80GB

Fine-tuning UL2 with the Huggingface Trainer and DeepSpeed Zero2 or Zero3 results in a CUDA illegal memory access error. This happens with every Huggingface Trainer script, PyTorch version (1.12 and 1.13), DeepSpeed version (0.6.7, 0.7.7, 0.8), and CUDA version (11.3 and 11.8) that I’ve tried. The same scripts work just fine with flan-t5-xxl.

[W CUDAGuardImpl.h:124] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Any thoughts @stas00? Your help would be appreciated.

Who can help?

@stas00

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Try fine-tuning UL2 on any task/dataset using DeepSpeed Zero2/Zero3. You should encounter the error.
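A minimal sketch of the kind of script that triggers it for me, launched with `deepspeed --num_gpus 8 repro.py`. The dataset, sequence lengths, and `ds_config_zero3.json` path are placeholders rather than my exact setup; any seq2seq fine-tuning of google/ul2 behaves the same way.

```python
# Hypothetical minimal repro; dataset and DeepSpeed config path are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/ul2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

raw = load_dataset("xsum", split="train[:256]")  # placeholder dataset

def preprocess(batch):
    features = tokenizer(batch["document"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    features["labels"] = labels["input_ids"]
    return features

train_ds = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="ul2-repro",
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed="ds_config_zero3.json",  # a ZeRO-2 config crashes the same way
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()  # dies with "CUDA error: an illegal memory access was encountered"
```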

Expected behavior

Training proceeds normally.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 35 (19 by maintainers)

Most upvoted comments

Thank you so much, Stas. You’re right that sub_group_size is 1e9 in the HF DeepSpeed integration docs, but there’s a sample config with 1e12 on the DeepSpeed ZeRO doc page (https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training) and I think that’s where I got it from. I’ll open up an issue in DeepSpeed. Thanks again for going above and beyond.
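For reference, a sketch of the ZeRO-3 block in question, expressed as the dict you can pass to TrainingArguments(deepspeed=...); only sub_group_size is the point here, the surrounding keys are the usual values from the HF examples, not my exact config.

```python
# Sketch of the relevant ZeRO-3 section. The 1e12 value I had originally came
# from the sample config in the DeepSpeed docs; the HF integration docs use 1e9.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "sub_group_size": 1e9,  # was 1e12, copied from the DeepSpeed sample config
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# TrainingArguments(..., deepspeed=ds_config) accepts the dict directly,
# or write it out to a json file and pass the path instead.
```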

I’ve opened a request to make this recurring experience of embedding lookup explosions on CUDA less painful for users: https://github.com/pytorch/pytorch/issues/93880

but what did you change to fix the smaller one? I hope you didn’t use my % hack - it was just to show you what the problem was - it of course wasn’t meant to be a solution - apologies if it wasn’t obvious.

the larger model most likely has a different vocab size, so you really need to understand your setup well enough to read the config correctly and get the tokenizer set up right - usually this is done for you, but since you wrote custom code this is where you’d check.

First make this small model work correctly w/o hardcoding any numbers - then move on to the large one and most likely it’ll just work.
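A sketch of the kind of sanity check meant here (the checkpoint name is illustrative, not code from the thread): confirm that the tokenizer and the checkpoint agree on the vocab and that no token id can run off the end of the embedding table, instead of masking the symptom with a modulo.

```python
# Hypothetical sanity check: a tokenizer/config vocab mismatch makes the
# embedding lookup read out of bounds, and CUDA then reports an illegal
# memory access far away from the real culprit.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "yhavinga/ul2-base-en-nl"  # swap in whichever checkpoint you're debugging
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

print("model vocab_size:", model.config.vocab_size)
print("tokenizer len:   ", len(tokenizer))

batch = tokenizer(["a quick smoke test"], return_tensors="pt")
assert batch["input_ids"].max().item() < model.config.vocab_size, (
    "token id outside the embedding table - fix the tokenizer/config mismatch "
    "rather than clamping ids with `% vocab_size`"
)

# If len(tokenizer) > model.config.vocab_size (e.g. after adding tokens),
# resize the embeddings instead of hardcoding numbers:
# model.resize_token_embeddings(len(tokenizer))
```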

Thank you so much, Stas!

yes, they are ok at the outputs = model(**inputs) frame and then are borked at the point of dropout, but this happens much sooner. I will have a look.

It breaks somewhere inside T5Stack.forward.
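In case it helps anyone trying to narrow this down themselves, a sketch of one way to localize where activations go bad (assumes `model` is already built and CUDA_LAUNCH_BLOCKING=1 is set so the failing kernel is reported synchronously); it only catches the common case where the corrupted values are non-finite.

```python
# Hypothetical debugging aid: forward hooks that flag the first module whose
# output contains non-finite values, to localize the failure inside T5Stack.
import torch

def make_hook(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, (tuple, list)) else output
        if torch.is_tensor(out) and out.is_floating_point() and not torch.isfinite(out).all():
            print(f"first non-finite activation seen in: {name}")
    return hook

handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
# ... run a single forward pass / training step here ...
for h in handles:
    h.remove()
```

transformers also ships transformers.debug_utils.DebugUnderflowOverflow, which does a more thorough version of this kind of activation tracing.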

Running with --model Finnish-NLP/ul2-small-nl24-finnish works for me as well with any number of gpus (from 1 to 8).

But I don’t think it’s representative because it uses a different activation function than google/ul2. Unfortunately there are no “real” smaller UL2 models, unlike the flan-t5 series where everything is the same except for scale.

UPDATE: I take that back. yhavinga/ul2-base-en-nl also uses gated-silu. Running that experiment now.
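A quick way to compare the candidate checkpoints without downloading any weights (a sketch; it just prints each model’s config rather than asserting anything about them):

```python
# Sketch: compare activation function and vocab size across UL2-style checkpoints.
from transformers import AutoConfig

for name in ("google/ul2",
             "Finnish-NLP/ul2-small-nl24-finnish",
             "yhavinga/ul2-base-en-nl"):
    cfg = AutoConfig.from_pretrained(name)
    print(f"{name}: feed_forward_proj={cfg.feed_forward_proj}, vocab_size={cfg.vocab_size}")
```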