transformers: Training hangs at the very start while using deepspeed

Environment info

  • transformers version: 4.4.0
  • base docker image: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
  • Python version: 3.8.8
  • PyTorch version (GPU?): 1.7.1 (True)
  • Tensorflow version (GPU?): 2.2.1 (True)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes, using deepspeed

Who can help

@stas00 for deepspeed

Information

Model I am using: LayoutLM

For testing purposes, I need to train my LayoutLM model for only 1 epoch. However, training hangs at the very start without logging anything or returning an error message. When I disable DeepSpeed and launch the training with python -m torch.distributed.launch instead of deepspeed --num_gpus={torch.cuda.device_count()} --num_nodes=1, I manage to train for 1 epoch.

The task I am working on is:

  • Token Classification

To reproduce

I think this is a general issue, so training any model with DeepSpeed for only one epoch may result in a hanging process.
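
A minimal sketch of the kind of 1-epoch run in question, assuming the standard Trainer/DeepSpeed integration; the checkpoint, dummy dataset, and ds_config.json below are placeholders, not the actual (confidential) training code:

# minimal_repro.py - hypothetical sketch only, launched e.g. with:
#   deepspeed --num_gpus=2 --num_nodes=1 minimal_repro.py
import os
import torch
from torch.utils.data import Dataset
from transformers import LayoutLMForTokenClassification, Trainer, TrainingArguments

class DummyTokenClsDataset(Dataset):
    # tiny synthetic token-classification dataset so the sketch is self-contained
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        seq_len = 16
        return {
            "input_ids": torch.randint(0, 100, (seq_len,)),
            "bbox": torch.zeros(seq_len, 4, dtype=torch.long),
            "attention_mask": torch.ones(seq_len, dtype=torch.long),
            "labels": torch.zeros(seq_len, dtype=torch.long),
        }

model = LayoutLMForTokenClassification.from_pretrained("microsoft/layoutlm-base-uncased")

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=1,                           # the single test epoch that hangs
    per_device_train_batch_size=2,
    deepspeed="ds_config.json",                   # placeholder DeepSpeed config file
    local_rank=int(os.getenv("LOCAL_RANK", -1)),  # set by the launcher
)

Trainer(model=model, args=args, train_dataset=DummyTokenClsDataset()).train()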

Expected behavior

It should be possible to train a model for only 1 epoch so as not to waste time while testing.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 18 (14 by maintainers)

Most upvoted comments

Thanks @stas00 for your kind help. Currently I don't have time to dive into this issue; since I can run in a distributed setting without DeepSpeed, it is not urgent for now. I will be working on this issue in the coming weeks.

So you have a syncing problem: the 2 GPUs run a barrier which ensures they both arrive at the same point, but one of the GPUs never reaches it, so the other is stuck waiting for it.
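
As an illustration (a generic sketch, not the actual trainer code), the failure mode looks like this with a bare torch.distributed barrier; launched with 2 processes, rank 0 hangs forever because rank 1 never reaches the barrier:

# hang_sketch.py - hypothetical illustration of one rank missing a barrier
# launch with e.g.: python -m torch.distributed.launch --nproc_per_node=2 hang_sketch.py
import torch.distributed as dist

dist.init_process_group("gloo")  # "nccl" in a real multi-GPU setup

if dist.get_rank() == 0:
    dist.barrier()  # rank 0 waits here for everyone else ...
else:
    pass            # ... but rank 1 never calls barrier(), so rank 0 is stuck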

Are you by chance misconfiguring the launch command? Try to hardcode 2 here:

deepspeed --num_gpus={torch.cuda.device_count()} --num_nodes=1

Could {torch.cuda.device_count()} be returning a different number than 2?

i.e.:

deepspeed --num_gpus=2 --num_nodes=1
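
As a quick sanity check (just a snippet to run on the node before launching), you can verify what that expression evaluates to:

import torch
print(torch.cuda.device_count())  # expected to print 2 if both GPUs are visible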

Also consider using these tools to diagnose the hanging:

  • py-spy:
# trace a running python application - e.g. when it's hanging or very slow and you want to see the backtrace 
pip install py-spy
# dumps traceback for each thread
sudo py-spy dump --pid PID # sudo may or may not be needed
  • faulthandler
# dump the traceback periodically - every 20 seconds in this case
import faulthandler
faulthandler.dump_traceback_later(20, repeat=True)
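
If you put those two faulthandler lines near the top of your training script, each process dumps its Python traceback to stderr every 20 seconds (the interval is arbitrary), which shows exactly which call each rank is stuck in. With py-spy, note that a 2-GPU run has 2 worker processes, so you would dump each worker's PID separately.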

Thank you @stas00 for your rapid response. I thought it might be a general issue, which is why I didn't provide any example code. The code I am working on is confidential; I will follow your advice and let you know afterwards.