transformers: Training hangs at the very start while using deepspeed
Environment info
- transformers version: 4.4.0
- base docker image: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
- Python version: 3.8.8
- PyTorch version (GPU?): 1.7.1 (True)
- Tensorflow version (GPU?): 2.2.1 (True)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes, using deepspeed
Who can help
@stas00 for deepspeed
Information
Model I am using: LayoutLM
I need to train my LayoutLM model for only 1 epoch for testing purposes. However, training hangs at the very start without logging anything or returning an error message. When I disable deepspeed and launch my training with `python -m torch.distributed.launch` instead of `deepspeed --num_gpus={torch.cuda.device_count()} --num_nodes=1`, I manage to train for 1 epoch.
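For context, a minimal sketch of the two launch invocations being contrasted; the script name `train_layoutlm.py` and the config path `ds_config.json` are hypothetical placeholders, not taken from the actual project:

```python
# Sketch of the two launchers being compared (placeholders: train_layoutlm.py,
# ds_config.json). The deepspeed launcher hangs for the reporter; the
# torch.distributed.launch path trains for 1 epoch.
import subprocess
import torch

num_gpus = torch.cuda.device_count()

# 1) deepspeed launcher -- hangs at the very start:
deepspeed_cmd = [
    "deepspeed", f"--num_gpus={num_gpus}", "--num_nodes=1",
    "train_layoutlm.py", "--deepspeed", "ds_config.json",
]

# 2) plain torch.distributed.launch -- works:
torch_cmd = [
    "python", "-m", "torch.distributed.launch",
    f"--nproc_per_node={num_gpus}",
    "train_layoutlm.py",
]

subprocess.run(torch_cmd, check=True)  # swap in deepspeed_cmd to reproduce the hang
```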
The task I am working on is:
- Token Classification
To reproduce
I think it is a general issue: training any model with deepspeed for only one epoch may result in a hanging process.
Expected behavior
It should be possible to train a model for only 1 epoch, so as not to waste time while testing.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 18 (14 by maintainers)
Thanks @stas00 for your kind help. Currently, I don't have time to dive into this issue; since I manage to run in a distributed setting without deepspeed, it is not so urgent for now. I will be working on this issue in the coming weeks.
So you have a syncing problem: the 2 gpus run `barrier`, which ensures they have both arrived at the same point, but one of the gpus doesn't, and so the other is stuck waiting for it.

Are you by chance misconfiguring the launch command? Try to hardcode `2` here: could `torch.cuda.device_count()` be returning a different number than 2? i.e.:
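A sketch of what the hardcoded invocation might look like (again, the script and config names are placeholders):

```python
# Hardcode --num_gpus=2 instead of interpolating torch.cuda.device_count(),
# to rule out the launcher starting a different number of processes than the
# two expected to meet at the barrier. Script/config names are hypothetical.
import subprocess

subprocess.run(
    [
        "deepspeed", "--num_gpus=2", "--num_nodes=1",
        "train_layoutlm.py", "--deepspeed", "ds_config.json",
    ],
    check=True,
)
```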
Also consider using these tools to diagnose the hanging, e.g. `faulthandler`.
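For example, a minimal sketch of using `faulthandler` to dump every rank's traceback while the process appears stuck (the 60-second interval is an arbitrary choice):

```python
# A rough diagnostic sketch: periodically dump every thread's traceback to
# stderr, so a process stuck in a barrier/all-reduce shows up in the logs.
import faulthandler

# Dump tracebacks every 60 seconds until cancelled; repeat=True keeps
# dumping, so you can see whether the process stays in the same frame.
faulthandler.dump_traceback_later(timeout=60, repeat=True)

# ... run the training loop / trainer.train() here ...

# Cancel the periodic dumps once training is progressing normally.
faulthandler.cancel_dump_traceback_later()
```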
Thank you @stas00 for your rapid response. I thought it might be a general issue, which is why I didn't provide any example code. The code I am working on now is confidential; I will follow your advice and let you know afterward.