pytorch-lightning: Data loading hangs before first validation step
🐛 Bug
After the training epoch, before the first validation step, training gets stuck somewhere in the data loaders (I think).
Unfortunately I can't provide a reproduction script: getting the training into this specific situation takes a long time (it must train long enough for the hang to arise).
I train on 4x 1080 Ti using DDP and num_workers=20. After the first training epoch, before the first validation, training gets stuck. All GPUs are reported to be at 100% compute and memory utilization, but only 50/250 W power consumption. Only the 4 main Python threads seem to be doing any work (busy looping?). The 20 worker processes seem to have been stopped already.
To me it looks like the main threads are still busy waiting for new samples, while the data loader workers have already gone away.
Note that I use limit_train_batches=0.1; maybe this is the cause?
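For context, here is a minimal sketch of the kind of Trainer setup described above. This is not the actual training script: MyModel and MyDataModule are hypothetical placeholders, and I'm assuming DDP is selected via distributed_backend, as was usual on this Lightning version (0.10.0).

```python
# Sketch only: model/datamodule names are placeholders, not from the report.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=4,                     # 4x GeForce GTX 1080 Ti
    distributed_backend="ddp",  # DDP: one training process per GPU
    limit_train_batches=0.1,    # the setting suspected above
    max_epochs=10,
)
# trainer.fit(MyModel(), MyDataModule())  # DataLoaders created with num_workers=20
```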
Unfortunately I don't have ptrace capability on the machine, so I can't use GDB etc. I printed the stack traces of all Python threads every 10 s using a debugging thread. Logs of the hang situation are here: https://gist.github.com/jonashaag/b74ae9fc9267bde2cecd35ae316232c0
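One way to get such periodic stack dumps without ptrace, as a sketch of the general approach (not necessarily the exact debugging thread used here):

```python
# Periodically print the stack of every Python thread to stderr.
import sys
import threading
import time
import traceback

def dump_all_stacks(interval: int = 10) -> None:
    while True:
        time.sleep(interval)
        for thread_id, frame in sys._current_frames().items():
            print(f"--- thread {thread_id} ---", file=sys.stderr)
            traceback.print_stack(frame, file=sys.stderr)

threading.Thread(target=dump_all_stacks, daemon=True).start()

# The stdlib can also do this without a custom thread:
# import faulthandler; faulthandler.dump_traceback_later(10, repeat=True)
```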
I am currently training without limit_train_batches to see if it's due to that setting. EDIT: No, I can also reproduce the hang without limit_train_batches set.
Environment
* CUDA:
- GPU:
- GeForce GTX 1080 Ti
- GeForce GTX 1080 Ti
- GeForce GTX 1080 Ti
- GeForce GTX 1080 Ti
- available: True
- version: 11.0
* Packages:
- numpy: 1.19.2
- pyTorch_debug: True
- pyTorch_version: 1.8.0.dev20201028
- pytorch-lightning: 0.10.0
- tqdm: 4.51.0
* System:
- OS: Linux
- architecture:
- 64bit
- processor: x86_64
- python: 3.7.8
- version: #88~16.04.1-Ubuntu SMP Wed Feb 12 04:19:15 UTC 2020
About this issue
- State: closed
- Created 4 years ago
- Reactions: 4
- Comments: 27 (7 by maintainers)
Great. I'm actually still getting this even with ddp_spawn and num_workers=0 (FYI everyone else).
Same issue here, even with num_workers=0 and ddp_spawn settings. Anyone got a fix?
The validation step seems to be much longer than the training… did anyone find a solution to this issue?
I also got the same behavior: PL is stuck in an infinite loop trying to get a batch, but it doesn't reach the dataset, and I really don't know what to do. This is on CPU for me.
It seems that the code is stuck in an infinite loop in the PyTorch dataloader.py file at line 1147. It was fixed when I set num_workers=0, but it's still weird because I used more workers before without issue.
I've had the same issue and managed to fix it by setting the DataLoader persistent_workers parameter to True. Without it, workers get killed at the end of each epoch and then recreated at the start of the next, which was so slow for me that training with 0 workers was faster. This happened not only at the start of each epoch but also when switching between training, validation and testing. With this option, however, it's definitely worth having num_workers > 0 and pin_memory = True, as there is no more delay.
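A minimal sketch of that workaround, assuming PyTorch >= 1.7 (where DataLoader gained the persistent_workers argument); the dataset here is just a stand-in:

```python
# Keep DataLoader workers alive across epochs and train/val/test switches.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 10, (1000,)))

train_loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,            # > 0 pays off again once workers persist
    persistent_workers=True,  # don't kill/respawn workers at every epoch boundary
    pin_memory=True,          # faster host-to-GPU transfers
)
```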