pytorch-lightning: Data loading hangs before first validation step

šŸ› Bug

After the first training epoch, just before the first validation step, training gets stuck, apparently somewhere in the data loaders.

Unfortunately I can’t provide a reproduction script: getting training into this specific situation requires training for a long time before the hang arises.

I train on 4x 1080 Ti using DDP and num_workers=20. After the first training epoch, before the first validation, training gets stuck. All GPUs are reported at 100% compute and memory utilization, but draw only 50/250 W. Only the 4 main Python threads seem to be doing any work (busy looping?). The 20 worker processes appear to have been stopped already.

To me it looks like the main threads are still busy-waiting for new samples, while the data loader workers are already gone.

Note that I use limit_train_batches=0.1; maybe this is the cause?
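
For context, here is a rough sketch of the setup described above; the model, datasets, and batch size are placeholders, and the exact Trainer arguments are my reconstruction rather than a verified reproduction:

    import pytorch_lightning as pl
    from torch.utils.data import DataLoader

    # Placeholder datasets/model; only the loader and Trainer settings matter here.
    train_loader = DataLoader(train_dataset, batch_size=32, num_workers=20, pin_memory=True)
    val_loader = DataLoader(val_dataset, batch_size=32, num_workers=20, pin_memory=True)

    trainer = pl.Trainer(
        gpus=4,
        distributed_backend="ddp",   # 4x 1080 Ti with DDP (PL 0.10.0 API)
        limit_train_batches=0.1,     # suspected trigger (ruled out in the edit below)
    )
    trainer.fit(model, train_loader, val_loader)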

Unfortunately I don’t have ptrace capability on the machine, so I can’t use GDB etc. Instead, I printed the stack traces of all Python threads every 10 s from a debugging thread. Logs of the hang situation are here: https://gist.github.com/jonashaag/b74ae9fc9267bde2cecd35ae316232c0
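
For reference, such a debugging thread can be sketched roughly like this (a reconstruction, not the exact script that produced the gist):

    import sys
    import threading
    import time
    import traceback

    def start_stack_dumper(interval=10):
        """Periodically print the stack of every Python thread in this process."""
        def dump_forever():
            while True:
                time.sleep(interval)
                for thread_id, frame in sys._current_frames().items():
                    print(f"--- thread {thread_id} ---", flush=True)
                    traceback.print_stack(frame)
        threading.Thread(target=dump_forever, daemon=True).start()

The standard library’s faulthandler.dump_traceback_later(10, repeat=True) gives a similar periodic dump without a custom thread.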

I am currently training without limit_train_batches to see whether it’s due to that setting. EDIT: No, I can also reproduce the hang without limit_train_batches set.

Environment

* CUDA:
        - GPU:
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
        - available:         True
        - version:           11.0
* Packages:
        - numpy:             1.19.2
        - pyTorch_debug:     True
        - pyTorch_version:   1.8.0.dev20201028
        - pytorch-lightning: 0.10.0
        - tqdm:              4.51.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                -
        - processor:         x86_64
        - python:            3.7.8
        - version:           #88~16.04.1-Ubuntu SMP Wed Feb 12 04:19:15 UTC 2020

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 4
  • Comments: 27 (7 by maintainers)

Most upvoted comments

Great. I’m actually still getting this even with ddp_spawn and num_workers=0 (FYI everyone else)

Same issue here, even with num_workers=0 and ddp_spawn settings. Anyone got a fix?

The validation step seems to take much longer than training… did anyone find a solution to this issue?

I also get the same behavior: PL is stuck in an infinite loop trying to get a batch, but it never reaches the dataset, and I really don’t know what to do. This is on CPU for me.

It seems that the code is stuck in an infinite loop in PyTorch’s dataloader.py, around line 1147:

            # (in the multiprocessing data loader iterator) keep polling the
            # worker result queue until a batch arrives
            while True:
                success, data = self._try_get_data()
                if success:
                    return data

Setting num_workers=0 fixed it for me, but it’s still strange because I had used more workers before without issues.

I’ve had the same issue and managed to fix it by setting the DataLoader’s persistent_workers parameter to True. Without it, the workers are killed at the end of each epoch and recreated at the start of the next, which was so slow for me that training with 0 workers was faster. This happened not only at the start of each epoch but also when switching between training, validation and testing. With this option, it’s definitely worth having num_workers > 0 and pin_memory=True, since there is no more delay.
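
For reference, a minimal sketch of this kind of DataLoader configuration (the dataset and batch size are placeholders; persistent_workers requires PyTorch >= 1.7):

    from torch.utils.data import DataLoader

    train_loader = DataLoader(
        train_dataset,            # placeholder dataset
        batch_size=32,
        num_workers=8,
        pin_memory=True,
        persistent_workers=True,  # keep worker processes alive across epochs
    )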