pytorch-lightning: Multi GPU training (ddp) gets very slow when using list of tensors in Dataset
🐛 Bug
We are migrating to PyTorch Lightning from a custom implementation that previously used Torchbearer.
Our dataset stores a list of PyTorch tensors in memory, because the tensors all have different dimensions. After migrating to PyTorch Lightning, multi-GPU training slows down very significantly (training takes twice as long as before!).
To Reproduce
I built a repository with random data and a straightforward architecture that reproduces this both with PyTorch Lightning (minimal.py) and without it (custom.py).
The repository and further details are located here: https://github.com/mpaepper/pytorch_lightning_multi_gpu_speed_analysis
When training with PyTorch Lightning for only 10 epochs, it takes 105 seconds with one big PyTorch tensor (no list), but 310 seconds (3x slower) with a list of tensors. The data size and the model are exactly the same; the data is just stored differently. With my custom implementation, no such effect is observed (97-98 seconds, with or without lists).
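For context, here is a minimal sketch of the two storage modes being compared (the names, shapes and equal-sized samples are illustrative, not the exact code from the linked repo, which uses tensors of different dimensions):

```python
import torch
from torch.utils.data import Dataset

class RandomDataset(Dataset):
    """Same data, stored either as one contiguous tensor or as a list of per-sample tensors."""

    def __init__(self, num_samples=20000, feature_dim=128, use_list=False):
        data = torch.randn(num_samples, feature_dim)
        # use_list=True keeps num_samples individual tensors alive; under DDP each of
        # them can end up behind its own shared-memory file handle (see the ulimit note below).
        self.data = [data[i].clone() for i in range(num_samples)] if use_list else data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
```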
To run the PyTorch Lightning version use:
python minimal.py --gpus 4 # Baseline
python minimal.py --gpus 4 --use_list # Extremely slow
One important thing to note: with the list approach, every tensor of that list seems to end up behind its own file handle in shared memory, so you might need to increase your file-descriptor limit: ulimit -n 99999.
Could this be the cause, i.e. the DataLoader becomes very slow because it has to open so many files?
Is there a way around this?
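Two generic mitigations that might be relevant here (a sketch only, not verified against this repro): raising the file-descriptor limit from within Python instead of relying on ulimit -n, and switching PyTorch's tensor sharing strategy so shared tensors are passed via the filesystem rather than keeping one open descriptor per tensor.

```python
import resource
import torch.multiprocessing as mp

# Raise this process's soft file-descriptor limit up to its hard limit
# (the programmatic equivalent of `ulimit -n`).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Share tensors via the filesystem instead of keeping an open
# descriptor per tensor (the default on Linux is 'file_descriptor').
mp.set_sharing_strategy('file_system')
```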
Code sample
See https://github.com/mpaepper/pytorch_lightning_multi_gpu_speed_analysis
Expected behavior
I would expect the same dataset stored as a list of tensors to also train quickly.
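A possible workaround sketch (PackedListDataset is an illustrative name, and it assumes 1-D samples rather than the shapes from the repro): pack the variable-length samples into one flat tensor plus an offsets tensor, so only two tensors exist in memory instead of one per sample, while __getitem__ still returns individual variable-length samples.

```python
import torch
from torch.utils.data import Dataset

class PackedListDataset(Dataset):
    """Store variable-length 1-D samples as one flat tensor plus offsets."""

    def __init__(self, tensors):
        lengths = torch.tensor([t.numel() for t in tensors], dtype=torch.long)
        self.offsets = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
        self.flat = torch.cat(tensors)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, idx):
        start, end = self.offsets[idx].item(), self.offsets[idx + 1].item()
        return self.flat[start:end]
```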
Environment
- CUDA:
- GPU:
- GeForce RTX 2080 Ti
- GeForce RTX 2080 Ti
- GeForce RTX 2080 Ti
- GeForce RTX 2080 Ti
- available: True
- version: 10.1
- Packages:
- numpy: 1.16.4
- pyTorch_debug: False
- pyTorch_version: 1.4.0
- pytorch-lightning: 0.7.7-dev
- tensorboard: 1.14.0
- tqdm: 4.46.0
- System:
- OS: Linux
- architecture:
- 64bit
- processor: x86_64
- python: 3.7.3
- version: #100-Ubuntu SMP Wed Apr 22 20:32:56 UTC 2020
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 18 (9 by maintainers)
This is working well now. The only thing is that it's not using shared memory anymore, so the memory usage (not of the GPU, but of the system itself) is 4 times higher (the data is duplicated for each process). However, it was like this in my custom implementation as well, so I think this is expected.
Thank you for taking the time to fix this!
Yes, the dataloaders are the same. They are plain PyTorch dataloaders and use the same number of workers etc.
The issue seems to be that the distributed multiprocessing opens a file pointer for each tensor in the list, and accessing all those 20,000 file pointers degrades the performance.
This is also why you need to increase ulimit -n when running the list version.
You can see that in the list version, starting each epoch takes a long time, so it is the data loading that takes much longer.
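For anyone who wants to check this on their own machine, here is a quick Linux-only sketch (it assumes the default 'file_descriptor' sharing strategy; the exact counts will vary) that illustrates how moving a list of tensors into shared memory opens roughly one file handle per tensor:

```python
import os
import torch

def open_fds() -> int:
    # Count this process's currently open file descriptors (Linux only).
    return len(os.listdir('/proc/self/fd'))

tensors = [torch.randn(8) for _ in range(1000)]
print('open fds before share_memory_():', open_fds())
for t in tensors:
    t.share_memory_()  # each tensor gets its own shared-memory segment
print('open fds after  share_memory_():', open_fds())
```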