pytorch-lightning: Multi GPU training (ddp) gets very slow when using list of tensors in Dataset
🐛 Bug
We are migrating to PyTorch Lightning from a custom implementation that previously used Torchbearer.
Our dataset stores a list of PyTorch tensors in memory, because the tensors all have different dimensions. After migrating to PyTorch Lightning, multi-GPU training slows down very significantly (training takes twice as long as before!).
To Reproduce
I built a repository with random data and a straightforward architecture that reproduces this both with PyTorch Lightning (minimal.py) and without it (custom.py).
The repository and further details are located here: https://github.com/mpaepper/pytorch_lightning_multi_gpu_speed_analysis
When training with PyTorch Lightning for only 10 epochs, it takes 105 seconds with one big PyTorch tensor (no list), but 310 seconds (3x slower) with a list of tensors. The data size and the model are exactly the same; the data is just stored differently. With my custom implementation, no such effect is observed (97-98 seconds, with or without lists).
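For context, here is a minimal sketch of the two storage modes being compared (the names, shapes and equal-sized samples are illustrative, not the exact code from the linked repo, which uses tensors of different dimensions):

```python
import torch
from torch.utils.data import Dataset

class RandomDataset(Dataset):
    """Same data, stored either as one contiguous tensor or as a list of per-sample tensors."""

    def __init__(self, num_samples=20000, feature_dim=128, use_list=False):
        data = torch.randn(num_samples, feature_dim)
        # use_list=True keeps num_samples individual tensors alive; under DDP each of
        # them can end up behind its own shared-memory file handle (see the ulimit note below).
        self.data = [data[i].clone() for i in range(num_samples)] if use_list else data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
```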
To run the PyTorch Lightning version use:
python minimal.py --gpus 4 # Baseline
python minimal.py --gpus 4 --use_list # Extremely slow
One important thing to note: with the list approach, every tensor of that list seems to end up behind its own file handle in shared memory, so you might need to increase your file-descriptor limit: ulimit -n 99999.
Could this be the cause, i.e. the DataLoader becomes very slow because it has to open so many files?
Is there a way around this?
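Two generic mitigations that might be relevant here (a sketch only, not verified against this repro): raising the file-descriptor limit from within Python instead of relying on ulimit -n, and switching PyTorch's tensor sharing strategy so shared tensors are passed via the filesystem rather than keeping one open descriptor per tensor.

```python
import resource
import torch.multiprocessing as mp

# Raise this process's soft file-descriptor limit up to its hard limit
# (the programmatic equivalent of `ulimit -n`).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Share tensors via the filesystem instead of keeping an open
# descriptor per tensor (the default on Linux is 'file_descriptor').
mp.set_sharing_strategy('file_system')
```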
Code sample
See https://github.com/mpaepper/pytorch_lightning_multi_gpu_speed_analysis
Expected behavior
I would expect the same dataset stored as a list of tensors to also train quickly.
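A possible workaround sketch (PackedListDataset is an illustrative name, and it assumes 1-D samples rather than the shapes from the repro): pack the variable-length samples into one flat tensor plus an offsets tensor, so only two tensors exist in memory instead of one per sample, while __getitem__ still returns individual variable-length samples.

```python
import torch
from torch.utils.data import Dataset

class PackedListDataset(Dataset):
    """Store variable-length 1-D samples as one flat tensor plus offsets."""

    def __init__(self, tensors):
        lengths = torch.tensor([t.numel() for t in tensors], dtype=torch.long)
        self.offsets = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
        self.flat = torch.cat(tensors)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, idx):
        start, end = self.offsets[idx].item(), self.offsets[idx + 1].item()
        return self.flat[start:end]
```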
Environment
- CUDA:
- GPU:
- GeForce RTX 2080 Ti
- GeForce RTX 2080 Ti
- GeForce RTX 2080 Ti
- GeForce RTX 2080 Ti
- available: True
- version: 10.1
- Packages:
- numpy: 1.16.4
- pyTorch_debug: False
- pyTorch_version: 1.4.0
- pytorch-lightning: 0.7.7-dev
- tensorboard: 1.14.0
- tqdm: 4.46.0
- System:
- OS: Linux
- architecture:
- 64bit
- processor: x86_64
- python: 3.7.3
- version: #100-Ubuntu SMP Wed Apr 22 20:32:56 UTC 2020
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 18 (9 by maintainers)
This is working well now. The only thing is that it's not using shared memory anymore, so the memory usage (not of the GPU, but of the system itself) is 4 times higher (the data is duplicated for each process). However, it was like this in my custom implementation as well, so I think this is expected.
Thank you for taking the time to fix this!
Yes, the dataloaders are the same. They are plain PyTorch dataloaders and use the same number of workers etc.
The issue seems to be that the distributed multiprocessing opens a file pointer for each tensor in the list, and accessing all those 20,000 file pointers degrades the performance.
This is also why you need to increase ulimit -n when running the list version.
You can see that in the list version, starting each epoch takes a long time, so it is the data loading that takes much longer.
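For anyone who wants to check this on their own machine, here is a quick Linux-only sketch (it assumes the default 'file_descriptor' sharing strategy; the exact counts will vary) that illustrates how moving a list of tensors into shared memory opens roughly one file handle per tensor:

```python
import os
import torch

def open_fds() -> int:
    # Count this process's currently open file descriptors (Linux only).
    return len(os.listdir('/proc/self/fd'))

tensors = [torch.randn(8) for _ in range(1000)]
print('open fds before share_memory_():', open_fds())
for t in tensors:
    t.share_memory_()  # each tensor gets its own shared-memory segment
print('open fds after  share_memory_():', open_fds())
```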