pytorch-lightning: pl+wandb: Hanging during "cleaning up ddp environment" when using DDPSpawnPlugin + WandbLogger
🐛 Bug
When using an accelerator that basically uses a “spawn” start method for multiprocessing (rather than the Linux default “fork”), any program that actually spawns new workers (num_processes > 1) seems to hang upon cleanup.
Concretely, I’ve only seen this when:
- Accelerator is either `ddp_cpu` or `ddp_spawn`; AND
- WandbLogger is instantiated (and I guess used for training)
Please reproduce using the BoringModel
My model (ConstantMultiply) is more boring 😉
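For reference, a rough sketch in the spirit of my repro of the setup that hangs for me (the toy data, project name, and trainer options here are illustrative, not copied verbatim from the repo; this assumes the pl 1.x-era Trainer arguments):

```python
import torch
from torch import nn
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger


class ConstantMultiply(pl.LightningModule):
    """Toy module: learns a single scalar multiplier."""

    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        return self.scale * x

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def main():
    # Toy dataset: y = 3 * x.
    x = torch.randn(64, 1)
    dataset = torch.utils.data.TensorDataset(x, 3 * x)
    loader = torch.utils.data.DataLoader(dataset, batch_size=8)

    trainer = pl.Trainer(
        max_epochs=1,
        accelerator="ddp_cpu",   # spawn-based; same behavior with "ddp_spawn"
        num_processes=2,         # the hang only shows up with >1 process
        logger=WandbLogger(project="repro"),  # dropping this avoids the hang
    )
    trainer.fit(ConstantMultiply(), loader)
    # Training itself finishes; the process then hangs during
    # "cleaning up ddp environment".


if __name__ == "__main__":
    # Guard is required for the spawn start method.
    main()
```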
To Reproduce
Clone this repo (it’s small), and then run the example: https://github.com/EricCousineau-TRI/repro/tree/cae9aa31f07f90c4cfb3b908fe84107e102ab06f/python/wandb_pytorch_lightning_combo
git clone https://github.com/EricCousineau-TRI/repro
cd repro
git checkout cae9aa31f07f90c4cfb3b908fe84107e102ab06f
cd python/wandb_pytorch_lightning_combo
./isolate.sh ./setup.sh ./train_wandb_pl_main.py
Ignore the stuff about sweeps for now (I can make a less noisy dir if you want).
Expected behavior
It doesn’t freeze?
Environment
- PyTorch Version (e.g., 1.0): 1.7.1
- OS (e.g., Linux): Linux (Ubuntu 18.04)
- How you installed PyTorch: pip
- Build command you used (if compiling from source): N/A
- Python version: 3.6.9, 3.8.0
- CUDA/cuDNN version: N/A
- GPU models and configuration: N/A
- Any other relevant information: 😢
Additional context
It would be nice to have a “fork” version of DDP for CPU, so that we can more easily test things for the suggested mode of DDP for GPU (per the PL docs, at least as of a couple of days ago).
If I use `ddp`, colleagues who have only 1 GPU cannot test it, which hurts development, b/c the intended abstractions of pl break down 😿
(When trying with `num_processes=2, gpus=[0]`, it just reduces the number of workers, so then we don’t test those branches…)
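To make the coverage concern concrete, here is a rough pytest sketch of what I mean (my own illustration, not from the repo; `make_model_and_loader` is a hypothetical helper returning something like ConstantMultiply above plus a toy DataLoader):

```python
import pytest
import torch
import pytorch_lightning as pl


@pytest.mark.parametrize(
    "trainer_kwargs",
    [
        # Runs on any machine, but only via the spawn start method.
        dict(accelerator="ddp_cpu", num_processes=2),
        # The suggested mode for GPU; single-GPU machines skip it entirely,
        # so the multi-GPU branches go untested there.
        pytest.param(
            dict(accelerator="ddp", gpus=2),
            marks=pytest.mark.skipif(
                torch.cuda.device_count() < 2, reason="needs 2 GPUs"
            ),
        ),
    ],
)
def test_multi_process_training(trainer_kwargs):
    # Hypothetical helper: returns a toy LightningModule (e.g. ConstantMultiply
    # from the sketch above) and a small DataLoader.
    model, loader = make_model_and_loader()
    trainer = pl.Trainer(max_epochs=1, **trainer_kwargs)
    trainer.fit(model, loader)
```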
The interactions between wandb and pl are a bit non-trivial, esp. if we want to try things like sweeps. We can hack around it, but jeepers it feels like flailing when doing it on the full setup.
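For context, the sweep pattern we are trying to make work is roughly this (a sketch only; the sweep config, project name, and `make_model_and_loader` helper are illustrative, not copied from the repo):

```python
import wandb
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# Illustrative sweep config; the real one lives in the repo's sweep files.
sweep_config = {
    "method": "grid",
    "parameters": {"lr": {"values": [1e-2, 1e-3]}},
}


def train():
    # wandb.agent calls this once per trial; the trial's hyperparameters
    # show up in wandb.config after wandb.init().
    wandb.init()
    lr = wandb.config.lr
    # Hypothetical helper, as in the test sketch above.
    model, loader = make_model_and_loader(lr=lr)
    trainer = pl.Trainer(
        max_epochs=1,
        accelerator="ddp_cpu",  # spawn-based, so each trial hits the hang above
        num_processes=2,
        logger=WandbLogger(project="repro"),
    )
    trainer.fit(model, loader)


if __name__ == "__main__":
    sweep_id = wandb.sweep(sweep_config, project="repro")
    wandb.agent(sweep_id, function=train)
```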
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 26 (15 by maintainers)
Just a note that we are still actively working on it
This should now be solved with latest lightning master branch.
Yes, a new experimental handling of multiprocessing environments is coming up and is being tested. Hopefully it will be ready this month.
Still open and being worked on
Is this still being worked on?
I encountered the same, but it was b/c my system was not properly configured. My `~/.gitconfig` had an `lfs` section (a carryover from a prior install), but I did not have `git-lfs` installed. My solution was to remove the offending section in the config. You may want to consider commenting it out or running `git lfs uninstall`, but consider backing up your config first. (I version control mine.) Be sure to read the docs.
Confirmed - thanks!!! https://github.com/EricCousineau-TRI/repro/commit/9aef82f1
This issue has been automatically marked as stale because it hasn’t had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!