pytorch-lightning: pl+wandb: Hanging during "cleaning up ddp environment" when using DDPSpawnPlugin + WandbLogger
🐛 Bug
When using an accelerator that basically uses a “spawn” start method for multiprocessing (rather than the Linux default “fork”), any program that actually spawns new workers (num_processes > 1) seems to hang upon cleanup.
Concretely, I’ve only seen this when:
- Accelerator is either `ddp_cpu` or `ddp_spawn`; AND
- WandbLogger is instantiated (and I guess used for training)
Please reproduce using the BoringModel
My model (ConstantMultiply) is more boring 😉
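For reference, a rough sketch in the spirit of my repro of the setup that hangs for me (the toy data, project name, and trainer options here are illustrative, not copied verbatim from the repo; this assumes the pl 1.x-era Trainer arguments):

```python
import torch
from torch import nn
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger


class ConstantMultiply(pl.LightningModule):
    """Toy module: learns a single scalar multiplier."""

    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        return self.scale * x

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def main():
    # Toy dataset: y = 3 * x.
    x = torch.randn(64, 1)
    dataset = torch.utils.data.TensorDataset(x, 3 * x)
    loader = torch.utils.data.DataLoader(dataset, batch_size=8)

    trainer = pl.Trainer(
        max_epochs=1,
        accelerator="ddp_cpu",   # spawn-based; same behavior with "ddp_spawn"
        num_processes=2,         # the hang only shows up with >1 process
        logger=WandbLogger(project="repro"),  # dropping this avoids the hang
    )
    trainer.fit(ConstantMultiply(), loader)
    # Training itself finishes; the process then hangs during
    # "cleaning up ddp environment".


if __name__ == "__main__":
    # Guard is required for the spawn start method.
    main()
```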
To Reproduce
Clone this repo (it’s small), and then run the example: https://github.com/EricCousineau-TRI/repro/tree/cae9aa31f07f90c4cfb3b908fe84107e102ab06f/python/wandb_pytorch_lightning_combo
git clone https://github.com/EricCousineau-TRI/repro
cd repro
git checkout cae9aa31f07f90c4cfb3b908fe84107e102ab06f
cd python/wandb_pytorch_lightning_combo
./isolate.sh ./setup.sh ./train_wandb_pl_main.py
Ignore the stuff about sweeps for now (I can make a less noisy dir if you want).
Expected behavior
It doesn’t freeze?
Environment
- PyTorch Version (e.g., 1.0): 1.7.1
- OS (e.g., Linux): Linux (Ubuntu 18.04)
- How you installed PyTorch: pip
- Build command you used (if compiling from source): N/A
- Python version: 3.6.9, 3.8.0
- CUDA/cuDNN version: N/A
- GPU models and configuration: N/A
- Any other relevant information: 😢
Additional context
It would be nice to have a “fork” version of DDP for CPU, so that we can more easily test things for the suggested mode of DDP for GPU (per the PL docs, at least as of a couple of days ago).
If I use `ddp`, colleagues who have only 1 GPU cannot test it, which hurts development, b/c the intended abstractions of pl break down 😿
(When trying with `num_processes=2, gpus=[0]`, it just reduces the number of workers, so then we don’t test those branches…)
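To make the coverage concern concrete, here is a rough pytest sketch of what I mean (my own illustration, not from the repo; `make_model_and_loader` is a hypothetical helper returning something like ConstantMultiply above plus a toy DataLoader):

```python
import pytest
import torch
import pytorch_lightning as pl


@pytest.mark.parametrize(
    "trainer_kwargs",
    [
        # Runs on any machine, but only via the spawn start method.
        dict(accelerator="ddp_cpu", num_processes=2),
        # The suggested mode for GPU; single-GPU machines skip it entirely,
        # so the multi-GPU branches go untested there.
        pytest.param(
            dict(accelerator="ddp", gpus=2),
            marks=pytest.mark.skipif(
                torch.cuda.device_count() < 2, reason="needs 2 GPUs"
            ),
        ),
    ],
)
def test_multi_process_training(trainer_kwargs):
    # Hypothetical helper: returns a toy LightningModule (e.g. ConstantMultiply
    # from the sketch above) and a small DataLoader.
    model, loader = make_model_and_loader()
    trainer = pl.Trainer(max_epochs=1, **trainer_kwargs)
    trainer.fit(model, loader)
```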
The interactions between wandb and pl are a bit non-trivial, esp. if we want to try things like sweeps. We can hack around it, but jeepers it feels like flailing when doing it on the full setup.
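For context, the sweep pattern we are trying to make work is roughly this (a sketch only; the sweep config, project name, and `make_model_and_loader` helper are illustrative, not copied from the repo):

```python
import wandb
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# Illustrative sweep config; the real one lives in the repo's sweep files.
sweep_config = {
    "method": "grid",
    "parameters": {"lr": {"values": [1e-2, 1e-3]}},
}


def train():
    # wandb.agent calls this once per trial; the trial's hyperparameters
    # show up in wandb.config after wandb.init().
    wandb.init()
    lr = wandb.config.lr
    # Hypothetical helper, as in the test sketch above.
    model, loader = make_model_and_loader(lr=lr)
    trainer = pl.Trainer(
        max_epochs=1,
        accelerator="ddp_cpu",  # spawn-based, so each trial hits the hang above
        num_processes=2,
        logger=WandbLogger(project="repro"),
    )
    trainer.fit(model, loader)


if __name__ == "__main__":
    sweep_id = wandb.sweep(sweep_config, project="repro")
    wandb.agent(sweep_id, function=train)
```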
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 26 (15 by maintainers)
Just a note that we are still actively working on it
This should now be solved with latest lightning master branch.
Yes, a new experimental handling of multiprocessing environments is coming up and is being tested. Hopefully it will be ready this month.
Still open and being worked on
Is this still being worked on?
I encountered the same, but it was b/c my system was not properly configured. My `~/.gitconfig` had an `lfs` section (a carryover from a prior install), but I did not have `git-lfs` installed. My solution was to remove the offending section in the config. You may want to consider commenting it out or running `git lfs uninstall`, but consider backing up your config first. (I version control mine.) Be sure to read the docs.
Confirmed - thanks!!! https://github.com/EricCousineau-TRI/repro/commit/9aef82f1
This issue has been automatically marked as stale because it hasn’t had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!