pytorch-lightning: PyTorch Lightning ignores traditional WORLD_SIZE/RANK specifications in environment and doesn't document replacement
🐛 Bug
Standard torch.distributed environment variables (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) appear to be ignored or overridden by PyTorch Lightning, and the replacement mechanism isn't documented.
To Reproduce
$ MASTER_ADDR=192.168.1.3 MASTER_PORT=1234 WORLD_SIZE=2 RANK=0 python3 boring_model.py
Expected behavior
Should wait for the second job to connect to MASTER_PORT. Instead, it just starts training immediately. No combination of arguments and environment variables seems to change this behavior.
It appears that PL handles startup for distributed processing differently and internally overrides these environment variables. This works well for DDP on single nodes and (presumably) automates deployment under Slurm.
But I can't figure out from the documentation or from the samples how to arrange for the traditional PyTorch behavior in which all jobs are started up manually.
For comparison, this code behaves as expected:
$ cat > simple.py
import torch
print("init")
torch.distributed.init_process_group("gloo")
print("done", torch.distributed.get_rank(), torch.distributed.get_world_size())
$ MASTER_ADDR=192.168.1.3 MASTER_PORT=1234 WORLD_SIZE=2 RANK=0 python3 simple.py & sleep 3; MASTER_ADDR=192.168.1.3 MASTER_PORT=1234 WORLD_SIZE=2 RANK=1 python3 simple.py
That is, the two jobs rendezvous as expected and then exit.
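One possible workaround might be a custom cluster environment that simply reads the traditional variables back out of the environment. The sketch below is only a guess at the right shape: it assumes the ClusterEnvironment plugin API from recent releases (the method names here match ~1.2/1.3 and changed in later versions), and ExternalEnvironment is a hypothetical name:
$ cat > external_env.py
import os

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import ClusterEnvironment


class ExternalEnvironment(ClusterEnvironment):
    # Signal that all processes are started manually, outside of Lightning,
    # so the trainer should not spawn its own children.
    def creates_children(self) -> bool:
        return True

    def master_address(self) -> str:
        return os.environ["MASTER_ADDR"]

    def master_port(self) -> int:
        return int(os.environ["MASTER_PORT"])

    def world_size(self) -> int:
        return int(os.environ["WORLD_SIZE"])

    def set_world_size(self, size: int) -> None:
        pass  # keep the externally provided WORLD_SIZE

    def global_rank(self) -> int:
        return int(os.environ["RANK"])

    def set_global_rank(self, rank: int) -> None:
        pass  # keep the externally provided RANK

    def local_rank(self) -> int:
        return int(os.environ.get("LOCAL_RANK", "0"))

    def node_rank(self) -> int:
        return int(os.environ.get("NODE_RANK", "0"))


# Hand the environment to the trainer (assumed mechanism: the plugins argument):
# trainer = Trainer(accelerator="ddp", gpus=1, plugins=[ExternalEnvironment()])
Each process would then be launched by hand with MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK set, exactly as in the simple.py run above.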
Suggested Behavior
- document how the traditional PyTorch behavior can be reproduced (i.e., no calculations of nodes/size/… in DDP)
- perhaps provide some kind of command line flag that restores the traditional behavior (e.g., "--accelerator ddp_plain")
- maybe something else
Environment
- PyTorch Version (e.g., 1.0): 1.8.1
- OS (e.g., Linux): Ubuntu 20.04
- How you installed PyTorch (conda, pip, source): pip3
- Build command you used (if compiling from source):
- Python version: 3.8
- CUDA/cuDNN version: 11.1
- GPU models and configuration: 1080
- Any other relevant information:
Additional context
@tmbdev In Lightning you don't need to set the global rank, local rank and world size. They get computed automatically. You only have to launch the script on every node, set the NODE_RANK, and set the num_nodes trainer argument, like so:
NODE_RANK=1 MASTER_ADDR=192.168.1.3 MASTER_PORT=1234 python your_script.py
Apologies, this was introduced recently by me. We can change the name when we do the property/setter refactor. #6303
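For concreteness, a minimal sketch of that multi-node launch (this assumes the 1.x Trainer arguments accelerator="ddp", gpus and num_nodes; TinyModel is just a stand-in for any LightningModule, e.g. the boring model from the repro):
$ cat > your_script.py
import torch
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # Any scalar works as a loss for this smoke test.
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        return torch.utils.data.DataLoader(torch.randn(64, 32), batch_size=8)


# WORLD_SIZE is derived as num_nodes * gpus; only NODE_RANK (plus
# MASTER_ADDR/MASTER_PORT) has to be set per node.
trainer = pl.Trainer(accelerator="ddp", gpus=1, num_nodes=2, max_epochs=1)
trainer.fit(TinyModel())
$ NODE_RANK=0 MASTER_ADDR=192.168.1.3 MASTER_PORT=1234 python your_script.py   # on node 0
$ NODE_RANK=1 MASTER_ADDR=192.168.1.3 MASTER_PORT=1234 python your_script.py   # on node 1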
Hey @tmbdev,
Looking into this asap! Thanks for sharing a reproducible script 😃
By the way, you might want to demo Flash too in your examples: https://github.com/PyTorchLightning/lightning-flash
And would you mind joining the PyTorch Lightning Slack too: https://join.slack.com/t/pytorch-lightning/shared_invite/zt-f6bl2l0l-JYMK3tbAgAmGRrlNr00f1A
In Flash, we are building a pretty dope DataPipeline: https://lightning-flash.readthedocs.io/en/latest/general/data.html and would love to provide native support for WebDataset.
We will upstream everything to Lightning when stable 😃
Best, T.C
@awaelchli the creates_children naming is confusing. Many readers would assume it's the environment within Lightning that creates child processes, vs. processes being created before trainer.fit().
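For readers hitting the same confusion, a tiny sketch of the reading the comment argues for (assuming the ~1.3 hook name; later releases appear to have renamed it to creates_processes_externally, and LightningEnvironment is used here only as a convenient base):
from pytorch_lightning.plugins.environments import LightningEnvironment


class ManualLaunchEnvironment(LightningEnvironment):
    def creates_children(self) -> bool:
        # Despite the name, returning True does not mean "Lightning creates
        # child processes"; it means the processes already exist (created
        # before trainer.fit()), so Lightning should not spawn any itself.
        return True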