pytorch-lightning: PyTorch Lightning ignores traditional WORLD_SIZE/RANK specifications in environment and doesn't document replacement
🐛 Bug
Standard torch.distributed environment variables (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) appear to be ignored or overridden by PyTorch Lightning, and the replacement mechanism isn't documented.
To Reproduce
$ MASTER_ADDR=192.168.1.3 MASTER_PORT=1234 WORLD_SIZE=2 RANK=0 python3 boring_model.py
Expected behavior
Should wait for the second job to connect to MASTER_PORT. Instead, it just starts training immediately. No combination of arguments and environment variables seems to change this behavior.
It appears that PL handles startup for distributed processing differently and internally overrides these environment variables. This works well for DDP on single nodes and (presumably) automates deployment under Slurm.
But I can't figure out from the documentation or from the samples how to arrange for the traditional PyTorch behavior in which all jobs are started up manually.
For comparison, this code behaves as expected:
$ cat > simple.py
import torch
print("init")
torch.distributed.init_process_group("gloo")
print("done", torch.distributed.get_rank(), torch.distributed.get_world_size())
$ MASTER_ADDR=192.168.1.3 MASTER_PORT=1234 WORLD_SIZE=2 RANK=0 python3 simple.py & sleep 3; MASTER_ADDR=192.168.1.3 MASTER_PORT=1234 WORLD_SIZE=2 RANK=1 python3 simple.py
That is, the two jobs rendezvous as expected and then exit.
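One possible workaround might be a custom cluster environment that simply reads the traditional variables back out of the environment. The sketch below is only a guess at the right shape: it assumes the ClusterEnvironment plugin API from recent releases (the method names here match ~1.2/1.3 and changed in later versions), and ExternalEnvironment is a hypothetical name:
$ cat > external_env.py
import os

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import ClusterEnvironment


class ExternalEnvironment(ClusterEnvironment):
    # Signal that all processes are started manually, outside of Lightning,
    # so the trainer should not spawn its own children.
    def creates_children(self) -> bool:
        return True

    def master_address(self) -> str:
        return os.environ["MASTER_ADDR"]

    def master_port(self) -> int:
        return int(os.environ["MASTER_PORT"])

    def world_size(self) -> int:
        return int(os.environ["WORLD_SIZE"])

    def set_world_size(self, size: int) -> None:
        pass  # keep the externally provided WORLD_SIZE

    def global_rank(self) -> int:
        return int(os.environ["RANK"])

    def set_global_rank(self, rank: int) -> None:
        pass  # keep the externally provided RANK

    def local_rank(self) -> int:
        return int(os.environ.get("LOCAL_RANK", "0"))

    def node_rank(self) -> int:
        return int(os.environ.get("NODE_RANK", "0"))


# Hand the environment to the trainer (assumed mechanism: the plugins argument):
# trainer = Trainer(accelerator="ddp", gpus=1, plugins=[ExternalEnvironment()])
Each process would then be launched by hand with MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK set, exactly as in the simple.py run above.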
Suggested Behavior
- document how the traditional PyTorch behavior can be reproduced (i.e., no calculations of nodes/size/… in DDP)
- perhaps provide some kind of command line flag that restores the traditional behavior (e.g., "--accelerator ddp_plain")
- maybe something else
Environment
- PyTorch Version (e.g., 1.0): 1.8.1
- OS (e.g., Linux): Ubuntu 20.04
- How you installed PyTorch (conda, pip, source): pip3
- Build command you used (if compiling from source):
- Python version: 3.8
- CUDA/cuDNN version: 11.1
- GPU models and configuration: 1080
- Any other relevant information:
Additional context
@tmbdev In Lightning you don't need to set the global rank, local rank and world size. They get computed automatically. You only have to launch the script on every node, set the NODE_RANK, and set the num_nodes trainer argument, like so:
NODE_RANK=1 MASTER_ADDR=192.168.1.3 MASTER_PORT=1234 python your_script.py
Apologies, this was introduced recently by me. We can change the name when we do the property/setter refactor. #6303
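For concreteness, a minimal sketch of that multi-node launch (this assumes the 1.x Trainer arguments accelerator="ddp", gpus and num_nodes; TinyModel is just a stand-in for any LightningModule, e.g. the boring model from the repro):
$ cat > your_script.py
import torch
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # Any scalar works as a loss for this smoke test.
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        return torch.utils.data.DataLoader(torch.randn(64, 32), batch_size=8)


# WORLD_SIZE is derived as num_nodes * gpus; only NODE_RANK (plus
# MASTER_ADDR/MASTER_PORT) has to be set per node.
trainer = pl.Trainer(accelerator="ddp", gpus=1, num_nodes=2, max_epochs=1)
trainer.fit(TinyModel())
$ NODE_RANK=0 MASTER_ADDR=192.168.1.3 MASTER_PORT=1234 python your_script.py   # on node 0
$ NODE_RANK=1 MASTER_ADDR=192.168.1.3 MASTER_PORT=1234 python your_script.py   # on node 1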
Hey @tmbdev,
Looking into this asap! Thanks for sharing a reproducible script 😃
By the way, you might want to demo Flash too in your examples: https://github.com/PyTorchLightning/lightning-flash
And would you mind joining the PyTorch Lightning Slack too: https://join.slack.com/t/pytorch-lightning/shared_invite/zt-f6bl2l0l-JYMK3tbAgAmGRrlNr00f1A
In Flash, we are building a pretty dope DataPipeline: https://lightning-flash.readthedocs.io/en/latest/general/data.html and would love to provide native support for WebDataset.
We will upstream everything to Lightning when stable 😃
Best, T.C
@awaelchli the creates_children naming is confusing. Many readers would assume it's the environment within Lightning that creates child processes, vs. processes being created before trainer.fit().
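For readers hitting the same confusion, a tiny sketch of the reading the comment argues for (assuming the ~1.3 hook name; later releases appear to have renamed it to creates_processes_externally, and LightningEnvironment is used here only as a convenient base):
from pytorch_lightning.plugins.environments import LightningEnvironment


class ManualLaunchEnvironment(LightningEnvironment):
    def creates_children(self) -> bool:
        # Despite the name, returning True does not mean "Lightning creates
        # child processes"; it means the processes already exist (created
        # before trainer.fit()), so Lightning should not spawn any itself.
        return True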