ColossalAI: [BUG]: RuntimeError of "RANK" when running train.py of ResNet example on a single GPU

๐Ÿ› Describe the bug

I ran into a problem today when running the example with python train.py, as shown below:

/home/user/software/python/anaconda/anaconda3/envs/conda-general/bin/python /home/user/***/***
/ColossalAI-Examples/image/resnet/train.py
Traceback (most recent call last):
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py", line 210, in launch_from_torch
    rank = int(os.environ['RANK'])
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'RANK'

During handling of the above exception, another exception occurred:

...

RuntimeError: Could not find 'RANK' in the torch environment, visit https://www.colossalai.org/ for more information on launching with torch

Is this error due to the absence of the RANK environment variable on my Ubuntu system?

Environment

Python: 3.10

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 23 (10 by maintainers)

Most upvoted comments

torchrun still runs your script with the Python interpreter, but in a distributed manner. Distributed training requires launching multiple processes together, so we need something to spawn those processes and set the environment variables for inter-process communication. That is what torchrun or colossalai run does. If you simply run the script with python train.py, those variables are never set and the distributed communication network cannot be initialized (see the sketch below).
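
As a minimal sketch (not from the original thread): these are the standard variables torchrun exports for each worker, set by hand for a single-GPU debug run. The launch_from_torch(config=...) call assumes the signature from the version shown in the traceback above; the supported path is still torchrun or colossalai run.

import os

# What torchrun would normally export for worker 0 of a 1-process job.
# Setting them manually only makes sense for single-GPU debugging.
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

import colossalai

# launch_from_torch reads RANK etc. from os.environ, so it now succeeds.
colossalai.launch_from_torch(config={})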

I'm not familiar with PyCharm's debugger, but it's entirely possible to run the code in debug mode with VS Code. For single-GPU distributed training it behaves almost the same as running a single-process script with the Python interpreter (you can set breakpoints and use the debug console to interact with the script). Just use the following launch.json template in VS Code:

{
  "name": "train",
  "type": "python",
  "request": "launch",
  "module": "torch.distributed.run",  // This invokes torchrun
  "args": [
    // Command-line args go here
    "train.py",
  ],
  "console": "integratedTerminal",
  // Environment variables go here.
  // "env": {"CUDA_LAUNCH_BLOCKING": "1"}
}
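
Launching the module torch.distributed.run this way is equivalent to running python -m torch.distributed.run train.py (i.e. torchrun train.py) on the command line, so the same configuration works for debugging any entry script; train.py here stands in for whatever script you are debugging.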

Did you run the script with torchrun?

@songyuc I guess you mean torch.distributed.launch? They serve the same purpose and there's not much difference, but torch.distributed.launch is deprecated in recent PyTorch versions. So if you are using a recent version of PyTorch, you should use torchrun.
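
For example, python -m torch.distributed.launch train.py and torchrun train.py both spawn the worker processes and set RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT; the main practical difference is that torchrun passes the local rank through the LOCAL_RANK environment variable rather than a --local_rank command-line argument.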

You should use colossalai run or torchrun; they set the required environment variables internally.