ColossalAI: [BUG]: RuntimeError of "RANK" when running train.py of ResNet example on a single GPU

๐Ÿ› Describe the bug

I ran into a problem today when running the example with python train.py, as shown below:

/home/user/software/python/anaconda/anaconda3/envs/conda-general/bin/python /home/user/***/***
/ColossalAI-Examples/image/resnet/train.py
Traceback (most recent call last):
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py", line 210, in launch_from_torch
    rank = int(os.environ['RANK'])
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'RANK'

During handling of the above exception, another exception occurred:

...

RuntimeError: Could not find 'RANK' in the torch environment, visit https://www.colossalai.org/ for more information on launching with torch

Is this error due to the absence of the RANK environment variable on my Ubuntu system?

Environment

Python: 3.10

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 23 (10 by maintainers)

Most upvoted comments

torchrun still runs your script with the Python interpreter, but in a distributed manner. Distributed training requires launching multiple processes together, so we need something to spawn those processes and set the environment variables for inter-process communication. That is what torchrun or colossalai run does. If you simply run the script with python train.py, those variables are never set and the distributed communication network cannot be initialized (see the sketch below).
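
As a minimal sketch (not from the original thread): these are the standard variables torchrun exports for each worker, set by hand for a single-GPU debug run. The launch_from_torch(config=...) call assumes the signature from the version shown in the traceback above; the supported path is still torchrun or colossalai run.

import os

# What torchrun would normally export for worker 0 of a 1-process job.
# Setting them manually only makes sense for single-GPU debugging.
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

import colossalai

# launch_from_torch reads RANK etc. from os.environ, so it now succeeds.
colossalai.launch_from_torch(config={})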

I'm not familiar with PyCharm's debugger, but it's entirely possible to run the code in debug mode with VS Code. For single-GPU distributed training it behaves almost the same as running a single-process script with the Python interpreter (you can set breakpoints and use the debug console to interact with the script). Just use the following launch.json template in VS Code:

{
  "name": "train",
  "type": "python",
  "request": "launch",
  "module": "torch.distributed.run",  // This invokes torchrun
  "args": [
    // Command-line args go here
    "train.py",
  ],
  "console": "integratedTerminal",
  // Environment variables go here.
  // "env": {"CUDA_LAUNCH_BLOCKING": "1"}
}
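
Launching the module torch.distributed.run this way is equivalent to running python -m torch.distributed.run train.py (i.e. torchrun train.py) on the command line, so the same configuration works for debugging any entry script; train.py here stands in for whatever script you are debugging.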

Did you run the script with torchrun?

@songyuc I guess you mean torch.distributed.launch? They serve the same purpose and there's not much difference, but torch.distributed.launch is deprecated in recent PyTorch versions. So if you are using a recent version of PyTorch, you should use torchrun.
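
For example, python -m torch.distributed.launch train.py and torchrun train.py both spawn the worker processes and set RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT; the main practical difference is that torchrun passes the local rank through the LOCAL_RANK environment variable rather than a --local_rank command-line argument.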

You should use colossalai run or torchrun; they set the required environment variables internally.