ColossalAI: [BUG]: RuntimeError of "RANK" when running train.py of ResNet example on a single GPU
🐛 Describe the bug
I ran into an error today when launching the script with `python train.py`, as below:
/home/user/software/python/anaconda/anaconda3/envs/conda-general/bin/python /home/user/***/***
/ColossalAI-Examples/image/resnet/train.py
Traceback (most recent call last):
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py", line 210, in launch_from_torch
rank = int(os.environ['RANK'])
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/os.py", line 679, in __getitem__
raise KeyError(key) from None
KeyError: 'RANK'
During handling of the above exception, another exception occurred:
...
RuntimeError: Could not find 'RANK' in the torch environment, visit https://www.colossalai.org/ for more information on launching with torch
Is this error caused by the environment variable RANK not being set in my Ubuntu environment?
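Essentially, yes: `launch_from_torch` expects the variables that `torchrun` exports into each worker process, and a plain `python train.py` run has none of them. A small diagnostic sketch (the variable list matches what the traceback and the torchrun launcher use; the helper name is my own):

```python
import os

# Variables that torchrun exports for every worker process; ColossalAI's
# launch_from_torch() reads them (the KeyError above comes from 'RANK').
REQUIRED = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")

def missing_dist_vars(env=os.environ):
    """Return which torchrun-provided variables are absent from `env`."""
    return [v for v in REQUIRED if v not in env]

if __name__ == "__main__":
    missing = missing_dist_vars()
    if missing:
        print("Not launched by torchrun; missing:", ", ".join(missing))
```

Running this under a bare `python` invocation will typically report all five variables missing, which reproduces the condition behind the KeyError.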
Environment
Python: 3.10
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 23 (10 by maintainers)
`torchrun` uses `python` under the hood, but in a distributed manner. Distributed training requires launching multiple processes together, so we need something to spawn the processes and set up the environment for inter-process communication. That is what `torchrun` or `colossalai run` does. If we simply run the script with `python train.py`, the distributed communication network cannot be initialized.

I'm not familiar with PyCharm debugging, but it is entirely possible to run the code in debug mode with VS Code. For single-GPU distributed training, it behaves almost the same as running a single-process script with the Python interpreter (you can set breakpoints and use the debug console to interact with the script). Just use the following launch.json template in VS Code:
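The launch.json template itself was not captured in this thread. A minimal sketch of what such a configuration could look like, assuming the example's `train.py` path and the usual single-worker values for the torchrun environment variables (the name, program path, and port below are illustrative, not from the original comment):

```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Debug train.py (single GPU)",
            "type": "python",
            "request": "launch",
            "program": "${workspaceFolder}/image/resnet/train.py",
            "console": "integratedTerminal",
            "env": {
                "RANK": "0",
                "LOCAL_RANK": "0",
                "WORLD_SIZE": "1",
                "MASTER_ADDR": "127.0.0.1",
                "MASTER_PORT": "29500"
            }
        }
    ]
}
```

With `WORLD_SIZE` set to 1, the process group initializes as a single-rank job, so breakpoints and the debug console work as in any ordinary script.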
Did you run the script with torchrun?
@songyuc I guess you mean `torch.distributed.launch`? They serve the same purpose and there's not much difference, but `torch.distributed.launch` is being deprecated in recent PyTorch versions, so if you are on a recent version you should use `torchrun`. Either `colossalai run` or `torchrun` will set up the environment for you internally.