ignite: Unable to create DiskSaver when program launched with torch.distributed.launcher
🐛 Bug description
As mentioned in this issue in MONAI, I tried to run this tutorial code with `torch.distributed.launcher`. However, the program froze while instantiating the `CheckpointSaver`. The reason is that ignite's `DiskSaver` cannot be created when the program is launched with `torch.distributed.launcher` (I am using SLURM). I also noticed that it might be caused by calling `get_rank()` in the `one_rank_only` decorator, which is used in the definition of `DiskSaver`: https://github.com/pytorch/ignite/blob/d16d15efbbbfc476702e91f3ab2bc757b839be26/ignite/distributed/utils.py#L595

I also ran a simple experiment to verify this. I launched the following script with `srun python -m torch.distributed.launcher --nproc_per_node=4 --nnodes=1 script.py` and found that the program froze while creating the `DiskSaver`.
```python
import torch.distributed as dist
from ignite.handlers import DiskSaver
from argparse import ArgumentParser


def create_disk_saver(args):
    dist.init_process_group(backend='nccl', init_method='env://')
    if dist.get_rank() == 0:
        # The DiskSaver is created on rank 0 only; this is where the program freezes.
        print('building DiskSaver')
        disk_saver = DiskSaver(dirname='./runs/')
        print('DiskSaver built')
    dist.destroy_process_group()


def main():
    parser = ArgumentParser()
    # --local_rank is passed by torch.distributed.launcher
    parser.add_argument('--local_rank', type=int)
    args = parser.parse_args()
    create_disk_saver(args)


if __name__ == '__main__':
    main()
```
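For context, this looks like the classic pattern where a collective call is reached by only one rank. A minimal sketch of that pattern, using a plain `dist.barrier()` as a stand-in for whatever synchronization `DiskSaver`'s initialization performs (an assumption, not a claim about its internals), hangs in the same way:

```python
# Illustrative only: a collective op guarded by a rank check deadlocks,
# because rank 0 waits in the collective while the other ranks never enter it.
import torch.distributed as dist


def guarded_collective():
    dist.init_process_group(backend='nccl', init_method='env://')
    if dist.get_rank() == 0:
        # Only rank 0 reaches this collective call -> the whole job hangs here.
        dist.barrier()
    dist.destroy_process_group()
```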
I would appreciate it if you could fix this. I prefer launching the program with `torch.distributed.launcher` over the `ignite.distributed.Parallel` context manager, as it has fewer issues with the SLURM environment.
Environment
- PyTorch Version (e.g., 1.4): 1.8
- Ignite Version (e.g., 0.3.0): 0.4.4
- OS (e.g., Linux): Linux
- How you installed Ignite (`conda`, `pip`, source): pip
- Python version: 3.8
- Any other relevant information:
@sandylaker I tried a few runs on my company's cluster.

1 - using `srun` and `torch.distributed.launch` without `ignite.distributed.Parallel`: script, README

NOTE: I removed some mandatory options like `-J`, `-p`, `--mem`, etc. that are specific to my cluster's configuration.

2 - using `srun` without `torch.distributed.launch`, with `ignite.distributed.Parallel`: script, README

3 - using `srun` and `torch.distributed.launch` with `ignite.distributed.Parallel`: script, README
One script, both usages: on a computing node, use `torch.distributed.launch`; on the frontend, use `srun` (or `sbatch`).

HTH
@vfdev-5 Yes, that's what I mentioned when looking at the code a few days ago. However, you explained it better 😊

Parallel / sequential sections remain a tricky (and classical) thing in parallel computing. Having to manage the two behaviours (a collective call similar to a reduction, or a guard per processor) makes the code more complicated. An idea would be to have handlers defined only collectively; we would avoid the if clauses and it would be simpler.

That said, I don't know whether the bug label should be added to this issue.
One last thing: I didn't understand how `idist.sync()` would help; it doesn't remove the collective code section, does it?

You can do as you prefer, but using `ignite.distributed.Parallel` you would be able to use `torch.distributed.launch`, `torch.distributed.spawn`, SLURM, `xla` and `horovod` as well, with a single code base. Please have a look here: https://github.com/sdesrozis/why-ignite/tree/main/basics
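For reference, here is a minimal sketch of the `ignite.distributed.Parallel` pattern recommended above; the backend choice and the body of `training` are placeholder assumptions, see the linked why-ignite examples for complete versions:

```python
import ignite.distributed as idist


def training(local_rank, config):
    # Runs on every process; idist picks up rank/world size from the launcher.
    rank = idist.get_rank()
    print(f"Process {rank} (local rank {local_rank}) started with config: {config}")
    # ... build model, data loaders, trainer here ...


if __name__ == "__main__":
    config = {"epochs": 2}  # placeholder configuration
    # backend="nccl" is an assumption; other backends (gloo, xla-tpu, horovod) are also supported.
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training, config)
```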
We are currently finishing a blog article explaining how ignite can help with parallel computing.

HTH
@sandylaker could you please test this code with the nightly version: `pip install --pre pytorch-ignite`? I think it should raise this runtime error: https://github.com/pytorch/ignite/blob/d16d15efbbbfc476702e91f3ab2bc757b839be26/ignite/distributed/comp_models/native.py#L218-L222

In general, I think calling `srun python -m torch.distributed.launcher --nproc_per_node=4 --nnodes=1 script.py` is incorrect, as `srun` creates a new job with 1 proc per node while `torch.distributed.launcher` spawns 4 procs per node. What do you think?

Just looking at your code, it can't work if you create the `DiskSaver` in an if section restricted to only one process. It seems that `DiskSaver` needs a collective `__init__` call.
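Following that last point, a minimal sketch of the reproduction script adapted for a collective construction, under the assumption that `DiskSaver.__init__` must be called on every rank:

```python
import torch.distributed as dist
from ignite.handlers import DiskSaver


def create_disk_saver_collectively():
    dist.init_process_group(backend='nccl', init_method='env://')
    # Construct the DiskSaver on every rank (collective __init__) instead of
    # guarding it with a rank check; the one_rank_only decorator discussed above
    # should then restrict the actual disk writes to a single rank.
    disk_saver = DiskSaver(dirname='./runs/')
    if dist.get_rank() == 0:
        print('DiskSaver built')
    dist.destroy_process_group()
```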