ignite: distributed program hangs in SLURM
🐛 Bug description
Hi @vfdev-5 ,
We got an urgent issue from MONAI and Clara users: the distributed program hangs on the NVIDIA NSL-B platform, which is based on SLURM. You can reproduce the issue with this simple example: https://github.com/Project-MONAI/tutorials/blob/master/acceleration/distributed_training/unet_training_workflows.py

It hangs when creating the ignite Accuracy metric, which seems related to this line: https://github.com/pytorch/ignite/blob/v0.4.4.post1/ignite/distributed/comp_models/native.py#L107 After removing the Accuracy metric from the example, it hangs when training starts and hasn't timed out yet. Please note that this example runs successfully with ignite 0.4.2. We also tried the pure PyTorch dist example in the same hardware and software environment, and it runs successfully: https://github.com/Project-MONAI/tutorials/blob/master/acceleration/distributed_training/unet_training_ddp.py
Could you please help analyze the reason and give some advice? It blocks our cooperation with another team now.
Thanks in advance.
Environment
- PyTorch Version (e.g., 1.4): 1.8.1
- Ignite Version (e.g., 0.3.0): 0.4.4
- OS (e.g., Linux): Ubuntu 18.04
- How you installed Ignite (`conda`, `pip`, source): pip
About this issue
- State: closed
- Created 3 years ago
- Comments: 60 (21 by maintainers)
Hi @vfdev-5 and @sdesrozis,
Thanks for the prompt response! I removed the first `idist.sync` call and changed the second one to `idist.barrier`. Things are running fine now.
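For reference, a minimal sketch of that change (the surrounding training script is hypothetical; only the two `idist` calls are the ones discussed in this thread):

```python
import ignite.distributed as idist

# Previously there was an explicit idist.sync() here, which hung in this SLURM setup.
# idist.sync()

# Replaced with a plain barrier: every process simply waits here until all others arrive.
idist.barrier()
```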
@hw-ju Thanks for the report.
It seems that your environment contains some variables that are usually set by the PyTorch launcher. In the current ignite distributed module, SLURM and the PyTorch launcher are mutually exclusive, since srun serves the same purpose as the PyTorch launcher. That explains the raised error.
However, the conflicting variables are being set somewhere, and it's not easy to track down. We have to find which script does the setting.
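To help track down which script sets them, a small sketch like the following can be dropped into the srun/sbatch entrypoint; it just prints both families of variables (the standard PyTorch-launcher and SLURM names) on every process:

```python
import os

# Print PyTorch-launcher style variables next to the SLURM ones on each process,
# so conflicting or unexpected values are easy to spot in the job logs.
for var in (
    "RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT",
    "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS", "SLURM_JOB_NODELIST",
):
    print(f"{var}={os.environ.get(var, '<not set>')}")
```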
@Nic-Ma thanks for the feedback, glad to help !
I think an alternative to using `scontrol` still makes sense to add to our codebase, as we could hit a similar issue when using our own Docker images with SLURM.
Seems the issue has been fixed. @vfdev-5 @sdesrozis Thanks so much for your great help as usual!!!
Thanks for reporting that @YuanTingHsieh , looks like it can be an enhancement from our side !
It means the command `scontrol` is not available in your environment. This command is typically installed on machines running SLURM, so it is quite surprising; I don't know why it's missing here. Could you check whether `scontrol` is available? Thanks!

EDIT I suppose this is related to the container used to spawn your app on the cluster. I can't check that on my side… Anyway, we can handle this situation! The `scontrol` call is only used to select a hostname as `MASTER_ADDR`, and that can be done in another way.

SLURM is a workload manager: it controls how the machines are used, and it is dedicated to HPC / AI clusters. For the end user, `torch.distributed.launch` and `srun` may look equivalent, but in fact they are not. Some users run `torch.distributed.launch` without considering any limits other than the number of processes, whereas there are also memory, time, and CPU-per-node limits (not only GPU), among others. On a production SLURM cluster, everything has to be considered because you pay for the resources allocated (not what is actually consumed). Anyway, it depends on your usage, and `torch.distributed.launch` should work fine in most cases. Note that the `idist` abstraction keeps your code compatible with whichever launcher you choose without rewriting it; that was our concern. HTH
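As an illustration of the `idist` point above, a minimal launcher-agnostic sketch (the training function is just a placeholder): the same script can be started with `torch.distributed.launch` or with `srun` under SLURM, and `idist` resolves the configuration from the environment.

```python
import ignite.distributed as idist

def training(local_rank):
    # rank, world size and device are resolved by idist whatever the launcher is
    print(idist.get_rank(), idist.get_world_size(), idist.device())

if __name__ == "__main__":
    # backend choice is an example; nccl is typical for multi-GPU training
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training)
```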
@hw-ju I would say we can ignore it until we fix it. In the message I saw in our tests, it was a pretty straightforward assignment between rank and GPU… Can it go wrong? I think we could make it fail by assigning local ranks manually, but otherwise I hope it won't be an issue.
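For context, the straightforward assignment mentioned above is the usual one-process-per-GPU mapping indexed by the local rank; a sketch (not taken from the user's script) looks like this:

```python
import torch
import ignite.distributed as idist

# Process i on a given node drives GPU i; this is the mapping the warning is about.
local_rank = idist.get_local_rank()
torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")
```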
I think it is fine. With SLURM it is simpler to launch multi-node training. `torch.distributed.launch` prior to v1.9.0 was just a tool to launch N python processes locally and set up env variables, so for multi-node training the user had to call it from every node. Today, PyTorch is moving to elastic launch and things may be simplified: https://pytorch.org/docs/stable/elastic/run.html#launcher-api

By the way, if you prefer, we also have GH Discussions and issues with the Question label, and finally we also have Discord for questions. Thanks for the feedback! And also thanks a lot for your patience while debugging our code issue with hostnames!
@vfdev-5 Thanks for all the help!
Is it safe to ignore this warning? Can rank-to-GPU mapping go wrong?
Is `idist.barrier()` for all processes?
I'm not sure if it's appropriate to ask the question here, but what are the benefits of using `torch.distributed.launch` compared with a pure SLURM launch? I really appreciate idist's support for SLURM!! 🙂
@sdesrozis I confirm that `_expand_hostlist` is buggy: the host list it expands does not match the expected one.
@sdesrozis can we make our internal check a bit less strict: even if LOCAL_RANK, RANK and WORLD_SIZE are defined in the env vars, we can continue as long as they match the corresponding SLURM env vars (SLURM_LOCALID, SLURM_PROCID and SLURM_NTASKS), and raise the error otherwise. What do you think?
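A sketch of that relaxed check (the variable names are the ones listed above; the function name is hypothetical):

```python
import os

def _check_launcher_vs_slurm_env():
    # Tolerate pre-set PyTorch-launcher variables as long as they agree with SLURM's;
    # only raise when the two sources actually conflict.
    pairs = [("LOCAL_RANK", "SLURM_LOCALID"), ("RANK", "SLURM_PROCID"), ("WORLD_SIZE", "SLURM_NTASKS")]
    for torch_var, slurm_var in pairs:
        if torch_var in os.environ and os.environ[torch_var] != os.environ.get(slurm_var):
            raise RuntimeError(
                f"Conflicting distributed configuration: {torch_var}={os.environ[torch_var]} "
                f"does not match {slurm_var}={os.environ.get(slurm_var)}"
            )
```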
@hw-ju thanks a lot for the logs! Looks like when you execute the code on the workers with srun (inside the sbatch file), the "RANK", "LOCAL_RANK" and "WORLD_SIZE" env variables are somehow already defined. Do you know which process sets them or why they are already present? Ignite raises the issue because of that.
To make Ignite work we can do the following hacky solution:
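A plausible sketch of such a workaround, assuming it consists of dropping the pre-set launcher variables before ignite initializes so that the SLURM_* ones are used instead:

```python
import os

# Hacky workaround sketch (assumption): remove the conflicting variables that were
# already set in the environment, letting ignite read the SLURM_* variables instead.
for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE"):
    os.environ.pop(var, None)
```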
However, we would like to understand why, and in which other situations, these env vars are automatically provided when using SLURM, since by default SLURM does not set them…
Hi @YuanTingHsieh , yes, this is expected since it seems you are using both SLURM and `torch.distributed.launch`. We suggest using only SLURM as the process scheduler. Please see our blog post on distributed training made easy with SLURM: https://labs.quansight.org/blog/2021/06/distributed-made-easy-with-ignite/#with-slurm
cc @sdesrozis @fco-dv
`scontrol` is available, so I guess you are right that the container is somehow not picking it up.
I see! Let me have a try and get back to you soon.
Thanks for your help!