ignite: distributed program hangs in SLURM

šŸ› Bug description

Hi @vfdev-5 ,

We got an urgent issue from MONAI and Clara users: a distributed program hangs on the NVIDIA NSL-B platform, which is based on SLURM. You can reproduce the issue with this simple example: https://github.com/Project-MONAI/tutorials/blob/master/acceleration/distributed_training/unet_training_workflows.py

It hangs when creating the ignite Accuracy metric, which seems related to this line: https://github.com/pytorch/ignite/blob/v0.4.4.post1/ignite/distributed/comp_models/native.py#L107

After removing the Accuracy metric from the example, it hangs once training starts and never times out. Please note that this example runs successfully with ignite 0.4.2. We also tried the pure PyTorch dist example in the same hardware and software environment, and it runs successfully: https://github.com/Project-MONAI/tutorials/blob/master/acceleration/distributed_training/unet_training_ddp.py
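For reference, a stripped-down sketch of the failing pattern (the real tutorial script does much more; this assumes the hang already reproduces when the metric is constructed inside idist.Parallel, launched via srun):

import ignite.distributed as idist
from ignite.metrics import Accuracy


def training(local_rank):
    # In our runs the program already hangs around here, before any training
    # step, when the Accuracy metric is created under SLURM.
    metric = Accuracy()
    print(f"rank {idist.get_rank()}: Accuracy metric created", flush=True)


if __name__ == "__main__":
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training)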

Could you please help analyze the reason and give some advice? It blocks our cooperation with another team now.

Thanks in advance.

Environment

  • PyTorch Version (e.g., 1.4): 1.8.1
  • Ignite Version (e.g., 0.3.0): 0.4.4
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed Ignite (conda, pip, source): pip

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 60 (21 by maintainers)

Most upvoted comments

Hi @vfdev-5 and @sdesrozis,

Thanks for the prompt response! I removed the first idist.sync and changed the second one to idist.barrier. Things are running fine now.
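For anyone hitting the same thing, a minimal sketch of the kind of change that worked here (the actual training script is not shown in this thread, so the function below is only illustrative):

import ignite.distributed as idist


def training(local_rank):
    # before: idist.sync() was called here and again further down;
    # the first call is simply removed ...
    ...  # model / data loader setup goes here

    # ... and the second idist.sync() is replaced by an explicit barrier so that
    # all processes wait for each other before continuing.
    idist.barrier()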

@hw-ju Thanks for the report.

It seems that your environment contains some variables that are usually set by the PyTorch launcher. In the current ignite distributed module, SLURM and the PyTorch launcher are mutually exclusive, since srun plays the same role as the PyTorch launcher. That explains the raised error.

However, I would say that the conflicting variables are set somewhere and it is not easy to track down where. In any case, we have to find which script does the setting.
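To track that down, a small diagnostic sketch you could run with the same srun / sbatch setup, just to see which launcher-style variables are already set before ignite starts:

import os

# Print both the PyTorch-launcher style variables and their SLURM counterparts
# for the current process.
for name in ["RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT",
             "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS", "SLURM_JOB_NODELIST"]:
    print(f"{name}={os.environ.get(name, '<unset>')}")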

@Nic-Ma thanks for the feedback, glad to help!

Seems the issue has been fixed.

I think an alternative to using scontrol still makes sense to add to our codebase, as we could hit a similar issue when using our own docker images with SLURM.

Seems the issue has been fixed. @vfdev-5 @sdesrozis Thanks so much for your great help as usual!!!

Thanks for reporting that @YuanTingHsieh, looks like it can be an enhancement from our side!

It means the command scontrol is not available in your environment. This command is typically installed on machines running SLURM. Quite surprising; I don’t know why it’s missing here.

Could you check whether scontrol is available? Thanks!

EDIT I suppose this is related to the container used to spawn your app on the cluster. I can’t check that on my side… Anyway, we can handle this situation! The scontrol call is only there to select a hostname as MASTER_ADDR, and that can be done in another way, for example as sketched below.
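A hedged sketch of one such alternative (not ignite’s actual implementation): prefer scontrol when present, and fall back to a naive read of SLURM_JOB_NODELIST, which is already a usable hostname for a single, unbracketed node.

import os
import subprocess


def get_master_addr():
    # Use the first node of the SLURM allocation as MASTER_ADDR.
    nodelist = os.environ["SLURM_JOB_NODELIST"]
    try:
        out = subprocess.check_output(["scontrol", "show", "hostnames", nodelist], text=True)
        return out.splitlines()[0]
    except FileNotFoundError:
        # scontrol missing (e.g. inside a stripped-down container): only handle
        # the simple case of a plain, comma-separated nodelist without brackets.
        return nodelist.split(",")[0]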

SLURM is a workload manager, meaning you can control how your machines are used. It is dedicated to HPC / AI clusters. For the end user, torch.distributed.launch and srun may look equivalent, but in fact they are not. Some users run torch.distributed.launch without considering any limit other than the number of processes, but there are also limits on memory, time, CPUs per node (not only GPUs), and more. On a production SLURM cluster, everything has to be considered because you pay for the resources allocated (not for the resources actually consumed). Anyway, it depends on your usage, and torch.distributed.launch should work fine in most cases. Note that the idist abstraction lets you stay compatible with whichever launcher you want without rewriting code, which was our concern.

HTH

Is it safe to ignore this warning? Can rank-to-GPU mapping go wrong?

@hw-ju I would say we can ignore it until we fix it. In the messages I saw in our tests, it was a pretty straightforward assignment between rank and GPU… Can it go wrong? I think we could make it fail by assigning local ranks manually, but otherwise I hope it won’t be an issue.
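If you want to double-check on your side, a quick sanity-check sketch (run inside idist.Parallel) that compares the local rank with the CUDA device actually selected:

import torch
import ignite.distributed as idist


def check_rank_to_gpu(local_rank):
    # The mapping is fine if every process reports ok=True.
    expected = idist.get_local_rank()
    actual = torch.cuda.current_device()
    print(f"rank={idist.get_rank()} local_rank={expected} "
          f"cuda_device={actual} ok={expected == actual}", flush=True)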

I’m not sure if it’s appropriate to ask the question here, but what are the benefits of using torch.distributed.launch compared with pure SLURM launch?

I think it is fine. With SLURM it is simpler to launch multi-node training. Prior to v1.9.0, torch.distributed.launch was just a tool to launch N python processes locally and set up env variables; in that case, for multi-node training, the user had to call it from every node. Today, PyTorch is moving to the elastic launcher and things may get simpler: https://pytorch.org/docs/stable/elastic/run.html#launcher-api By the way, if you prefer, we also have GH Discussions and issues with the Question label, and we also have Discord for questions.

I really appreciate idist’s support for SLURM!! 😃

Thanks for the feedback! And also thanks a lot for your patience while debugging our code issue with hostnames!

@vfdev-5 Thanks for all the help!

Yes, this warning appears with the new version of NCCL; we’ll fix that as well.

Is it safe to ignore this warning? Can rank-to-GPU mapping go wrong?

Unfortunately, we cannot ignore it with any additional call, since we call barrier internally. And yes, you are right, idist.barrier() does not accept any kwargs. We can think about adding that if it also makes sense for other backends.

idist.barrier() is for all processes?

I’m not sure if it’s appropriate to ask the question here, but what are the benefits of using torch.distributed.launch compared with pure SLURM launch?
I really appreciate idist’s support for SLURM!! 😃

@sdesrozis I confirm that _expand_hostlist is buggy:

_expand_hostlist('c1001a-s[11,17]')
> ['s17', 's11']

vs

scontrol show hostnames 'c1001a-s[11,17]'
> c1001a-s11
> c1001a-s17
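For context, a simplified sketch of the expansion we would expect for this kind of nodelist (real SLURM hostlists can contain several bracket groups, so this only covers a single one):

import re


def expand_simple_hostlist(hostlist):
    # 'c1001a-s[11,17]' -> ['c1001a-s11', 'c1001a-s17']
    m = re.fullmatch(r"(?P<prefix>[^\[]+)\[(?P<items>[^\]]+)\](?P<suffix>.*)", hostlist)
    if m is None:
        return hostlist.split(",")
    hosts = []
    for item in m["items"].split(","):
        if "-" in item:  # a range such as '11-13'
            start, end = item.split("-")
            hosts.extend(f"{m['prefix']}{i:0{len(start)}d}{m['suffix']}"
                         for i in range(int(start), int(end) + 1))
        else:
            hosts.append(f"{m['prefix']}{item}{m['suffix']}")
    return hosts


print(expand_simple_hostlist("c1001a-s[11,17]"))
# ['c1001a-s11', 'c1001a-s17']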

@sdesrozis can we make our internal check a bit less strict: even if LOCAL_RANK, RANK and WORLD_SIZE are defined in the env vars, if they match the corresponding SLURM env vars (SLURM_LOCALID, SLURM_PROCID and SLURM_NTASKS) we can continue; otherwise we raise the error. What do you think?
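Something along these lines, as a sketch of the relaxed check (a proposal against the variables named above, not the current ignite code):

import os


def check_launcher_vs_slurm_env():
    # Tolerate pre-set launcher variables as long as they agree with SLURM;
    # only raise on a genuine mismatch.
    pairs = [("RANK", "SLURM_PROCID"),
             ("LOCAL_RANK", "SLURM_LOCALID"),
             ("WORLD_SIZE", "SLURM_NTASKS")]
    for torch_var, slurm_var in pairs:
        if torch_var in os.environ and os.environ[torch_var] != os.environ.get(slurm_var):
            raise RuntimeError(
                f"Env variable {torch_var}={os.environ[torch_var]} does not match "
                f"{slurm_var}={os.environ.get(slurm_var)}"
            )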

@hw-ju thanks a lot for the logs! It looks like when you execute the code on the workers with srun (inside the sbatch file), the "RANK", "LOCAL_RANK" and "WORLD_SIZE" env variables are somehow already defined. Do you know which process sets them or why they are already present? Ignite raises the error because of that.

To make Ignite work we can do the following hacky solution:

import os
import socket
import argparse
import ignite.distributed as idist
import ignite


def main_fn(_):

    hostname = socket.gethostname()

    # Print one rank at a time, separated by barriers, so the output stays ordered.
    for current in range(idist.get_world_size()):
        if idist.get_rank() == current:
            addr = f"http://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
            print(f"[{addr}] hello from [{hostname}:{idist.backend()}] "
                  f"process {idist.get_rank()}/{idist.get_world_size()}")
        idist.barrier()


if __name__ == "__main__":
    print(f"ignite version: {ignite.__version__}")

    # Remove existing env vars to make Ignite "happily" work:
    for e in ["RANK", "LOCAL_RANK", "WORLD_SIZE"]:
        if e in os.environ:        
            del os.environ[e]

    parser = argparse.ArgumentParser("single-node")
    parser.add_argument("--backend", type=str, default="nccl")
    args = parser.parse_args()

    with idist.Parallel(backend=args.backend) as parallel:
        parallel.run(main_fn)

However, we would like to understand why and where we could run into the same situation again: in your setup, these env vars are provided automatically when using SLURM, while by default SLURM does not do that…

Hi @YuanTingHsieh, yes this is expected, as it seems you are using SLURM and torch.distributed.launch together. We suggest using SLURM alone as the process scheduler:

- srun --nodes=2 python -m torch.distributed.launch --nproc_per_node=2 main.py
+ srun --nodes=2 --ntasks-per-node=2 python main.py

Please, see our blog post on distributed with SLURM: https://labs.quansight.org/blog/2021/06/distributed-made-easy-with-ignite/#with-slurm

cc @sdesrozis @fco-dv

scontrol is available; I guess you are right that the container is somehow not picking it up.

I see! Let me have a try and get back to you soon.

Thanks for your help!