ray: NCCL error with RaySGD single node training
- Can Ray automatically transfer tensors between CPU memory and GPU memory without explicitly calling tensor.cuda()?
- How does Ray do cross-GPU tensor communication? Can I use NCCL to do high-performance training?
- Can I split a model into multiple parts, have Ray schedule the parts onto different GPUs, and train the model automatically?
- I followed the example at https://docs.ray.io/en/master/auto_examples/plot_parameter_server.html and added the @ray.remote(num_gpus=1) decorator to the server and the workers (sketched below), but the throughput is quite low. How can I use Ray to do high-performance training of DL models (like AlexNet) with PyTorch?
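Roughly, a minimal sketch of that setup looks like the following (the actor names, model, and data are placeholders rather than the exact code from the linked example):

```python
import numpy as np
import ray
import torch
import torch.nn as nn

@ray.remote(num_gpus=1)
class ParameterServer:
    def __init__(self, lr):
        # The actor reserves a GPU, but the model still has to be moved there explicitly.
        self.model = nn.Linear(128, 10).cuda()
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=lr)

    def apply_gradients(self, *gradients):
        summed = [np.stack(grads).sum(axis=0) for grads in zip(*gradients)]
        self.optimizer.zero_grad()
        for p, g in zip(self.model.parameters(), summed):
            p.grad = torch.from_numpy(g).cuda()
        self.optimizer.step()
        return self.get_weights()

    def get_weights(self):
        return {k: v.cpu() for k, v in self.model.state_dict().items()}

@ray.remote(num_gpus=1)
class DataWorker:
    def __init__(self):
        self.model = nn.Linear(128, 10).cuda()

    def compute_gradients(self, weights):
        self.model.load_state_dict(weights)
        x = torch.randn(32, 128).cuda()
        y = torch.randint(0, 10, (32,)).cuda()
        loss = nn.functional.cross_entropy(self.model(x), y)
        self.model.zero_grad()
        loss.backward()
        # Gradients go back through the object store as CPU numpy arrays,
        # so there is no NCCL-style GPU-to-GPU communication here.
        return [p.grad.cpu().numpy() for p in self.model.parameters()]

ray.init()
ps = ParameterServer.remote(lr=0.01)
workers = [DataWorker.remote() for _ in range(2)]
weights = ray.get(ps.get_weights.remote())
for _ in range(10):
    grads = [w.compute_gradients.remote(weights) for w in workers]
    weights = ray.get(ps.apply_gradients.remote(*grads))
```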
Thanks a lot!
About this issue
- State: closed
- Created 3 years ago
- Comments: 32 (18 by maintainers)
OK, I found out the issue. It looks like if we use a public IP address in dist.init_process_group, things don't work so well. Specifically, the initialization hangs upon instantiation.
By commenting out L60 in sgd/torch/distributed_torch_runner.py, and then setting the following environment variables in init_hook:
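(The original snippet is not reproduced here; a rough sketch, assuming the point is to pin NCCL and Gloo to the node's private network interface, would be:)

```python
import os

def init_hook():
    # Illustrative values only: keep NCCL/Gloo off the public interface
    # by naming the node's private network interface.
    os.environ["NCCL_SOCKET_IFNAME"] = "ens3"   # interface name is machine-specific
    os.environ["GLOO_SOCKET_IFNAME"] = "ens3"
    os.environ["NCCL_DEBUG"] = "INFO"           # optional: surfaces why init hangs
```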
and then calling init_process_group manually:
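(Again a sketch with placeholder values rather than the original snippet; the key point is to rendezvous on the head node's private IP:)

```python
import os
import torch.distributed as dist

rank = int(os.environ["RANK"])            # set per worker
world_size = int(os.environ["WORLD_SIZE"])

# Use the head node's *private* address here; with a public IP the
# rendezvous hangs inside init_process_group (the address is a placeholder).
dist.init_process_group(
    backend="nccl",
    init_method="tcp://10.0.0.1:29500",
    rank=rank,
    world_size=world_size,
)
```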
I was able to get the performance between DDP and RaySGD to match. Note that:
cc @amogkam @matthewdeng, we'll want to check for this for v2.
Hey @JF-D,
We have been developing a new version of Ray SGD with a cleaner API, which you can see here: https://docs.ray.io/en/master/raysgd/v2/raysgd.html. This NCCL GPU issue is fixed in Ray SGD v2, so I would recommend you try it out. To easily migrate from Ray SGD v1 to v2, you can check out this guide: https://docs.ray.io/en/master/raysgd/v2/migration-guide.html.
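For a flavor of the v2 API, a minimal sketch looks roughly like this (the import path and exact signatures may differ across Ray versions; treat the linked docs as authoritative):

```python
import torch
import torch.nn as nn
from ray.util.sgd.v2 import Trainer  # v2 import path at the time; may differ by Ray version

def train_func():
    # Each distributed worker runs this; the NCCL/Gloo process group is
    # set up by the Trainer before this function is called.
    device = torch.device("cuda")
    model = nn.Linear(128, 10).to(device)
    model = nn.parallel.DistributedDataParallel(
        model, device_ids=[torch.cuda.current_device()]
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(10):
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

trainer = Trainer(backend="torch", num_workers=2, use_gpu=True)
trainer.start()
trainer.run(train_func)
trainer.shutdown()
```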
Let me know if you have questions; I am happy to help!
@JF-D thanks for your time; I’ll try to repro!
yep! something along those lines - it's still not public!