ray: NCCL error with RaySGD single node training
- Can Ray automatically transfer tensors between CPU memory and GPU memory without explicitly calling tensor.cuda()?
- How does Ray do cross-GPU tensor communication? Can I use NCCL to do high-performance training?
- Can I split a model into multiple parts, have Ray schedule the parts onto different GPUs, and train the model automatically?
- I followed the example at https://docs.ray.io/en/master/auto_examples/plot_parameter_server.html and added the @ray.remote(num_gpus=1) decorator to the server and the workers (sketched below), but the throughput is quite low. How can I use Ray to do high-performance training of DL models (like AlexNet) with PyTorch?
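Roughly, a minimal sketch of that setup looks like the following (the actor names, model, and data are placeholders rather than the exact code from the linked example):

```python
import numpy as np
import ray
import torch
import torch.nn as nn

@ray.remote(num_gpus=1)
class ParameterServer:
    def __init__(self, lr):
        # The actor reserves a GPU, but the model still has to be moved there explicitly.
        self.model = nn.Linear(128, 10).cuda()
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=lr)

    def apply_gradients(self, *gradients):
        summed = [np.stack(grads).sum(axis=0) for grads in zip(*gradients)]
        self.optimizer.zero_grad()
        for p, g in zip(self.model.parameters(), summed):
            p.grad = torch.from_numpy(g).cuda()
        self.optimizer.step()
        return self.get_weights()

    def get_weights(self):
        return {k: v.cpu() for k, v in self.model.state_dict().items()}

@ray.remote(num_gpus=1)
class DataWorker:
    def __init__(self):
        self.model = nn.Linear(128, 10).cuda()

    def compute_gradients(self, weights):
        self.model.load_state_dict(weights)
        x = torch.randn(32, 128).cuda()
        y = torch.randint(0, 10, (32,)).cuda()
        loss = nn.functional.cross_entropy(self.model(x), y)
        self.model.zero_grad()
        loss.backward()
        # Gradients go back through the object store as CPU numpy arrays,
        # so there is no NCCL-style GPU-to-GPU communication here.
        return [p.grad.cpu().numpy() for p in self.model.parameters()]

ray.init()
ps = ParameterServer.remote(lr=0.01)
workers = [DataWorker.remote() for _ in range(2)]
weights = ray.get(ps.get_weights.remote())
for _ in range(10):
    grads = [w.compute_gradients.remote(weights) for w in workers]
    weights = ray.get(ps.apply_gradients.remote(*grads))
```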
Thanks a lot!
About this issue
- State: closed
- Created 3 years ago
- Comments: 32 (18 by maintainers)
OK, I found out the issue. It looks like if we use a public IP address in dist.init_process_group, things don't work so well. Specifically, the initialization hangs upon instantiation.
By commenting out L60 in sgd/torch/distributed_torch_runner.py, and then setting the following environment variables in init_hook:
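(The original snippet is not reproduced here; a rough sketch, assuming the point is to pin NCCL and Gloo to the node's private network interface, would be:)

```python
import os

def init_hook():
    # Illustrative values only: keep NCCL/Gloo off the public interface
    # by naming the node's private network interface.
    os.environ["NCCL_SOCKET_IFNAME"] = "ens3"   # interface name is machine-specific
    os.environ["GLOO_SOCKET_IFNAME"] = "ens3"
    os.environ["NCCL_DEBUG"] = "INFO"           # optional: surfaces why init hangs
```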
and then calling init_process_group manually:
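(Again a sketch with placeholder values rather than the original snippet; the key point is to rendezvous on the head node's private IP:)

```python
import os
import torch.distributed as dist

rank = int(os.environ["RANK"])            # set per worker
world_size = int(os.environ["WORLD_SIZE"])

# Use the head node's *private* address here; with a public IP the
# rendezvous hangs inside init_process_group (the address is a placeholder).
dist.init_process_group(
    backend="nccl",
    init_method="tcp://10.0.0.1:29500",
    rank=rank,
    world_size=world_size,
)
```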
I was able to get the performance between DDP and RaySGD to match. Note that:
cc @amogkam @matthewdeng, we'll want to check for this for v2.
Hey @JF-D,
We have been developing a new version of Ray SGD with a cleaner API, which you can see here: https://docs.ray.io/en/master/raysgd/v2/raysgd.html. This NCCL GPU issue is fixed in Ray SGD v2, so I would recommend you try it out. To easily migrate from Ray SGD v1 to v2, you can check out this guide: https://docs.ray.io/en/master/raysgd/v2/migration-guide.html.
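For a flavor of the v2 API, a minimal sketch looks roughly like this (the import path and exact signatures may differ across Ray versions; treat the linked docs as authoritative):

```python
import torch
import torch.nn as nn
from ray.util.sgd.v2 import Trainer  # v2 import path at the time; may differ by Ray version

def train_func():
    # Each distributed worker runs this; the NCCL/Gloo process group is
    # set up by the Trainer before this function is called.
    device = torch.device("cuda")
    model = nn.Linear(128, 10).to(device)
    model = nn.parallel.DistributedDataParallel(
        model, device_ids=[torch.cuda.current_device()]
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(10):
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

trainer = Trainer(backend="torch", num_workers=2, use_gpu=True)
trainer.start()
trainer.run(train_func)
trainer.shutdown()
```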
Let me know if you have questions; I am happy to help!
@JF-D thanks for your time; I’ll try to repro!
yep! something along those lines - it's still not public!