dask-cuda: [BUG] Difference between MNMG and SNMG.
Background
We are running into an NCCL error (https://github.com/dmlc/xgboost/issues/7019) that’s only reproducible by running 2 dask-cuda-worker processes on 2 different GPUs. These 2 GPUs can be on the same node or on different nodes. In my test case, it’s a GTX 1080 Ti and a GTX 1080 attached to the same node.
Basically, launching all GPUs with a single dask-cuda-worker works:
dask-cuda-worker --scheduler-file="sched.json"
Similarly, using LocalCUDACluster also works fine. However, launching the 2 GPUs with separate dask-cuda-worker processes causes an NCCL initialization failure:
CUDA_VISIBLE_DEVICES=0 dask-cuda-worker --scheduler-file="sched.json"
CUDA_VISIBLE_DEVICES=1 dask-cuda-worker --scheduler-file="sched.json"
Running the worker under cuda-memcheck, the error message starts with:
========= Program hit cudaErrorPeerAccessUnsupported (error 217) due to "peer access is not supported between these two devices" on CUDA API call to cudaIpcOpenMemHandle.
The error is not reproducible on some regular clusters, e.g. 2x RTX 8000 with NVLink or 2x V100 (tested with both SNMG and MNMG). Maybe @pseudotensor can provide more context here.
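Not part of the original report, but a quick way to double-check what the CUDA runtime thinks about peer access between the visible GPUs is to query it directly; the sketch below assumes CuPy is installed (as in a typical RAPIDS environment):

import cupy.cuda.runtime as rt

# Query every ordered pair of visible devices. On the 1080 Ti + 1080 machine
# this is expected to report 0 (no peer access), consistent with the
# cudaErrorPeerAccessUnsupported above.
n = rt.getDeviceCount()
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"GPU {i} -> GPU {j}: deviceCanAccessPeer = {rt.deviceCanAccessPeer(i, j)}")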
Version
dask-cuda version: dask-cuda=0.19.0=py37_0
Test
Since the failure happens during NCCL initialization, using XGBoost to run the test is the simplest way to confirm the error, as it already has the whole pipeline in place.
- On a system with 2 GPUs:
dask-scheduler --scheduler-file="sched.json"
CUDA_VISIBLE_DEVICES=0 cuda-memcheck dask-cuda-worker --scheduler-file="sched.json"
CUDA_VISIBLE_DEVICES=1 cuda-memcheck dask-cuda-worker --scheduler-file="sched.json"
cuda-memcheck is optional, but it surfaces the root CUDA error; without it, NCCL only reports an unhandled system error.
- Then run this script:
import xgboost as xgb
from dask_cuda import LocalCUDACluster
from distributed import Client
from dask import dataframe as dd
from sklearn.datasets import load_digits
import dask_cudf


def test_empty_dataset(client):
    X_, y_ = load_digits(n_class=2, return_X_y=True)
    chunksize = X_.shape[0] // 10
    X = dd.from_array(X_, chunksize=chunksize)
    y = dd.from_array(y_, chunksize=chunksize)

    clf = xgb.dask.DaskXGBClassifier(tree_method="gpu_hist")
    clf.client = client

    # valid_X/valid_y are built the same way but are not used by fit() below.
    valid_X = dd.from_array(X_, chunksize=chunksize).repartition(npartitions=1)
    valid_y = dd.from_array(y_, chunksize=chunksize).repartition(npartitions=1)

    # Convert the CPU dask collections to GPU-backed dask_cudf collections.
    X = dask_cudf.from_dask_dataframe(X)
    y = dask_cudf.from_dask_dataframe(y)
    valid_X = dask_cudf.from_dask_dataframe(valid_X)
    valid_y = dask_cudf.from_dask_dataframe(valid_y)

    clf.fit(X, y)


def snmg():
    # Single node, multi GPU: workers are spawned by LocalCUDACluster.
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            test_empty_dataset(client)


def mnmg():
    # Multi node, multi GPU: connect to the scheduler and the separately
    # launched dask-cuda-worker processes from the Test section.
    with Client(scheduler_file="sched.json") as client:
        test_empty_dataset(client)


if __name__ == "__main__":
    # snmg()
    mnmg()
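With the two separately launched workers from the Test section, mnmg() reproduces the NCCL failure, while uncommenting snmg() runs the same workload on a LocalCUDACluster and finishes without error.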
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 22 (21 by maintainers)
Thanks for all the help, @pentschev! I will close this one now since it’s not related to dask-cuda.
What I heard from NCCL developers is that there’s no guarantee cudaIpcOpenMemHandle will succeed if the process doesn’t know the index of the device on the other end, which seems to confirm my suspicion. Ideas for further testing (example launch commands below):
- Run with NCCL_DEBUG=INFO, so we can send the logs to the NCCL devs and they can check whether something can be done.
- Run with NCCL_P2P_DISABLE=1.
I have tried both CUDA 11.2 and 11.0.
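A sketch of how those two suggestions could be applied to the worker launch commands from the Test section (NCCL_DEBUG and NCCL_P2P_DISABLE are standard NCCL environment variables; everything else mirrors the commands above):
CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO dask-cuda-worker --scheduler-file="sched.json"
CUDA_VISIBLE_DEVICES=1 NCCL_DEBUG=INFO dask-cuda-worker --scheduler-file="sched.json"
and, to rule out peer-to-peer transport entirely:
CUDA_VISIBLE_DEVICES=0 NCCL_P2P_DISABLE=1 dask-cuda-worker --scheduler-file="sched.json"
CUDA_VISIBLE_DEVICES=1 NCCL_P2P_DISABLE=1 dask-cuda-worker --scheduler-file="sched.json"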
It shouldn’t be XGBoost-specific; it’s an NCCL error, and XGBoost just happens to use NCCL, which is why I generated this minimal reproducible example with it.