dask-cuda: [BUG] Difference between MNMG and SNMG.

Background

We are running into an NCCL error (https://github.com/dmlc/xgboost/issues/7019) that is only reproducible by running 2 dask-cuda-worker processes on 2 different GPUs. The 2 GPUs can be on the same node or on different nodes. In my test case it's a 1080 Ti + 1080 attached to the same node.

Basically, launching all GPUs from a single dask-cuda-worker invocation works:

dask-cuda-worker --scheduler-file="sched.json"

Similarly, using LocalCUDACluster also works fine. However, launching the 2 GPUs separately, with one dask-cuda-worker per device, causes an NCCL initialization failure:

CUDA_VISIBLE_DEVICES=0 dask-cuda-worker --scheduler-file="sched.json"
CUDA_VISIBLE_DEVICES=1 dask-cuda-worker --scheduler-file="sched.json"

Running the worker under cuda-memcheck, the error message starts with:

========= Program hit cudaErrorPeerAccessUnsupported (error 217) due to "peer access is not supported between these two devices" on CUDA API call to cudaIpcOpenMemHandle.
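
In other words, the two GPUs apparently cannot do CUDA peer-to-peer/IPC with each other. A quick, dask-independent way to check this (assuming nvidia-smi is available on the node) is to look at the link type between the devices:

# Links reported as PHB/SYS (traffic crossing the host/PCIe bridge) usually do
# not support peer access, while PIX or NV# links usually do.
nvidia-smi topo -m

If the CUDA samples are installed, p2pBandwidthLatencyTest reports the same capability at the CUDA API level.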

The error is not reproducible on some regular clusters, like 2x GTX8000 with NVLink or 2x V100 (tested with both SNMG and MNMG). Maybe @pseudotensor can provide more context here.

Version

dask-cuda version: dask-cuda=0.19.0=py37_0

Test

Since the failure happens during NCCL initialization, using XGBoost for the test is a simple way to confirm the error, as it already has the whole pipeline in place.

  1. On a system with 2 GPUs:
dask-scheduler --scheduler-file="sched.json"

CUDA_VISIBLE_DEVICES=0 cuda-memcheck dask-cuda-worker --scheduler-file="sched.json"
CUDA_VISIBLE_DEVICES=1 cuda-memcheck dask-cuda-worker --scheduler-file="sched.json"

Using cuda-memcheck is optional but reveals the underlying CUDA error; without it, NCCL only reports an unhandled system error.

  2. Then run this script (the full run order is sketched after it):
import xgboost as xgb
from dask_cuda import LocalCUDACluster
from distributed import Client
from dask import dataframe as dd
from sklearn.datasets import load_digits
import dask_cudf


def test_empty_dataset(client):
    X_, y_ = load_digits(n_class=2, return_X_y=True)
    chunksize = X_.shape[0] // 10
    X = dd.from_array(X_, chunksize=chunksize)
    y = dd.from_array(y_, chunksize=chunksize)

    # gpu_hist forces GPU training, which triggers NCCL initialization
    # across the workers.
    clf = xgb.dask.DaskXGBClassifier(tree_method="gpu_hist")
    clf.client = client

    # Single-partition validation data (constructed but not passed to fit here).
    valid_X = dd.from_array(X_, chunksize=chunksize).repartition(npartitions=1)
    valid_y = dd.from_array(y_, chunksize=chunksize).repartition(npartitions=1)

    # Move the data to GPU via dask_cudf.
    X = dask_cudf.from_dask_dataframe(X)
    y = dask_cudf.from_dask_dataframe(y)

    valid_X = dask_cudf.from_dask_dataframe(valid_X)
    valid_y = dask_cudf.from_dask_dataframe(valid_y)

    # NCCL is initialized during fit(); this is where the failure shows up.
    clf.fit(X, y)


def snmg():
    # Single-node multi-GPU: LocalCUDACluster manages all GPUs in one process
    # tree; this path works.
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            test_empty_dataset(client)


def mnmg():
    # Connect to the externally launched scheduler/workers; with one
    # dask-cuda-worker per GPU this path fails.
    with Client(scheduler_file="sched.json") as client:
        test_empty_dataset(client)


if __name__ == "__main__":
    # snmg()
    mnmg()
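
Putting it together, the end-to-end run order looks roughly like this (repro.py is just an illustrative name for the script above):

# terminal 1: scheduler
dask-scheduler --scheduler-file="sched.json"

# terminals 2 and 3: one worker pinned to each GPU (cuda-memcheck optional)
CUDA_VISIBLE_DEVICES=0 cuda-memcheck dask-cuda-worker --scheduler-file="sched.json"
CUDA_VISIBLE_DEVICES=1 cuda-memcheck dask-cuda-worker --scheduler-file="sched.json"

# terminal 4: the client script
python repro.py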

Most upvoted comments

Thanks for all the help @pentschev! I will close this one now since it’s not related to dask-cuda.

What I heard from NCCL developers is that there is no guarantee cudaIpcOpenMemHandle will succeed if the process doesn’t know the index of the device on the other end, which seems to confirm my suspicion.
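
For context, that matches the difference between the two launch modes. As far as I understand dask-cuda's behavior (a sketch of the idea, not an exact trace), a single dask-cuda-worker invocation gives every spawned worker the full device list, just rotated, while the failing setup hides the peer device entirely:

# Single dask-cuda-worker invocation (works): every spawned worker still sees
# both devices, with the device list rotated per worker, roughly:
#   worker pinned to GPU 0: CUDA_VISIBLE_DEVICES=0,1
#   worker pinned to GPU 1: CUDA_VISIBLE_DEVICES=1,0
#
# Two separate invocations (fails): each worker sees exactly one device
#   (CUDA_VISIBLE_DEVICES=0 or CUDA_VISIBLE_DEVICES=1), so there is no device
#   index it could use for the peer GPU when opening the IPC handle.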

Ideas for further testing (launch commands sketched after the list):

  1. Run with NCCL_DEBUG=INFO; we can send the logs to the NCCL developers so they can check whether something can be done;
  2. Try with NCCL_P2P_DISABLE=1.
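
Applied to the two-worker launch from the Test section, those would look roughly like:

# (1) collect NCCL logs to pass on to the NCCL developers
CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO dask-cuda-worker --scheduler-file="sched.json"
CUDA_VISIBLE_DEVICES=1 NCCL_DEBUG=INFO dask-cuda-worker --scheduler-file="sched.json"

# (2) check whether disabling NCCL's peer-to-peer transport avoids the failure
CUDA_VISIBLE_DEVICES=0 NCCL_P2P_DISABLE=1 dask-cuda-worker --scheduler-file="sched.json"
CUDA_VISIBLE_DEVICES=1 NCCL_P2P_DISABLE=1 dask-cuda-worker --scheduler-file="sched.json"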

I have tried both CUDA 11.2 and 11.0.

Regarding “We don’t test xgboost with dask-cuda currently”:

It shouldn’t be XGBoost-specific: it’s an NCCL error, and XGBoost just happens to use NCCL, which is why I put together this minimal reproducible example with it.