xgboost: dask multinode: /workspace/xgboost/src/tree/updater_gpu_hist.cu:794: Exception in gpu_hist: NCCL failure :unhandled system error /workspace/xgboost/src

@trivialfis We finished upgrading to the latest Python 3.8 + RAPIDS 0.19.2 and all the basic things work, including dask on a single node. I’m now testing multinode and it seems worse than before. I immediately get the error below:

  1. 2 nodes, with 2 GPUs on the scheduler node and 1 on the worker node (https://github.com/dmlc/xgboost/issues/6397), or 2 nodes with 1 GPU on the scheduler node and 1 on the worker node: CRASHES
2021-06-02 14:02:39,438 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     model.fit(X, y, sample_weight=sample_weight, **kwargs)
2021-06-02 14:02:39,439 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1802, in fit
2021-06-02 14:02:39,439 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     return self._client_sync(self._fit_async, **args)
2021-06-02 14:02:39,440 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1610, in _client_sync
2021-06-02 14:02:39,440 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     return self.client.sync(func, **kwargs, asynchronous=asynchronous)
2021-06-02 14:02:39,441 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/client.py", line 843, in sync
2021-06-02 14:02:39,441 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     return sync(
2021-06-02 14:02:39,442 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/utils.py", line 353, in sync
2021-06-02 14:02:39,442 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     raise exc.with_traceback(tb)
2021-06-02 14:02:39,442 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/utils.py", line 336, in f
2021-06-02 14:02:39,443 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     result[0] = yield future
2021-06-02 14:02:39,443 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
2021-06-02 14:02:39,444 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     value = future.result()
2021-06-02 14:02:39,444 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1760, in _fit_async
2021-06-02 14:02:39,445 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     results = await self.client.sync(
2021-06-02 14:02:39,445 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 915, in _train_async
2021-06-02 14:02:39,446 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     results = await client.gather(futures, asynchronous=True)
2021-06-02 14:02:39,446 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/client.py", line 1840, in _gather
2021-06-02 14:02:39,446 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     raise exception.with_traceback(traceback)
2021-06-02 14:02:39,447 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 870, in dispatched_train
2021-06-02 14:02:39,447 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     bst = worker_train(params=local_param,
2021-06-02 14:02:39,448 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/training.py", line 189, in train
2021-06-02 14:02:39,448 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     bst = _train_internal(params, dtrain,
2021-06-02 14:02:39,449 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/training.py", line 81, in _train_internal
2021-06-02 14:02:39,449 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     bst.update(dtrain, i, obj)
2021-06-02 14:02:39,449 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/core.py", line 1571, in update
2021-06-02 14:02:39,450 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     _check_call(_LIB.XGBoosterUpdateOneIter(self.handle,
2021-06-02 14:02:39,450 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/core.py", line 214, in _check_call
2021-06-02 14:02:39,451 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     raise XGBoostError(py_str(_LIB.XGBGetLastError()))
2021-06-02 14:02:39,451 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   | xgboost.core.XGBoostError: [14:02:39] /workspace/xgboost/src/tree/updater_gpu_hist.cu:794: Exception in gpu_hist: NCCL failure :unhandled system error /workspace/xgboost/src
2021-06-02 14:02:39,452 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   | 
2021-06-02 14:02:39,452 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   | Stack trace:
2021-06-02 14:02:39,453 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (0) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x63e359) [0x147e7d56e359]
2021-06-02 14:02:39,453 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (1) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::G
2021-06-02 14:02:39,454 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (2) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::G
2021-06-02 14:02:39,454 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (3) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboo
2021-06-02 14:02:39,454 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (4) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x32
2021-06-02 14:02:39,455 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (5) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x68) [0x147e7d136188]
2021-06-02 14:02:39,455 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (6) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x147fea62a9dd]
2021-06-02 14:02:39,456 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (7) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x147fea62a067]
2021-06-02 14:02:39,456 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (8) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x10da8) [0x147fea640da8]
2021-06-02 14:02:39,457 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   | 

This was the case where one system has 2 GPUs and the other worker has 1 GPU, which you asked me to re-test. After the failure it also hangs everything, which is also unfortunate.

~I’ll try a homogeneous setup.~ A homogeneous setup with 1 GPU on each of the 2 nodes also fails the same way.
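
For reference, the multinode run follows roughly the shape below. This is only a minimal sketch: the scheduler address, data sizes, and parameters are placeholders rather than the actual reproduction code.

```python
# Minimal sketch of the failing multinode setup (placeholder address and sizes).
# Assumes `dask-scheduler` runs on one node and a dask GPU worker on each node.
import dask.array as da
from dask.distributed import Client
from xgboost import dask as dxgb

client = Client("tcp://scheduler-node:8786")  # hypothetical scheduler address

# Small random binary classification problem, partitioned across workers.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = (da.random.random(100_000, chunks=10_000) > 0.5).astype("int32")

clf = dxgb.DaskXGBClassifier(
    tree_method="gpu_hist",  # GPU training; NCCL is used for the multi-GPU AllReduce
    n_estimators=50,
)
clf.client = client
clf.fit(X, y)  # this is where the NCCL "unhandled system error" surfaces
```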

  2. Single node, but still using a dask scheduler/worker and client, 2 GPUs: HANGS

I also tried a single node but using a dask scheduler/worker on the node, and it worked for a little while and then hung. It seemed to work for GBM but hang for RF.

Thread 0x0000150f09b9f700 (most recent call first):
  File "/home/jon/minicondadai_py38/lib/python3.8/concurrent/futures/thread.py", line 78 in _worker
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 870 in run
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x0000150f037fb700 (most recent call first):
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 306 in wait
  File "/home/jon/minicondadai_py38/lib/python3.8/queue.py", line 179 in get
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/threadpoolexecutor.py", line 51 in _worker
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 870 in run
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x0000150f035fa700 (most recent call first):
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 306 in wait
  File "/home/jon/minicondadai_py38/lib/python3.8/queue.py", line 179 in get
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/threadpoolexecutor.py", line 51 in _worker
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 870 in run
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x0000150f02df6700 (most recent call first):
  File "/home/jon/minicondadai_py38/lib/python3.8/selectors.py", line 468 in select
  File "/home/jon/minicondadai_py38/lib/python3.8/asyncio/base_events.py", line 1823 in _run_once
  File "/home/jon/minicondadai_py38/lib/python3.8/asyncio/base_events.py", line 570 in run_forever
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 199 in start
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/utils.py", line 430 in run_loop
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 870 in run
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x0000150f18744700 (most recent call first):
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 306 in wait
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 558 in wait
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/utils.py", line 350 in sync
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/client.py", line 843 in sync
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1610 in _client_sync
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1802 in fit

No errors appear during the hang.
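
For context, the per-thread dump above is in the format produced by Python's faulthandler module. A rough sketch of how such a dump can be captured from the hung client process (the signal choice here is arbitrary):

```python
# Sketch: let a hung process dump all thread stacks on demand.
# The traces above are in faulthandler's output format.
import faulthandler
import signal

# After this, `kill -USR1 <pid>` prints every thread's stack to stderr
# without terminating the process.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Alternatively, dump automatically if nothing completes within 10 minutes.
faulthandler.dump_traceback_later(timeout=600, exit=False)
```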

  3. Single node with LocalCUDACluster model fitting: it looked like it was going OK at first, but then kept hitting:
[17:52:02] /workspace/xgboost/src/c_api/../data/../common/common.h:45: /workspace/xgboost/src/metric/../common/device_helpers.cuh: 1347: cudaErrorInvalidValue: invalid argument
Stack trace:
  [bt] (0) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x457209) [0x14ffb9387209]
  [bt] (1) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x190) [0x14ffb9389800]
  [bt] (2) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(void dh::ArgSort<false, unsigned long, float const>(xgboost::common::Span<float const, 18446744073709551615ul>, xgboost::common::Span<unsigned long, 18446744073709551615ul>)+0x255) [0x14ffb9440655]
  [bt] (3) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::GPUBinaryAUC(xgboost::common::Span<float const, 18446744073709551615ul>, xgboost::MetaInfo const&, int, std::shared_ptr<xgboost::metric::DeviceAUCCache>*)+0x1ee) [0x14ffb9430fde]
  [bt] (4) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::EvalAUC::Eval(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, bool)+0x622) [0x14ffb9273cb2]
  [bt] (5) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::EvalOneIter(int, std::vector<std::shared_ptr<xgboost::DMatrix>, std::allocator<std::shared_ptr<xgboost::DMatrix> > > const&, std::vector<std::string, std::allocator<std::string> > const&)+0x44f) [0x14ffb924d24f]
  [bt] (6) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterEvalOneIter+0x39d) [0x14ffb913a89d]
  [bt] (7) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x15013b1a69dd]
  [bt] (8) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x15013b1a6067]  

or sometimes hit:

[21:53:03] /workspace/xgboost/src/c_api/../data/../common/common.h:45: /workspace/xgboost/src/metric/../common/device_helpers.cuh: 1347: cudaErrorInvalidValue: invalid argument
Stack trace:
[bt] (0) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x457209) [0x1500a1387209]
[bt] (1) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x190) [0x1500a1389800]
[bt] (2) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(void dh::ArgSort<false, unsigned long, float const>(xgboost::common::Span<float const, 18446744073709551615ul>, xgboost::common::Span<unsigned long, 18446744073709551615ul>)+0x255) [0x1500a1440655]
[bt] (3) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::GPUBinaryAUC(xgboost::common::Span<float const, 18446744073709551615ul>, xgboost::MetaInfo const&, int, std::shared_ptr<xgboost::metric::DeviceAUCCache>*)+0x1ee) [0x1500a1430fde]
[bt] (4) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::EvalAUC::Eval(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, bool)+0x622) [0x1500a1273cb2]
[bt] (5) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::EvalOneIter(int, std::vector<std::shared_ptr<xgboost::DMatrix>, std::allocator<std::shared_ptr<xgboost::DMatrix> > > const&, std::vector<std::string, std::allocator<std::string> > const&)+0x44f) [0x1500a124d24f]
[bt] (6) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterEvalOneIter+0x39d) [0x1500a113a89d]
[bt] (7) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x15023bcd39dd]
[bt] (8) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x15023bcd3067]

Oddly, once it started failing it never recovered, and every subsequent fit failed.
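
For reference, the trace points at the GPU AUC metric (GPUBinaryAUC reached via EvalOneIter), so one way to narrow it down is to run the same kind of fit with a non-AUC eval metric. This is only a sketch with placeholder data and parameters, not the actual setup:

```python
# Sketch: same local-CUDA-cluster path, but with a non-AUC eval metric to
# check whether the GPU AUC evaluation (GPUBinaryAUC in the trace) is what
# triggers cudaErrorInvalidValue. Data shapes and parameters are placeholders.
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from xgboost import dask as dxgb

with LocalCUDACluster() as cluster, Client(cluster) as client:
    X = da.random.random((50_000, 20), chunks=(5_000, 20))
    y = (da.random.random(50_000, chunks=5_000) > 0.5).astype("int32")
    dtrain = dxgb.DaskDMatrix(client, X, y)

    params = {
        "tree_method": "gpu_hist",
        "objective": "binary:logistic",
        "eval_metric": "logloss",  # instead of "auc", bypassing GPUBinaryAUC
    }
    out = dxgb.train(client, params, dtrain,
                     num_boost_round=50,
                     evals=[(dtrain, "train")])
    print(out["history"])
```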

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 37 (37 by maintainers)

Most upvoted comments

If you saw cudaErrorInvalidValue, the first thing I suggest trying is making sure the library is built with HIDE_CXX_SYMBOLS. I have encountered it a few times, and each time it was caused by a corrupted stack due to C++ symbol conflicts between libraries.
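
A quick way to check whether an installed wheel hides its C++ symbols is to count the mangled symbols exported by libxgboost.so. This is a rough sketch and assumes GNU `nm` is on the PATH:

```python
# Sketch: inspect the exported symbols of the installed libxgboost.so.
# Many exported mangled (_Z...) symbols suggest the library was built
# without HIDE_CXX_SYMBOLS, making clashes with other GPU libraries possible.
import os
import subprocess
import xgboost

lib = os.path.join(os.path.dirname(xgboost.__file__), "lib", "libxgboost.so")
out = subprocess.run(["nm", "-D", "--defined-only", lib],
                     capture_output=True, text=True, check=True).stdout
cxx_exports = [line for line in out.splitlines() if " T _Z" in line]
print(f"{len(cxx_exports)} exported C++ symbols in {lib}")
```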

So far I haven’t been able to reproduce it with compute-sanitizer. I tried empty and non-empty datasets, both SNMG and MNMG. dask_cudf and dd.dataframe inputs were also tested.
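
For illustration, the dask_cudf variant of such a reproduction attempt looks roughly like the sketch below; data shapes and parameters are placeholders rather than the exact test code.

```python
# Sketch of one reproduction variant: dask_cudf input on a single-node,
# multi-GPU LocalCUDACluster (placeholder shapes and parameters).
import cudf
import dask_cudf
import numpy as np
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from xgboost import dask as dxgb

with LocalCUDACluster() as cluster, Client(cluster) as client:
    rng = np.random.default_rng(0)
    gdf = cudf.DataFrame({
        "f0": rng.random(10_000),
        "f1": rng.random(10_000),
        "y": rng.integers(0, 2, 10_000),
    })
    ddf = dask_cudf.from_cudf(gdf, npartitions=4)

    dtrain = dxgb.DaskDMatrix(client, ddf[["f0", "f1"]], ddf["y"])
    dxgb.train(
        client,
        {"tree_method": "gpu_hist", "objective": "binary:logistic",
         "eval_metric": "auc"},
        dtrain,
        num_boost_round=10,
        evals=[(dtrain, "train")],
    )
```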