xgboost: dask multinode: /workspace/xgboost/src/tree/updater_gpu_hist.cu:794: Exception in gpu_hist: NCCL failure :unhandled system error /workspace/xgboost/src

@trivialfis We finished upgrading to the latest Python 3.8 + RAPIDS 0.19.2 and all the basic things work, including dask on a single node. I’m now testing multinode and it seems worse than before. I immediately get the error below:

  1. 2 nodes, with 2 GPUs on the scheduler node and 1 on the worker node (https://github.com/dmlc/xgboost/issues/6397), or 2 nodes with 1 GPU on the scheduler node and 1 on the worker node: CRASHES
2021-06-02 14:02:39,438 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     model.fit(X, y, sample_weight=sample_weight, **kwargs)
2021-06-02 14:02:39,439 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1802, in fit
2021-06-02 14:02:39,439 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     return self._client_sync(self._fit_async, **args)
2021-06-02 14:02:39,440 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1610, in _client_sync
2021-06-02 14:02:39,440 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     return self.client.sync(func, **kwargs, asynchronous=asynchronous)
2021-06-02 14:02:39,441 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/client.py", line 843, in sync
2021-06-02 14:02:39,441 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     return sync(
2021-06-02 14:02:39,442 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/utils.py", line 353, in sync
2021-06-02 14:02:39,442 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     raise exc.with_traceback(tb)
2021-06-02 14:02:39,442 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/utils.py", line 336, in f
2021-06-02 14:02:39,443 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     result[0] = yield future
2021-06-02 14:02:39,443 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
2021-06-02 14:02:39,444 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     value = future.result()
2021-06-02 14:02:39,444 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1760, in _fit_async
2021-06-02 14:02:39,445 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     results = await self.client.sync(
2021-06-02 14:02:39,445 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 915, in _train_async
2021-06-02 14:02:39,446 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     results = await client.gather(futures, asynchronous=True)
2021-06-02 14:02:39,446 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/client.py", line 1840, in _gather
2021-06-02 14:02:39,446 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     raise exception.with_traceback(traceback)
2021-06-02 14:02:39,447 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 870, in dispatched_train
2021-06-02 14:02:39,447 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     bst = worker_train(params=local_param,
2021-06-02 14:02:39,448 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/training.py", line 189, in train
2021-06-02 14:02:39,448 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     bst = _train_internal(params, dtrain,
2021-06-02 14:02:39,449 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/training.py", line 81, in _train_internal
2021-06-02 14:02:39,449 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     bst.update(dtrain, i, obj)
2021-06-02 14:02:39,449 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/core.py", line 1571, in update
2021-06-02 14:02:39,450 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     _check_call(_LIB.XGBoosterUpdateOneIter(self.handle,
2021-06-02 14:02:39,450 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/core.py", line 214, in _check_call
2021-06-02 14:02:39,451 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |     raise XGBoostError(py_str(_LIB.XGBGetLastError()))
2021-06-02 14:02:39,451 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   | xgboost.core.XGBoostError: [14:02:39] /workspace/xgboost/src/tree/updater_gpu_hist.cu:794: Exception in gpu_hist: NCCL failure :unhandled system error /workspace/xgboost/src
2021-06-02 14:02:39,452 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   | 
2021-06-02 14:02:39,452 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   | Stack trace:
2021-06-02 14:02:39,453 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (0) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x63e359) [0x147e7d56e359]
2021-06-02 14:02:39,453 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (1) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::G
2021-06-02 14:02:39,454 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (2) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::G
2021-06-02 14:02:39,454 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (3) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboo
2021-06-02 14:02:39,454 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (4) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x32
2021-06-02 14:02:39,455 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (5) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x68) [0x147e7d136188]
2021-06-02 14:02:39,455 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (6) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x147fea62a9dd]
2021-06-02 14:02:39,456 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (7) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x147fea62a067]
2021-06-02 14:02:39,456 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   |   [bt] (8) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x10da8) [0x147fea640da8]
2021-06-02 14:02:39,457 C:  3% D:1.0TB   M:118.8GB NODE:LOCAL1      25394  DATA   | 

This was the case where one system has 2 GPUs and the other worker has 1 GPU, which you asked me to re-test. After the failure it also hangs everything, which is also unfortunate.

~I’ll try a homogeneous setup.~ A homogeneous setup with 1 GPU on each of the 2 nodes also fails the same way.
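
For reference, the multinode run follows roughly the shape below. This is only a minimal sketch: the scheduler address, data sizes, and parameters are placeholders rather than the actual reproduction code.

```python
# Minimal sketch of the failing multinode setup (placeholder address and sizes).
# Assumes `dask-scheduler` runs on one node and a dask GPU worker on each node.
import dask.array as da
from dask.distributed import Client
from xgboost import dask as dxgb

client = Client("tcp://scheduler-node:8786")  # hypothetical scheduler address

# Small random binary classification problem, partitioned across workers.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = (da.random.random(100_000, chunks=10_000) > 0.5).astype("int32")

clf = dxgb.DaskXGBClassifier(
    tree_method="gpu_hist",  # GPU training; NCCL is used for the multi-GPU AllReduce
    n_estimators=50,
)
clf.client = client
clf.fit(X, y)  # this is where the NCCL "unhandled system error" surfaces
```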

  2. Single node, but still using a dask scheduler/worker and client, 2 GPUs: HANGS

I also tried a single node but using a dask scheduler/worker on the node, and it worked for a little while and then hung. It seemed to work for GBM but hang for RF.

Thread 0x0000150f09b9f700 (most recent call first):
  File "/home/jon/minicondadai_py38/lib/python3.8/concurrent/futures/thread.py", line 78 in _worker
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 870 in run
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x0000150f037fb700 (most recent call first):
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 306 in wait
  File "/home/jon/minicondadai_py38/lib/python3.8/queue.py", line 179 in get
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/threadpoolexecutor.py", line 51 in _worker
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 870 in run
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x0000150f035fa700 (most recent call first):
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 306 in wait
  File "/home/jon/minicondadai_py38/lib/python3.8/queue.py", line 179 in get
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/threadpoolexecutor.py", line 51 in _worker
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 870 in run
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x0000150f02df6700 (most recent call first):
  File "/home/jon/minicondadai_py38/lib/python3.8/selectors.py", line 468 in select
  File "/home/jon/minicondadai_py38/lib/python3.8/asyncio/base_events.py", line 1823 in _run_once
  File "/home/jon/minicondadai_py38/lib/python3.8/asyncio/base_events.py", line 570 in run_forever
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 199 in start
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/utils.py", line 430 in run_loop
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 870 in run
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x0000150f18744700 (most recent call first):
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 306 in wait
  File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 558 in wait
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/utils.py", line 350 in sync
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/client.py", line 843 in sync
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1610 in _client_sync
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1802 in fit

No errors appear during the hang.
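
For context, the per-thread dump above is in the format produced by Python's faulthandler module. A rough sketch of how such a dump can be captured from the hung client process (the signal choice here is arbitrary):

```python
# Sketch: let a hung process dump all thread stacks on demand.
# The traces above are in faulthandler's output format.
import faulthandler
import signal

# After this, `kill -USR1 <pid>` prints every thread's stack to stderr
# without terminating the process.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Alternatively, dump automatically if nothing completes within 10 minutes.
faulthandler.dump_traceback_later(timeout=600, exit=False)
```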

  3. Single node with LocalCUDACluster model fitting: it looked like it was going OK at first, but then kept hitting:
[17:52:02] /workspace/xgboost/src/c_api/../data/../common/common.h:45: /workspace/xgboost/src/metric/../common/device_helpers.cuh: 1347: cudaErrorInvalidValue: invalid argument
Stack trace:
  [bt] (0) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x457209) [0x14ffb9387209]
  [bt] (1) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x190) [0x14ffb9389800]
  [bt] (2) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(void dh::ArgSort<false, unsigned long, float const>(xgboost::common::Span<float const, 18446744073709551615ul>, xgboost::common::Span<unsigned long, 18446744073709551615ul>)+0x255) [0x14ffb9440655]
  [bt] (3) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::GPUBinaryAUC(xgboost::common::Span<float const, 18446744073709551615ul>, xgboost::MetaInfo const&, int, std::shared_ptr<xgboost::metric::DeviceAUCCache>*)+0x1ee) [0x14ffb9430fde]
  [bt] (4) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::EvalAUC::Eval(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, bool)+0x622) [0x14ffb9273cb2]
  [bt] (5) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::EvalOneIter(int, std::vector<std::shared_ptr<xgboost::DMatrix>, std::allocator<std::shared_ptr<xgboost::DMatrix> > > const&, std::vector<std::string, std::allocator<std::string> > const&)+0x44f) [0x14ffb924d24f]
  [bt] (6) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterEvalOneIter+0x39d) [0x14ffb913a89d]
  [bt] (7) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x15013b1a69dd]
  [bt] (8) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x15013b1a6067]  

or sometimes hit:

[21:53:03] /workspace/xgboost/src/c_api/../data/../common/common.h:45: /workspace/xgboost/src/metric/../common/device_helpers.cuh: 1347: cudaErrorInvalidValue: invalid argument
Stack trace:
[bt] (0) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x457209) [0x1500a1387209]
[bt] (1) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x190) [0x1500a1389800]
[bt] (2) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(void dh::ArgSort<false, unsigned long, float const>(xgboost::common::Span<float const, 18446744073709551615ul>, xgboost::common::Span<unsigned long, 18446744073709551615ul>)+0x255) [0x1500a1440655]
[bt] (3) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::GPUBinaryAUC(xgboost::common::Span<float const, 18446744073709551615ul>, xgboost::MetaInfo const&, int, std::shared_ptr<xgboost::metric::DeviceAUCCache>*)+0x1ee) [0x1500a1430fde]
[bt] (4) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::EvalAUC::Eval(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, bool)+0x622) [0x1500a1273cb2]
[bt] (5) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::EvalOneIter(int, std::vector<std::shared_ptr<xgboost::DMatrix>, std::allocator<std::shared_ptr<xgboost::DMatrix> > > const&, std::vector<std::string, std::allocator<std::string> > const&)+0x44f) [0x1500a124d24f]
[bt] (6) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterEvalOneIter+0x39d) [0x1500a113a89d]
[bt] (7) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x15023bcd39dd]
[bt] (8) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x15023bcd3067]

Oddly, once it started failing it never recovered, and every subsequent fit failed.
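
For reference, the trace points at the GPU AUC metric (GPUBinaryAUC reached via EvalOneIter), so one way to narrow it down is to run the same kind of fit with a non-AUC eval metric. This is only a sketch with placeholder data and parameters, not the actual setup:

```python
# Sketch: same local-CUDA-cluster path, but with a non-AUC eval metric to
# check whether the GPU AUC evaluation (GPUBinaryAUC in the trace) is what
# triggers cudaErrorInvalidValue. Data shapes and parameters are placeholders.
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from xgboost import dask as dxgb

with LocalCUDACluster() as cluster, Client(cluster) as client:
    X = da.random.random((50_000, 20), chunks=(5_000, 20))
    y = (da.random.random(50_000, chunks=5_000) > 0.5).astype("int32")
    dtrain = dxgb.DaskDMatrix(client, X, y)

    params = {
        "tree_method": "gpu_hist",
        "objective": "binary:logistic",
        "eval_metric": "logloss",  # instead of "auc", bypassing GPUBinaryAUC
    }
    out = dxgb.train(client, params, dtrain,
                     num_boost_round=50,
                     evals=[(dtrain, "train")])
    print(out["history"])
```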

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 37 (37 by maintainers)

Most upvoted comments

If you saw cudaErrorInvalidValue, the first thing I suggest trying is making sure the library is built with HIDE_CXX_SYMBOLS. I have encountered it a few times, and each time it was caused by a corrupted stack due to C++ symbol conflicts between libraries.
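
A quick way to check whether an installed wheel hides its C++ symbols is to count the mangled symbols exported by libxgboost.so. This is a rough sketch and assumes GNU `nm` is on the PATH:

```python
# Sketch: inspect the exported symbols of the installed libxgboost.so.
# Many exported mangled (_Z...) symbols suggest the library was built
# without HIDE_CXX_SYMBOLS, making clashes with other GPU libraries possible.
import os
import subprocess
import xgboost

lib = os.path.join(os.path.dirname(xgboost.__file__), "lib", "libxgboost.so")
out = subprocess.run(["nm", "-D", "--defined-only", lib],
                     capture_output=True, text=True, check=True).stdout
cxx_exports = [line for line in out.splitlines() if " T _Z" in line]
print(f"{len(cxx_exports)} exported C++ symbols in {lib}")
```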

So far I haven’t been able to reproduce it with compute-sanitizer. I tried empty and non-empty datasets, both SNMG and MNMG. dask_cudf and dd.dataframe inputs were also tested.
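
For illustration, the dask_cudf variant of such a reproduction attempt looks roughly like the sketch below; data shapes and parameters are placeholders rather than the exact test code.

```python
# Sketch of one reproduction variant: dask_cudf input on a single-node,
# multi-GPU LocalCUDACluster (placeholder shapes and parameters).
import cudf
import dask_cudf
import numpy as np
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from xgboost import dask as dxgb

with LocalCUDACluster() as cluster, Client(cluster) as client:
    rng = np.random.default_rng(0)
    gdf = cudf.DataFrame({
        "f0": rng.random(10_000),
        "f1": rng.random(10_000),
        "y": rng.integers(0, 2, 10_000),
    })
    ddf = dask_cudf.from_cudf(gdf, npartitions=4)

    dtrain = dxgb.DaskDMatrix(client, ddf[["f0", "f1"]], ddf["y"])
    dxgb.train(
        client,
        {"tree_method": "gpu_hist", "objective": "binary:logistic",
         "eval_metric": "auc"},
        dtrain,
        num_boost_round=10,
        evals=[(dtrain, "train")],
    )
```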