xgboost: dask multinode: /workspace/xgboost/src/tree/updater_gpu_hist.cu:794: Exception in gpu_hist: NCCL failure :unhandled system error /workspace/xgboost/src
@trivialfis We have finished upgrading to the latest Python 3.8 + RAPIDS 0.19.2 stack, and all the basic things are working, including dask on a single node. I'm now testing multinode and it seems worse than before. I immediately get the error below:
- 2 nodes, 2 GPUs on the scheduler node, 1 on the worker node (https://github.com/dmlc/xgboost/issues/6397): CRASHES
- 2 nodes, 1 GPU on the scheduler node, 1 on the worker node: CRASHES
2021-06-02 14:02:39,438 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | model.fit(X, y, sample_weight=sample_weight, **kwargs)
2021-06-02 14:02:39,439 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1802, in fit
2021-06-02 14:02:39,439 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | return self._client_sync(self._fit_async, **args)
2021-06-02 14:02:39,440 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1610, in _client_sync
2021-06-02 14:02:39,440 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | return self.client.sync(func, **kwargs, asynchronous=asynchronous)
2021-06-02 14:02:39,441 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/client.py", line 843, in sync
2021-06-02 14:02:39,441 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | return sync(
2021-06-02 14:02:39,442 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/utils.py", line 353, in sync
2021-06-02 14:02:39,442 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | raise exc.with_traceback(tb)
2021-06-02 14:02:39,442 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/utils.py", line 336, in f
2021-06-02 14:02:39,443 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | result[0] = yield future
2021-06-02 14:02:39,443 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
2021-06-02 14:02:39,444 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | value = future.result()
2021-06-02 14:02:39,444 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1760, in _fit_async
2021-06-02 14:02:39,445 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | results = await self.client.sync(
2021-06-02 14:02:39,445 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 915, in _train_async
2021-06-02 14:02:39,446 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | results = await client.gather(futures, asynchronous=True)
2021-06-02 14:02:39,446 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/client.py", line 1840, in _gather
2021-06-02 14:02:39,446 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | raise exception.with_traceback(traceback)
2021-06-02 14:02:39,447 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 870, in dispatched_train
2021-06-02 14:02:39,447 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | bst = worker_train(params=local_param,
2021-06-02 14:02:39,448 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/training.py", line 189, in train
2021-06-02 14:02:39,448 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | bst = _train_internal(params, dtrain,
2021-06-02 14:02:39,449 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/training.py", line 81, in _train_internal
2021-06-02 14:02:39,449 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | bst.update(dtrain, i, obj)
2021-06-02 14:02:39,449 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/core.py", line 1571, in update
2021-06-02 14:02:39,450 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | _check_call(_LIB.XGBoosterUpdateOneIter(self.handle,
2021-06-02 14:02:39,450 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/core.py", line 214, in _check_call
2021-06-02 14:02:39,451 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | raise XGBoostError(py_str(_LIB.XGBGetLastError()))
2021-06-02 14:02:39,451 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | xgboost.core.XGBoostError: [14:02:39] /workspace/xgboost/src/tree/updater_gpu_hist.cu:794: Exception in gpu_hist: NCCL failure :unhandled system error /workspace/xgboost/src
2021-06-02 14:02:39,452 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA |
2021-06-02 14:02:39,452 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | Stack trace:
2021-06-02 14:02:39,453 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | [bt] (0) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x63e359) [0x147e7d56e359]
2021-06-02 14:02:39,453 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | [bt] (1) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::G
2021-06-02 14:02:39,454 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | [bt] (2) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::G
2021-06-02 14:02:39,454 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | [bt] (3) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboo
2021-06-02 14:02:39,454 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | [bt] (4) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x32
2021-06-02 14:02:39,455 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | [bt] (5) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x68) [0x147e7d136188]
2021-06-02 14:02:39,455 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | [bt] (6) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x147fea62a9dd]
2021-06-02 14:02:39,456 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | [bt] (7) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x147fea62a067]
2021-06-02 14:02:39,456 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA | [bt] (8) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x10da8) [0x147fea640da8]
2021-06-02 14:02:39,457 C: 3% D:1.0TB M:118.8GB NODE:LOCAL1 25394 DATA |
This was with one system having 2 GPUs and the other worker having 1 GPU, which you asked me to re-test. It also hangs everything after the failure, which is unfortunate.
~~I'll try a homogeneous setup.~~ A homogeneous setup with 1 GPU on each of the 2 nodes also fails the same way.
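For reference, the call path being exercised is the standard xgboost.dask estimator fit. A minimal sketch, assuming toy data sizes and a hypothetical scheduler address (not the actual production code):

```python
# Minimal sketch of the failing call path; the scheduler address and data
# sizes are placeholders, not the production code.
import dask.array as da
from dask.distributed import Client
from xgboost import dask as dxgb

client = Client("tcp://scheduler-host:8786")  # external scheduler with GPU workers on 2 nodes

X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = (da.random.random(100_000, chunks=10_000) > 0.5).astype("int")

clf = dxgb.DaskXGBClassifier(tree_method="gpu_hist", n_estimators=50)
clf.client = client
clf.fit(X, y)  # raises the NCCL "unhandled system error" above once more than one node is involved
```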
- Single node, but still using a dask scheduler/worker and client, 2 GPUs: HANGS
I also tried a single node but with a dask scheduler/worker running on the node; it worked for a little while and then hung. It seemed to work for GBM but hang for RF (see the sketch after the thread dump below).
Thread 0x0000150f09b9f700 (most recent call first):
File "/home/jon/minicondadai_py38/lib/python3.8/concurrent/futures/thread.py", line 78 in _worker
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 870 in run
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 890 in _bootstrap
Thread 0x0000150f037fb700 (most recent call first):
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 306 in wait
File "/home/jon/minicondadai_py38/lib/python3.8/queue.py", line 179 in get
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/threadpoolexecutor.py", line 51 in _worker
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 870 in run
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 890 in _bootstrap
Thread 0x0000150f035fa700 (most recent call first):
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 306 in wait
File "/home/jon/minicondadai_py38/lib/python3.8/queue.py", line 179 in get
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/threadpoolexecutor.py", line 51 in _worker
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 870 in run
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 890 in _bootstrap
Thread 0x0000150f02df6700 (most recent call first):
File "/home/jon/minicondadai_py38/lib/python3.8/selectors.py", line 468 in select
File "/home/jon/minicondadai_py38/lib/python3.8/asyncio/base_events.py", line 1823 in _run_once
File "/home/jon/minicondadai_py38/lib/python3.8/asyncio/base_events.py", line 570 in run_forever
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 199 in start
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/utils.py", line 430 in run_loop
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 870 in run
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 890 in _bootstrap
Current thread 0x0000150f18744700 (most recent call first):
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 306 in wait
File "/home/jon/minicondadai_py38/lib/python3.8/threading.py", line 558 in wait
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/utils.py", line 350 in sync
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/distributed/client.py", line 843 in sync
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1610 in _client_sync
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/dask.py", line 1802 in fit
No errors appear during the hang.
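The RF case is the one that blocks. Below is a rough sketch of an RF-style fit through xgboost.dask, assuming toy data, a placeholder scheduler address, and XGBoost's usual way of expressing a random forest (one boosting round of many parallel trees); it is not the actual code that hung:

```python
# Not the actual code; just one way to express an RF-style fit through
# xgboost.dask, which is the configuration that blocked in client.sync() here.
import dask.array as da
from dask.distributed import Client
from xgboost import dask as dxgb

client = Client("tcp://127.0.0.1:8786")  # dask scheduler + GPU worker(s) on the same box

X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = (da.random.random(100_000, chunks=10_000) > 0.5).astype("int")

rf = dxgb.DaskXGBClassifier(
    tree_method="gpu_hist",
    n_estimators=1,          # one boosting round ...
    num_parallel_tree=100,   # ... of many trees built in parallel, i.e. a random forest
    subsample=0.8,
    colsample_bynode=0.8,
    learning_rate=1.0,
)
rf.client = client
rf.fit(X, y)  # never returns; the main thread waits in distributed.utils.sync as in the dump above
```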
- Single node with a local CUDA cluster (LocalCUDACluster) for model fitting: it looked like it was going OK at first, but then kept hitting the error below (a sketch of this setup follows the traces):
[17:52:02] /workspace/xgboost/src/c_api/../data/../common/common.h:45: /workspace/xgboost/src/metric/../common/device_helpers.cuh: 1347: cudaErrorInvalidValue: invalid argument
Stack trace:
[bt] (0) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x457209) [0x14ffb9387209]
[bt] (1) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x190) [0x14ffb9389800]
[bt] (2) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(void dh::ArgSort<false, unsigned long, float const>(xgboost::common::Span<float const, 18446744073709551615ul>, xgboost::common::Span<unsigned long, 18446744073709551615ul>)+0x255) [0x14ffb9440655]
[bt] (3) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::GPUBinaryAUC(xgboost::common::Span<float const, 18446744073709551615ul>, xgboost::MetaInfo const&, int, std::shared_ptr<xgboost::metric::DeviceAUCCache>*)+0x1ee) [0x14ffb9430fde]
[bt] (4) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::EvalAUC::Eval(xgboost::HostDeviceVector const&, xgboost::MetaInfo const&, bool)+0x622) [0x14ffb9273cb2]
[bt] (5) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::EvalOneIter(int, std::vector<std::shared_ptr<xgboost::DMatrix>, std::allocator<std::shared_ptr<xgboost::DMatrix> > > const&, std::vector<std::string, std::allocator<std::string> > const&)+0x44f) [0x14ffb924d24f]
[bt] (6) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterEvalOneIter+0x39d) [0x14ffb913a89d]
[bt] (7) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x15013b1a69dd]
[bt] (8) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x15013b1a6067]
or sometimes hit:
[21:53:03] /workspace/xgboost/src/c_api/../data/../common/common.h:45: /workspace/xgboost/src/metric/../common/device_helpers.cuh: 1347: cudaErrorInvalidValue: invalid argument
Stack trace:
[bt] (0) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x457209) [0x1500a1387209]
[bt] (1) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x190) [0x1500a1389800]
[bt] (2) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(void dh::ArgSort<false, unsigned long, float const>(xgboost::common::Span<float const, 18446744073709551615ul>, xgboost::common::Span<unsigned long, 18446744073709551615ul>)+0x255) [0x1500a1440655]
[bt] (3) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::GPUBinaryAUC(xgboost::common::Span<float const, 18446744073709551615ul>, xgboost::MetaInfo const&, int, std::shared_ptr<xgboost::metric::DeviceAUCCache>*)+0x1ee) [0x1500a1430fde]
[bt] (4) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::EvalAUC::Eval(xgboost::HostDeviceVector const&, xgboost::MetaInfo const&, bool)+0x622) [0x1500a1273cb2]
[bt] (5) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::EvalOneIter(int, std::vector<std::shared_ptr<xgboost::DMatrix>, std::allocator<std::shared_ptr<xgboost::DMatrix> > > const&, std::vector<std::string, std::allocator<std::string> > const&)+0x44f) [0x1500a124d24f]
[bt] (6) /home/jon/minicondadai_py38/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterEvalOneIter+0x39d) [0x1500a113a89d]
[bt] (7) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x15023bcd39dd]
[bt] (8) /home/jon/minicondadai_py38/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x15023bcd3067]
Oddly, once it started failing it never recovered, and every subsequent fit fails.
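For completeness, a minimal sketch of this scenario using the lower-level xgboost.dask.train API with toy data; evaluating the auc metric on an eval set is what reaches EvalAUC/GPUBinaryAUC in the traces above:

```python
# Sketch of the LocalCUDACluster scenario with toy data, using the
# lower-level xgboost.dask.train API; evaluating "auc" on an eval set is
# what reaches GPUBinaryAUC / EvalAUC in the stack traces above.
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from xgboost import dask as dxgb

with LocalCUDACluster(n_workers=2) as cluster, Client(cluster) as client:
    X = da.random.random((100_000, 20), chunks=(10_000, 20))
    y = (da.random.random(100_000, chunks=10_000) > 0.5).astype("int")

    dtrain = dxgb.DaskDMatrix(client, X, y)
    output = dxgb.train(
        client,
        {"tree_method": "gpu_hist", "objective": "binary:logistic", "eval_metric": "auc"},
        dtrain,
        num_boost_round=50,
        evals=[(dtrain, "train")],
    )
    booster = output["booster"]  # fails with cudaErrorInvalidValue during metric evaluation
```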
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 37 (37 by maintainers)
If you saw cudaErrorInvalidValue, the first option I suggest trying is making sure HIDE_CXX_SYMBOLS is enabled when building XGBoost. I have encountered it a few times, all caused by a corrupted stack due to C++ symbol conflicts between libraries. So far I haven't been able to reproduce it with compute-sanitizer: I tried an empty dataset and a non-empty dataset, SNMG and MNMG; both dask_cudf and dd.dataframe inputs were tested as well.
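For anyone trying to reproduce, the input-path matrix described above (dask.dataframe vs dask_cudf, feeding the same gpu_hist training call) can be exercised with something like the following sketch; the data and cluster setup are hypothetical and this is not the maintainer's actual test script:

```python
# Not the maintainer's test script; a sketch of the two input paths mentioned
# above (CPU-backed dask.dataframe vs GPU-backed dask_cudf) feeding the same
# gpu_hist training call, with hypothetical toy data.
import dask.dataframe as dd
import dask_cudf
import pandas as pd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from xgboost import dask as dxgb


def train(client, X, y):
    dtrain = dxgb.DaskDMatrix(client, X, y)
    return dxgb.train(client, {"tree_method": "gpu_hist"}, dtrain, num_boost_round=10)


with LocalCUDACluster(n_workers=2) as cluster, Client(cluster) as client:
    pdf = pd.DataFrame({"a": range(1000), "b": range(1000), "y": [0, 1] * 500})
    ddf = dd.from_pandas(pdf, npartitions=4)      # dask.dataframe (host memory)
    gdf = dask_cudf.from_dask_dataframe(ddf)      # dask_cudf (device memory)

    train(client, ddf[["a", "b"]], ddf["y"])
    train(client, gdf[["a", "b"]], gdf["y"])
```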