xgboost: Allreduce failed and boundary errors since v1.7.0 with dask and rabit
Since v1.7.0 I’m getting allreduce failures when training with dask. I’m using a LocalCluster with 4 to 16 workers and 1 thread per worker. The example below is from a run with 16 workers.
The dataset is huge and has many features, so I wasn’t able to create a reproducer. Instead, I instrumented the code in rabit/src/allreduce_base.cc, as you can see in the patch https://github.com/efocht/xgboost/commit/161ed3cd7dc7c762edf7f46a49d1bde2c45c3836
The problems only start after a while. Allreduce works fine many times, then suddenly things go wrong:
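For readers who want to try a comparable setup, here is a minimal sketch of the configuration described above. It is not a reproducer and is not known to trigger the failure. The synthetic random data, the row and feature counts, and the helper names `build_params` and `run` are all assumptions made for illustration. The parameter values are a subset of those visible in the failing task's arguments, and no query-group information is passed.

```python
def build_params(n_features, n_increasing=5):
    """Build a subset of the reported parameters (feature counts are assumed)."""
    # The first few features are constrained to be increasing, the rest are free.
    constraints = (1,) * n_increasing + (0,) * (n_features - n_increasing)
    return {
        "max_depth": 14,
        "eta": 0.3,
        "monotone_constraints": constraints,
        "objective": "rank:pairwise",
        "tree_method": "hist",
        "max_bin": 256,
        "seed": 42,
        "nthread": 1,
        "verbosity": 3,
    }


def run(n_workers=16, n_rows=100_000, n_features=50, rounds=10):
    """Train on synthetic data with one thread per worker (hypothetical sizes)."""
    # Imported lazily so build_params works without dask/xgboost installed.
    import dask.array as da
    from dask.distributed import Client, LocalCluster
    import xgboost as xgb

    chunk = n_rows // n_workers
    with LocalCluster(n_workers=n_workers, threads_per_worker=1) as cluster:
        with Client(cluster) as client:
            # Random stand-in data; the real dataset is much larger.
            X = da.random.random((n_rows, n_features), chunks=(chunk, n_features))
            y = da.random.randint(0, 5, size=n_rows, chunks=chunk)
            dtrain = xgb.dask.DaskDMatrix(client, X, y)
            out = xgb.dask.train(
                client, build_params(n_features), dtrain, num_boost_round=rounds
            )
            return out["booster"]
```

Calling `run(n_workers=4)` or `run(n_workers=16)` covers the two ends of the worker range mentioned above.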
[11:21:29] TryAllreduceRing byte_size=8 count=553842
[11:21:29] TryAllreduceRing byte_size=8 count=502914
[11:21:29] TryAllreduceRing byte_size=8 count=502914
[11:21:29] TryAllreduceRing byte_size=8 count=502914
[11:21:29] TryAllreduceRing byte_size=8 count=496548
[11:21:29] TryAllreduceRing byte_size=8 count=509280
[11:21:29] TryAllreduceRing byte_size=8 count=496548
[11:21:29] TryAllreduceRing byte_size=8 count=502914
[11:21:29] TryAllreduceRing byte_size=8 count=509280
[11:21:29] TryAllreduceRing byte_size=8 count=509280
[11:21:29] TryAllreduceRing byte_size=8 count=515646
[11:21:29] TryAllreduceRing byte_size=8 count=509280
[11:21:29] TryAllreduceRing byte_size=8 count=502914
[11:21:29] TryAllreduceRing byte_size=8 count=509280
[11:21:29] TryAllreduceRing byte_size=8 count=496548
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryReduceScatterRing ERROR end len = -1
[11:21:29] TryAllreduceRing byte_size=8 count=509280
[11:21:29] TryAllreduceTree byte_size=8 count=1
[11:21:29] TryAllreduceTree byte_size=8 count=1
[11:21:29] TryAllreduceTree byte_size=8 count=1
[11:21:29] TryAllreduceTree byte_size=8 count=1
[11:21:29] TryAllreduceTree byte_size=8 count=1
[11:21:29] TryAllreduceTree byte_size=8 count=1
[11:21:29] TryAllreduceTree byte_size=8 count=1
[11:21:29] TryAllreduceTree byte_size=8 count=1
[11:21:29] TryAllreduceTree byte_size=8 count=1
[11:21:29] TryAllreduceTree byte_size=8 count=1
[11:21:29] TryAllreduceTree byte_size=8 count=1
[11:21:29] TryAllreduceTree ERROR read zero data from parent
[11:21:29] TryAllreduceTree ERROR read zero data from parent
[11:21:29] TryAllreduceTree ERROR read zero data from parent
[11:21:29] TryAllreduceTree ERROR read zero data from parent
[11:21:29] TryAllreduceTree ERROR read zero data from parent
[11:21:29] TryAllreduceTree byte_size=8 count=1
[11:21:29] TryAllreduceTree ERROR read zero data from parent
2023-04-28 11:21:30,490 - distributed.worker - WARNING - Compute Failed
Key: dispatched_train-0733b1eb-a437-40a4-a8d3-c2e13b4d5c0f
Function: dispatched_train
args: ({'max_depth': 14, 'eta': 0.3, 'monotone_constraints': (1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 'alpha': 0.0, 'reg_lambda': 1.0, 'gamma': 0.0, 'objective': 'rank:pairwise', 'scale_pos_weight': 1.0, 'colsample_bytree': 1, 'subsample': 1, 'min_child_weight': 1, 'booster': 'gbtree', 'tree_method': 'hist', 'max_bin': 256, 'seed': 42, 'nthread': 1, 'disable_default_eval_metric': 1, 'verbosity': 3}, {'DMLC_NUM_WORKER': 16, 'DMLC_
kwargs: {}
Exception: "XGBoostError('[11:21:29] /home/focht/NOI/xgboost/rabit/include/rabit/internal/utils.h:86: Allreduce: boundary error c_void_p(46956676228976), 1, 7, 2')"
2023-04-28 11:21:30,490 - distributed.worker - WARNING - Compute Failed
Key: dispatched_train-947ef6f6-81b6-4a15-8be2-3e56414211be
Function: dispatched_train
args: ({'max_depth': 14, 'eta': 0.3, 'monotone_constraints': (1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 'alpha': 0.0, 'reg_lambda': 1.0, 'gamma': 0.0, 'objective': 'rank:pairwise', 'scale_pos_weight': 1.0, 'colsample_bytree': 1, 'subsample': 1, 'min_child_weight': 1, 'booster': 'gbtree', 'tree_method': 'hist', 'max_bin': 256, 'seed': 42, 'nthread': 1, 'disable_default_eval_metric': 1, 'verbosity': 3}, {'DMLC_NUM_WORKER': 16, 'DMLC_
kwargs: {}
Exception: "XGBoostError('[11:21:29] /home/focht/NOI/xgboost/rabit/include/rabit/internal/utils.h:86: Allreduce failed c_void_p(47162700480048), 1, 7, 2')"
[...]
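To make output like the above easier to compare between runs, the instrumented lines can be tallied per function and per error message. The sketch below assumes the line format shown in this log (`[HH:MM:SS] <function> ...`, with `ERROR` marking failures). The `summarize` helper and the regular expression are illustrative only and are not part of xgboost or rabit.

```python
import re
from collections import Counter

# Matches lines such as "[11:21:29] TryAllreduceRing byte_size=8 count=496548"
# and "[11:21:29] TryReduceScatterRing ERROR end len = -1".
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] (Try\w+) (ERROR )?(.*)$")


def summarize(lines):
    """Count successful calls and errors, and find the first error's index."""
    calls = Counter()
    errors = Counter()
    first_error = None
    for idx, line in enumerate(lines):
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not an instrumented rabit line (e.g. dask warnings)
        _, func, is_error, rest = match.groups()
        if is_error:
            errors[(func, rest)] += 1
            if first_error is None:
                first_error = idx
        else:
            calls[func] += 1
    return calls, errors, first_error


sample = [
    "[11:21:29] TryAllreduceRing byte_size=8 count=496548",
    "[11:21:29] TryReduceScatterRing ERROR end len = -1",
    "[11:21:29] TryAllreduceTree byte_size=8 count=1",
    "[11:21:29] TryAllreduceTree ERROR read zero data from parent",
]
calls, errors, first_error = summarize(sample)
print(calls, errors, first_error)
```

Running the same tally on a full worker log gives the number of allreduce calls that succeeded before the first `ERROR` line, which is a convenient figure to compare across versions.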
This doesn’t happen with versions older than v1.7.0. The case works fine with v1.6.2 and v1.5.2.
Do you have any idea what could go wrong? Or what I could adjust to get this working again?
About this issue
- State: closed
- Created a year ago
- Comments: 35 (19 by maintainers)
I will do some digging to make sure it’s actually fixed, rather than a race condition merely being hidden by a performance change.