xgboost: /workspace/xgboost/rabit/include/rabit/internal/utils.h:90: Allreduce failed
I turned on early stopping, and even in the single-node, 2-GPU case I'm getting this error:
/workspace/xgboost/rabit/include/rabit/internal/utils.h:90: Allreduce failed
This is related to https://github.com/dmlc/xgboost/issues/6272, but I get this Allreduce error without the other error, so I wanted to post it separately. However, if one ensures the eval_set has enough partitions across the Dask workers, one does not hit this problem. The error is a bit confusing by itself, but the worker logs show other errors, such as an empty dataset.
Maybe the error provided by xgboost can be improved.
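For anyone hitting the same thing, here is a rough sketch of the workaround described above: repartition the eval_set so it has at least as many partitions as there are workers. The helper name `ensure_min_partitions` is made up for illustration, and it assumes the eval data is a Dask DataFrame-like object exposing `npartitions` and `repartition(npartitions=...)`. Having enough partitions does not by itself guarantee that the scheduler places one on every worker, so treat this as a starting point rather than a guaranteed fix.

```python
def ensure_min_partitions(ddf, n_workers):
    """Return `ddf` with at least `n_workers` partitions.

    Hypothetical helper (not part of xgboost or dask). Assumes `ddf`
    behaves like a dask DataFrame: it has an `npartitions` attribute
    and a `repartition(npartitions=...)` method.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be >= 1")
    # Already enough partitions: return the collection unchanged.
    if ddf.npartitions >= n_workers:
        return ddf
    # Otherwise split into one partition per worker.
    return ddf.repartition(npartitions=n_workers)


# Assumed usage with a dask.distributed Client named `client` and
# validation data `X_valid` / `y_valid` as dask DataFrames/Series:
#
#   n_workers = len(client.scheduler_info()["workers"])
#   X_valid = ensure_min_partitions(X_valid, n_workers)
#   y_valid = ensure_min_partitions(y_valid, n_workers)
```

Repartitioning only when the count is too low avoids needlessly reshuffling an eval_set that is already spread out.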
About this issue
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 31 (30 by maintainers)
I also have reports from colleagues that it happens even without a real cluster, in just local cluster mode, which I'll check on.