xgboost: /workspace/xgboost/rabit/include/rabit/internal/utils.h:90: Allreduce failed

@trivialfis

I turned on early stopping, and even in the single-node, 2-GPU case I’m getting this error:

/workspace/xgboost/rabit/include/rabit/internal/utils.h:90: Allreduce failed  

It’s related to https://github.com/dmlc/xgboost/issues/6272, but I get this Allreduce error without the other error, so I wanted to post it separately. However, if one ensures that the eval_set has enough partitions to be spread across the dask workers, one does not hit this problem. The error is a bit confusing by itself, but the worker logs show other errors, such as an empty dataset.
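For anyone hitting the same thing, here is a minimal sketch of the workaround described above, not an official xgboost API. The helper name `ensure_min_partitions` is hypothetical, and it assumes the eval_set pieces are dask DataFrame or Series objects, which expose `npartitions` and `repartition(npartitions=...)`. Having at least as many partitions as workers does not by itself guarantee that dask places one on every worker, so treat this as a mitigation rather than a fix.

```python
def ensure_min_partitions(collection, n_workers):
    """Return `collection` with at least `n_workers` partitions.

    Assumption: `collection` behaves like a dask DataFrame/Series, i.e. it
    has an `npartitions` attribute and a `repartition(npartitions=...)` method.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    if collection.npartitions >= n_workers:
        # Already enough partitions; return the same object untouched.
        return collection
    # Too few partitions: split so every worker can, in principle, get one.
    return collection.repartition(npartitions=n_workers)


# Hypothetical usage, assuming a dask.distributed Client named `client`
# and validation data named `X_valid` / `y_valid`:
#
#   n_workers = len(client.scheduler_info()["workers"])
#   X_valid = ensure_min_partitions(X_valid, n_workers)
#   y_valid = ensure_min_partitions(y_valid, n_workers)
```

The helper only ever increases the partition count, so data that is already well partitioned is passed through unchanged.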

Maybe the error provided by xgboost can be improved.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 31 (30 by maintainers)

Most upvoted comments

I also have reports from colleagues that it happens even without a multi-node cluster, in just local cluster mode, which I’ll check on.