xgboost: Allreduce error when continuing training from the same process (Ray elastic training)
TLDR: Since XGBoost 1.5, XGBoost-Ray’s elastic training fails (it works with XGBoost 1.4). I suspect some state is retained between runs, because training works when all actors are re-created.
XGBoost-Ray uses Ray’s actor model to reduce data loading overhead when remote training workers die.
In the elastic training test, we do the following:
- We start a remote Ray actor on each of four different nodes
- Technically, these are separate long-lived Python processes
- Each process starts a thread that connects to the Rabit tracker and calls the native xgb.train() method
After a number of iterations (15), we kill one of the actors. This actor is then re-started. The other actors are re-used.
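The flow above can be sketched with the standard library only. This is a rough illustration, not XGBoost-Ray code: `Worker`, `run_attempt`, the shard layout, and the round counts are hypothetical stand-ins, and no real training, Rabit tracker, or checkpoint is involved. It only shows that surviving workers keep their loaded data while each attempt runs in fresh threads.

```python
import threading


class Worker:
    """Hypothetical stand-in for a long-lived Ray actor (a separate process in reality)."""

    def __init__(self, rank):
        self.rank = rank
        self.loads = 0        # how often this worker had to load its data
        self.shard = None     # data kept across training attempts
        self.rounds_done = 0  # rounds executed by this worker instance

    def load_data(self):
        # Load the shard only once per worker lifetime, then reuse it.
        if self.shard is None:
            self.loads += 1
            self.shard = list(range(self.rank, 100, 4))
        return self.shard

    def train(self, start_round, end_round):
        # Stand-in for the thread body that would call xgb.train().
        self.load_data()
        for _ in range(start_round, end_round):
            self.rounds_done += 1


def run_attempt(workers, start_round, end_round):
    """Run one training attempt with a fresh thread per worker."""
    threads = [
        threading.Thread(target=w.train, args=(start_round, end_round))
        for w in workers
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


workers = [Worker(rank) for rank in range(4)]
run_attempt(workers, 0, 15)   # first 15 iterations
workers[2] = Worker(2)        # one actor is "killed" and re-started
run_attempt(workers, 15, 30)  # the other three workers are re-used
```

After the second attempt, workers 0, 1, and 3 have loaded their data once and run all 30 rounds, while the replacement worker 2 has loaded its shard fresh and run only the last 15.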
However, when training continues, the existing actors fail with:
  File "/home/ray/anaconda3/lib/python3.7/site-packages/xgboost/training.py", line 196, in train
    early_stopping_rounds=early_stopping_rounds)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/xgboost/training.py", line 81, in _train_internal
    bst.update(dtrain, i, obj)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/xgboost/core.py", line 1682, in update
    dtrain.handle))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/xgboost/core.py", line 218, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [08:05:56] ../rabit/include/rabit/internal/utils.h:90: Allreduce failed
The failure also occurs when training is not restored from a checkpoint.
This does not happen when we re-create all actors.
The bug does not come up in XGBoost < 1.5, only in the latest release.
Are you aware of any changes in XGBoost 1.5 that might retain state across multiple calls to xgboost.train? As explained above, the actors keep their state and the same PID, but each xgb.train() call runs in a separate thread, and that thread is ended for all actors when a single actor fails. We also restart the Rabit tracker between runs (I also tried different ports for the Rabit tracker, with the same result).
Any help would be much appreciated, thanks!
About this issue
- State: closed
- Created 3 years ago
- Comments: 15 (15 by maintainers)
Let me know if there’s anything I can help with. Also, it’s quite pleasant to read the code as it’s very well written. 😉