ray: [Train] XGBoost continued training (resume_from_checkpoint) and get_model fail
What happened + What you expected to happen
After finishing XGBoost training with XGBoostTrainer, I want to continue training from the best checkpoint, but:
- Assigning resume_from_checkpoint fails to load the checkpoint.
- XGBoostTrainer.get_model cannot load the checkpoint either.
The first error occurs when creating a new trainer with resume_from_checkpoint and is quite similar to https://github.com/ray-project/ray/issues/16375:
2023-12-05 10:52:43,353 WARNING experiment_state.py:327 -- Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.
2023-12-05 10:52:53,378 INFO tune.py:1047 -- Total run time: 11.02 seconds (0.14 seconds for the tuning loop).
2023-12-05 10:52:53,393 WARNING experiment_analysis.py:185 -- Failed to fetch metrics for 1 trial(s):
- MyXGBoostTrainer_5c19d_00000: FileNotFoundError('Could not fetch metrics for MyXGBoostTrainer_5c19d_00000: both result.json and progress.csv were not found at /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/MyXGBoostTrainer_5c19d_00000_0_2023-12-05_10-52-42')
When I remove the early-stopping config stop=ExperimentPlateauStopper('train-error', mode='min') from RunConfig, the error message becomes like the second one:
xgboost.core.XGBoostError: [11:25:04] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_c84d8_00000_0_2023-12-05_11-24-22/checkpoint_000019/model.json failed: No such file or directory
Stack trace:
[bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f768b5dc24e]
[bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f768b6086f3]
[bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f768b590731]
[bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f768b5909f9]
[bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fe0959829dd]
[bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fe095982067]
[bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fe09599b1e9]
[bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7fe09599bc95]
[bt] (8) ray::_Inner.train(_PyObject_MakeTpCall+0x3bf) [0x55b5d791b13f]
The second issue might be related to https://github.com/ray-project/ray/issues/41374: either Ray saves the XGBoost model in the legacy binary format, or it cannot load a model with a non-default file name from the checkpoint. The workaround from that issue does not seem to work (a sketch of a manual load follows the stack trace below). During training there are warning logs like this:
(XGBoostTrainer pid=96776, ip=192.168.222.237) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_a3ce0_00000_0_2023-12-05_11-16-11/checkpoint_000015)
(XGBoostTrainer pid=96776, ip=192.168.222.237) [11:16:41] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
And calling XGBoostTrainer.get_model(checkpoint) raises this error:
XGBoostError: [11:16:54] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_a3ce0_00000_0_2023-12-05_11-16-11/checkpoint_000017/model.json failed: No such file or directory
Stack trace:
[bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f49ef86824e]
[bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f49ef8946f3]
[bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f49ef81c731]
[bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f49ef81c9f9]
[bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f4b8c2bd9dd]
[bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f4b8c2bd067]
[bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f4b8c2d61e9]
[bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7f4b8c2d6c95]
[bt] (8) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/bin/python(_PyObject_MakeTpCall+0x3bf) [0x5581bcf8513f]
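Independent of that workaround, here is a minimal sketch of the manual fallback I would expect to work: materialize the checkpoint locally, inspect which model file was actually written, and load it directly. The "model" file name below is an assumption; per the warning above it may be a legacy binary file rather than model.json.
import os
import xgboost as xgb

# Materialize the Ray Checkpoint into a local directory and inspect its contents.
checkpoint_dir = checkpoint.to_directory()
print(os.listdir(checkpoint_dir))

# Load whatever model file is actually present; "model" is an assumed file name
# and should be replaced with whatever os.listdir() shows above.
booster = xgb.Booster()
booster.load_model(os.path.join(checkpoint_dir, "model"))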
Versions / Dependencies
Python 3.8.13
Packages
ray 2.8.1
xgboost-ray 0.1.19
xgboost 2.0.2
OS
Distributor ID: Ubuntu
Description: Ubuntu 18.04.6 LTS
Release: 18.04
Codename: bionic
Reproduction script
The reproduction script is based on the official tutorial "Get Started with XGBoost and LightGBM" (Ray 2.8.0).
Load data and do the first training
import ray
from ray.train.xgboost import XGBoostTrainer
from ray.train import ScalingConfig, RunConfig, CheckpointConfig, FailureConfig
from ray.tune.stopper import ExperimentPlateauStopper
ray.init()
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)
run_config = RunConfig(
    name="XGBoost_ResumeExperiment",
    storage_path="/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug",
    checkpoint_config=CheckpointConfig(
        checkpoint_frequency=1,
        num_to_keep=10,
        checkpoint_at_end=True,
        checkpoint_score_attribute='train-error',
        checkpoint_score_order='min',
    ),
    failure_config=FailureConfig(max_failures=2),
    # Removing this produces a different error message later
    stop=ExperimentPlateauStopper('train-error', mode='min'),
)
scaling_config = ScalingConfig(
    num_workers=3,
    placement_strategy="SPREAD",
    use_gpu=False,
)
trainer = XGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=20,
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
)
result = trainer.fit()
During fitting, warnings like this appear:
(XGBoostTrainer pid=96776, ip=192.168.222.237) [11:16:42] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
(XGBoostTrainer pid=96776, ip=192.168.222.237) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_a3ce0_00000_0_2023-12-05_11-16-11/checkpoint_000018)
Get the Best Checkpoint and Resume
checkpoint = result.get_best_checkpoint('valid-logloss', 'min')
trainer_continue = XGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=20,
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
    resume_from_checkpoint=checkpoint,
)
result_continue = trainer_continue.fit()
With early stopping enabled, this produces an error like the following:
2023-12-05 10:25:41,638 WARNING experiment_state.py:327 -- Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.
2023-12-05 10:25:50,900 INFO tune.py:1047 -- Total run time: 9.96 seconds (0.14 seconds for the tuning loop).
2023-12-05 10:25:50,911 WARNING experiment_analysis.py:185 -- Failed to fetch metrics for 1 trial(s):
- MyXGBoostTrainer_95a7d_00000: FileNotFoundError('Could not fetch metrics for MyXGBoostTrainer_95a7d_00000: both result.json and progress.csv were not found at /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/MyXGBoostTrainer_95a7d_00000_0_2023-12-05_10-25-40')
And without early stopping, an error like this:
xgboost.core.XGBoostError: [11:25:25] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_c84d8_00000_0_2023-12-05_11-24-22/checkpoint_000019/model.json failed: No such file or directory
Stack trace:
[bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f2f6f5dc24e]
[bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f2f6f6086f3]
[bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f2f6f590731]
[bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f2f6f5909f9]
[bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f9976bab9dd]
[bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f9976bab067]
[bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f9976bc41e9]
[bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7f9976bc4c95]
[bt] (8) ray::_Inner.train(_PyObject_MakeTpCall+0x3bf) [0x556d2854d13f]
This is the same error that is raised by:
model = XGBoostTrainer.get_model(checkpoint)
XGBoostError: [11:36:40] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_c84d8_00000_0_2023-12-05_11-24-22/checkpoint_000019/model.json failed: No such file or directory
Stack trace:
[bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f105a97824e]
[bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f105a9a46f3]
[bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f105a92c731]
[bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f105a92c9f9]
[bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f11f73d09dd]
[bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f11f73d0067]
[bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f11f73e91e9]
[bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7f11f73e9c95]
[bt] (8) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/bin/python(_PyObject_MakeTpCall+0x3bf) [0x55ea790af13f]
Issue Severity
High: It blocks me from completing my task.
About this issue
- Original URL
- State: closed
- Created 7 months ago
- Comments: 17 (4 by maintainers)
Thank you for the investigation! The checkpoint_at_end and checkpoint_frequency do indeed go through different codepaths, and I was able to reproduce with checkpoint_frequency=1. I'll put up a fix PR to clean this up!

I tried to print the package versions by doing this
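Roughly something like this (a minimal sketch; the exact snippet may differ), comparing the versions seen by the driver with those seen by a remote task on a worker:
import ray
from importlib.metadata import version

packages = ("ray", "xgboost", "xgboost-ray")

# Report the package versions visible inside a Ray worker process.
@ray.remote
def worker_versions():
    from importlib.metadata import version
    return {pkg: version(pkg) for pkg in packages}

print("driver:", {pkg: version(pkg) for pkg in packages})
print("worker:", ray.get(worker_versions.remote()))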
And got the following logs. I am not sure whether this confirms that all nodes are using the same packages.
I have three machines. I set up the workspace (/mnt/NAS/ShareFolder/MyRepo) in a NAS directory that is accessible from all three machines and mounted under the same directory structure. In the workspace, I created a Python 3.8.13 virtual environment (/mnt/NAS/ShareFolder/MyRepo/MyVenv) with ray==2.8.1, xgboost-ray==0.1.19, and xgboost==2.0.2 installed. I start the cluster like this:
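Roughly the standard ray start flow (the exact commands may differ): head node first, then the two workers, each with the shared virtual environment activated; the port and head-node IP are placeholders.
# on the head node
source /mnt/NAS/ShareFolder/MyRepo/MyVenv/bin/activate
ray start --head --port=6379

# on each of the two worker nodes
source /mnt/NAS/ShareFolder/MyRepo/MyVenv/bin/activate
ray start --address='<head-node-ip>:6379'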
In this script, the RunConfig directs the checkpoints to the NAS share folder.
If the Ray version is inconsistent across nodes, an error is raised when the cluster starts, but I am not sure whether it warns about other packages.