xgboost: Training many models with gpu_hist in Optuna yields ‘parallel_for failed: out of memory’
Hi, I am having an issue with XGBClassifier running out of memory on the GPU, and I tried to work around it by saving the model to disk, deleting it, and loading it back in:
```python
import os, pickle

with open(f'tmp/model_{uid}.pkl', 'wb') as f:
    pickle.dump(self.model, f)   # save, then drop the GPU-backed model
del self.model
with open(f'tmp/model_{uid}.pkl', 'rb') as f:
    self.model = pickle.load(f)  # reload a fresh copy
os.remove(f'tmp/model_{uid}.pkl')
```
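As a variant of the same workaround, the round-trip can stay in memory with `pickle.dumps`/`pickle.loads`, skipping the temporary file. This is only a sketch: the helper name `rebuild_via_pickle` is hypothetical, and the `gc.collect()` call is an assumption that prompting collection might release the old booster's memory sooner.

```python
import gc
import pickle

def rebuild_via_pickle(model):
    """Serialize a model to bytes, drop the original, and return a fresh copy.

    Mirrors the file-based workaround above, but without touching disk.
    """
    blob = pickle.dumps(model)  # serialize to an in-memory bytes object
    del model                   # drop the local reference to the old object
    gc.collect()                # hint Python to collect the old object now
    return pickle.loads(blob)   # reconstruct from the serialized bytes

# Usage with any picklable model, e.g.:
# self.model = rebuild_via_pickle(self.model)
```

Whether this actually frees GPU memory depends on what XGBoost caches per booster, so treat it as a sketch rather than a confirmed fix.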
I am on xgboost 1.3.0 and the models are very small. I am running a hyperparameter optimization with Optuna, with a 1000x bootstrapping CV in each iteration. After 50-120 Optuna iterations, it throws:
```
xgboost.core.XGBoostError: [16:11:48] ../src/tree/updater_gpu_hist.cu:731: Exception in gpu_hist: NCCL failure :unhandled cuda error ../src/common/device_helpers.cu(71)
```

and

```
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: out of memory
```
Looking at nvidia-smi, the process only uses a constant ~210 MB… (RTX TITAN)
My parameter space looks like this:
```python
params = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'tree_method': 'gpu_hist',
    'random_state': self.random_state,
    'predictor': 'cpu_predictor',
    'n_estimators': 100,
    'reg_alpha': 0,
    'reg_lambda': 1,
    'min_child_weight': 1,
    'max_depth': trial.suggest_int('max_depth', 2, 6),
    'gamma': trial.suggest_discrete_uniform('gamma', 0, 10, 0.1),
    'learning_rate': trial.suggest_loguniform('learning_rate', 0.005, 0.5),
    'subsample': trial.suggest_discrete_uniform('subsample', 0.3, 1.0, 0.05),
    'colsample_bytree': trial.suggest_discrete_uniform('colsample_bytree', 0.1, 1.0, 0.1)
}
```
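For reference, a minimal sketch of factoring this space into a per-trial helper, so each trial builds its parameters from scratch. The function name `make_params` is hypothetical, and `trial` stands for any object exposing the Optuna `suggest_*` methods used above; the values mirror the quoted search space.

```python
def make_params(trial, random_state=0):
    """Assemble one trial's parameter dict (mirrors the search space above)."""
    return {
        'booster': 'gbtree',
        'objective': 'binary:logistic',
        'tree_method': 'gpu_hist',
        'random_state': random_state,
        'predictor': 'cpu_predictor',
        'n_estimators': 100,
        'reg_alpha': 0,
        'reg_lambda': 1,
        'min_child_weight': 1,
        'max_depth': trial.suggest_int('max_depth', 2, 6),
        'gamma': trial.suggest_discrete_uniform('gamma', 0, 10, 0.1),
        'learning_rate': trial.suggest_loguniform('learning_rate', 0.005, 0.5),
        'subsample': trial.suggest_discrete_uniform('subsample', 0.3, 1.0, 0.05),
        'colsample_bytree': trial.suggest_discrete_uniform('colsample_bytree', 0.1, 1.0, 0.1),
    }
```

A fresh dict per trial at least rules out stale parameter state; it does not by itself explain the device-side OOM.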
I thought this was related to issue https://github.com/dmlc/xgboost/issues/4668, but I am no longer sure about that.
BTW, everything works fine running the same code on CPU, and other libraries like RAPIDS cuML work fine on GPU.
About this issue
- Original URL
- State: open
- Created 4 years ago
- Comments: 15 (6 by maintainers)
This is my code, which stops after 28 rounds with the errors stated above.