dask-ml: GridSearchCV fails with XGBoost models
The following code fails:

```python
import numpy as np
from dask_ml.model_selection import GridSearchCV, KFold
from dask_xgboost import XGBClassifier

x = np.random.randn(100, 2)
y = np.random.randint(0, 2, 100)  # high is exclusive; randint(0, 1, 100) would yield only zeros
params = {'max_depth': [2, 3]}

clf = GridSearchCV(XGBClassifier(), params, cv=KFold(n_splits=2),
                   scoring='neg_mean_squared_error')
clf.fit(x, y)
```
Stack trace:

```
File "/home/jin/anaconda3/envs/ml-gpu/lib/python3.6/site-packages/dask_ml/model_selection/_normalize.py", line 38, in normalize_estimator
  val = getattr(est, attr)
File "/home/jin/anaconda3/envs/ml-gpu/lib/python3.6/site-packages/xgboost/sklearn.py", line 536, in feature_importances_
  b = self.get_booster()
File "/home/jin/anaconda3/envs/ml-gpu/lib/python3.6/site-packages/xgboost/sklearn.py", line 200, in get_booster
  raise XGBoostError('need to call fit or load_model beforehand')
xgboost.core.XGBoostError: need to call fit or load_model beforehand
```
The error arises because the `normalize_estimator` function tries to read the `feature_importances_` attribute of the XGBoost model, but that attribute only exists once the model has been trained. I believe `normalize_estimator` should tolerate this and move on, but it doesn't, because it catches `AttributeError` and not the `XGBoostError` type that XGBoost actually raises.
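To illustrate the mechanism, here is a minimal, self-contained sketch (all names are hypothetical stand-ins, not the real dask-ml or xgboost code) of an attribute scan that, like `normalize_estimator`, tolerates missing attributes only when they signal absence via `AttributeError`:

```python
class FittedOnlyError(Exception):
    """Stand-in for xgboost.core.XGBoostError (not an AttributeError subclass)."""

class UnfittedModel:
    @property
    def feature_importances_(self):
        # An unfitted XGBClassifier raises XGBoostError here rather than
        # AttributeError, so getattr() propagates it to the caller.
        raise FittedOnlyError('need to call fit or load_model beforehand')

def scan_attributes(est, attrs):
    """Collect attribute values, skipping attributes that don't exist yet."""
    out = {}
    for attr in attrs:
        try:
            out[attr] = getattr(est, attr)
        except AttributeError:
            continue  # tolerated: the attribute is genuinely absent
    return out

# A plain object simply lacks the attribute, so the scan skips it:
print(scan_attributes(object(), ['feature_importances_']))  # prints {}

# The unfitted model raises a non-AttributeError, so the scan blows up:
try:
    scan_attributes(UnfittedModel(), ['feature_importances_'])
except FittedOnlyError as exc:
    print('scan failed:', exc)
```

This reproduces the failure mode reported above: the scan is written to skip truly missing attributes, but a property that raises any other exception type escapes the `except AttributeError` handler.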
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 22 (7 by maintainers)
I'm running into the same issue today. Do I understand correctly that there's no way to use `dask_xgboost.XGBClassifier` with `dask_ml.model_selection.GridSearchCV`?

`dask_xgboost.XGBClassifier` doesn't implement out-of-core training. It loads the data into distributed memory and hands that off to xgboost's distributed runtime. My understanding is that the native `xgboost.dask` classifier does the same thing.

On Mon, Jun 8, 2020 at 9:55 PM Gideon Blinick notifications@github.com wrote:
Ah, I missed that. Thanks @TomAugspurger and @trivialfis .
To recap then, the original code needs to be modified in 3 ways:
So Dask only enters this code through `GridSearchCV`.
Spent some time on passing the estimator checks from scikit-learn; the exception raised here should actually be `NotFittedError`.

I think this is an issue with XGBoost. The error they raise there should subclass `AttributeError`, to make it clear that you're accessing a non-existent attribute.

Can you propose that change on xgboost and link to the discussion here? If they decline that change we could perhaps include `XGBoostError` in the exceptions we catch, but I would need to think about that more.
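A short sketch of why the proposed subclassing matters (the class names below are hypothetical illustrations, not xgboost's real hierarchy): Python's attribute protocol, including `hasattr`, only treats `AttributeError` as "attribute missing", so an exception that subclasses it would be absorbed by existing handlers instead of leaking out.

```python
class ErrorToday(Exception):
    """Models the current behaviour: a plain Exception subclass."""

class ErrorProposed(AttributeError):
    """Models the proposal: signals 'attribute not available' to callers."""

def make_model(error_cls):
    """Build a model whose property raises the given error when unfitted."""
    class Model:
        @property
        def feature_importances_(self):
            raise error_cls('need to call fit or load_model beforehand')
    return Model()

# hasattr() suppresses only AttributeError (and its subclasses):
print(hasattr(make_model(ErrorProposed), 'feature_importances_'))  # prints False

# A non-AttributeError escapes hasattr() entirely:
try:
    hasattr(make_model(ErrorToday), 'feature_importances_')
except ErrorToday as exc:
    print('leaked out of hasattr:', exc)
```

Incidentally, scikit-learn's own `NotFittedError` inherits from both `ValueError` and `AttributeError`, which is what makes it compatible with attribute-probing code like the scan in `normalize_estimator`.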