dask-ml: GridSearchCV fails with XGBoost models

The following code fails:

import numpy as np
from dask_ml.model_selection import GridSearchCV, KFold
from dask_xgboost import XGBClassifier

# toy data: 100 samples, 2 features, binary labels (randint's upper bound is exclusive)
x = np.random.randn(100, 2)
y = np.random.randint(0, 2, 100)

params = {'max_depth': [2, 3]}

clf = GridSearchCV(XGBClassifier(), params, cv=KFold(n_splits=2),
                   scoring='neg_mean_squared_error')
clf.fit(x, y)

Stack trace:

  File "/home/jin/anaconda3/envs/ml-gpu/lib/python3.6/site-packages/dask_ml/model_selection/_normalize.py", line 38, in normalize_estimator
    val = getattr(est, attr)
  File "/home/jin/anaconda3/envs/ml-gpu/lib/python3.6/site-packages/xgboost/sklearn.py", line 536, in feature_importances_
    b = self.get_booster()
  File "/home/jin/anaconda3/envs/ml-gpu/lib/python3.6/site-packages/xgboost/sklearn.py", line 200, in get_booster
    raise XGBoostError('need to call fit or load_model beforehand')
xgboost.core.XGBoostError: need to call fit or load_model beforehand

The error arises because normalize_estimator tries to read the feature_importances_ attribute of the XGBoost model, but that attribute is only usable once the model has been fitted. I believe normalize_estimator is meant to skip attributes it cannot read, but it doesn't here because XGBoostError is not among the exception types it catches.
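A minimal sketch of the failure mode (illustrative only, not dask-ml's or xgboost's actual code): an attribute access that raises something other than AttributeError escapes a handler that only expects AttributeError.

class FakeBoosterError(Exception):      # stands in for xgboost.core.XGBoostError
    pass

class FakeXGBClassifier:
    @property
    def feature_importances_(self):
        # mimics the unfitted-model behaviour seen in the stack trace above
        raise FakeBoosterError('need to call fit or load_model beforehand')

est = FakeXGBClassifier()
try:
    for attr in ('max_depth', 'feature_importances_'):
        try:
            val = getattr(est, attr)    # feature_importances_ raises FakeBoosterError
        except AttributeError:
            val = None                  # missing/unfitted attributes are meant to be skipped
except FakeBoosterError as exc:
    print('not caught by the AttributeError handler:', exc)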

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 22 (7 by maintainers)

Most upvoted comments

I’m running into the same issue today. Do I understand correctly that there’s no way to use dask_xgboost.XGBClassifier with dask_ml.model_selection.GridSearchCV?

dask_xgboost.XGBClassifier doesn’t implement out-of-core training. It loads the data into distributed memory and hands it off to xgboost’s distributed runtime.

My understanding is that the native xgboost.dask.DaskXGBClassifier does the same thing.

On Mon, Jun 8, 2020 at 9:55 PM Gideon Blinick wrote:

Hi @trivialfis https://github.com/trivialfis,

The reason we’d want to use dask’s XGBClassifier is to avoid loading all data into memory at once.

Dask XGBoost seems to have a memory leakage issue though.


Ah, I missed that. Thanks @TomAugspurger and @trivialfis .

To recap then, the original code needs to be modified in 3 ways:

  1. upgrade the xgboost library to 1.1.0.
  2. use the plain xgboost library for the classifier instead of dask_xgboost.
  3. use sklearn.model_selection.KFold instead of dask_ml.model_selection.KFold.

So dask then only enters this code through GridSearchCV; a sketch of the amended snippet is below.
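For concreteness, here is what the amended snippet might look like after those three changes (an untested sketch, assuming xgboost >= 1.1.0 is installed):

import numpy as np
from dask_ml.model_selection import GridSearchCV   # dask is only used for the search
from sklearn.model_selection import KFold          # sklearn's KFold, not dask_ml's
from xgboost import XGBClassifier                  # plain xgboost, not dask_xgboost

x = np.random.randn(100, 2)
y = np.random.randint(0, 2, 100)   # binary labels

params = {'max_depth': [2, 3]}

clf = GridSearchCV(XGBClassifier(), params, cv=KFold(n_splits=2),
                   scoring='neg_mean_squared_error')
clf.fit(x, y)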

Having spent some time on passing the estimator checks from scikit-learn, I think the error raised here should actually be NotFittedError.

I think this is an issue with XGBoost. The error they raise there should subclass AttributeError, to make it clear that you’re accessing a non-existent attribute.

Can you propose that change on xgboost and link to the discussion here? If they decline that change we could perhaps include XGBoostError in the exceptions we catch, but I would need to think about that more.
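To illustrate the suggestion (purely a sketch; the class name here is hypothetical, not XGBoost's actual code): if the unfitted-model error also derived from AttributeError, as scikit-learn's NotFittedError does, an except AttributeError clause like the one in normalize_estimator would already absorb it.

# Hypothetical error type that also derives from AttributeError
class NotFittedBoosterError(AttributeError):
    """Stand-in for what xgboost could raise on this code path."""

class Model:
    @property
    def feature_importances_(self):
        raise NotFittedBoosterError('need to call fit or load_model beforehand')

try:
    getattr(Model(), 'feature_importances_')
except AttributeError:
    # reached, because NotFittedBoosterError is an AttributeError subclass
    print('skipped unfitted attribute')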