scikit-learn: CalibratedClassifierCV doesn't interact properly with Pipeline estimators

Hi,

I’m trying to use CalibratedClassifierCV to calibrate the probabilities from a Gradient Boosted Tree model. The GBM is wrapped in a Pipeline estimator, where the initial stages of the Pipeline convert categoricals (using DictVectorizer) before the GBM is fit. The issue is that when I try to use CalibratedClassifierCV in the same way, with a prefit estimator, it fails as soon as I pass in the data. Here’s a small example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.calibration import CalibratedClassifierCV, _CalibratedClassifier
from sklearn.pipeline import Pipeline

fake_features = [
    {'state':'NY','age':'adult'},
    {'state':'TX','age':'adult'},
    {'state':'VT','age':'child'}
]

labels = [1,0,1]

pipeline = Pipeline([
            ('vectorizer',DictVectorizer()),
            ('clf',RandomForestClassifier())
    ])

pipeline.fit(fake_features, labels)

clf_isotonic = CalibratedClassifierCV(base_estimator=pipeline, cv='prefit', method='isotonic')
clf_isotonic.fit(fake_features, labels)

When running that, I get the following error on the last line:

TypeError: float() argument must be a string or a number, not 'dict'

On the other hand, if I replace the last two lines with the following, things work fine:

clf_isotonic = _CalibratedClassifier(base_estimator=pipeline, method='isotonic')
clf_isotonic.fit(fake_features, labels)

It seems that CalibratedClassifierCV validates the X data before invoking anything on the base estimator (https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/calibration.py#L126). In my case this check seems slightly off, since I’m relying on the pipeline to convert the data into the proper form before feeding it into the estimator.

On the other hand, _CalibratedClassifier doesn’t make this check first, so the code works (i.e. the data is fed into the pipeline, the model is fit, and then probabilities are calibrated appropriately).
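One workaround that seems to sidestep the check (a sketch, not an official recommendation) is to run the fitted transformer stage by hand and calibrate only the already-fitted classifier on the transformed features:

# Possible workaround: vectorize the dict features manually with the fitted
# DictVectorizer, then calibrate just the prefit classifier on numeric data.
Xt = pipeline.named_steps['vectorizer'].transform(fake_features)

clf_isotonic = CalibratedClassifierCV(base_estimator=pipeline.named_steps['clf'],
                                      cv='prefit', method='isotonic')
clf_isotonic.fit(Xt, labels)

Note that any later predict_proba call on clf_isotonic would then also need vectorized features rather than the raw dicts, which loses part of the convenience of the pipeline.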

My use case (which is not reflected in the example) is to use the initial stages of the pipeline to select columns from a dataframe, encode the categoricals, and then fit the model. I then pickle the fitted pipeline (after using GridSearchCV to select hyperparameters). Later on, I can load the model and use it to predict on new data, while abstracting away from what needs to be transformed in the raw data. I now want to calibrate the model after fitting it but ran into this problem.
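For concreteness, the persistence part of that workflow looks roughly like this (a sketch; the filename is made up, and the sklearn.externals.joblib import matches the 0.18-era install noted below, plain joblib also works):

from sklearn.externals import joblib

# persist the fitted pipeline after GridSearchCV has chosen hyperparameters
joblib.dump(pipeline, 'fitted_pipeline.pkl')

# later, in another process
pipeline = joblib.load('fitted_pipeline.pkl')
pipeline.predict(fake_features)  # raw dicts in; the pipeline handles the transformation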

For reference, here’s all my system info:

Linux-3.10.0-514.2.2.el7.x86_64-x86_64-with-redhat-7.3-Maipo
Python 3.6.0 |Continuum Analytics, Inc.| (default, Dec 23 2016, 12:22:00) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.12.0
SciPy 0.18.1
Scikit-Learn 0.18.1

Thanks for reading (and for all of your hard work on scikit-learn!).

Most upvoted comments

I am also interested in knowing the status of the interplay between RandomizedSearchCV (or GridSearchCV) and CalibratedClassifierCV. I currently do hyper-parameter optimization with RandomizedSearchCV (which by default refits the best model on the entire dataset it was given) and then calibrate this model with CalibratedClassifierCV using the cv='prefit' argument.
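A rough sketch of that workflow (made-up data, grid, and split sizes; with cv='prefit' the calibration data has to be disjoint from the data used for the search):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=500, random_state=0)
X_search, X_calib, y_search, y_calib = train_test_split(X, y, test_size=0.25, random_state=0)

# hyper-parameter search; refit=True (the default) refits the best model on all of X_search
search = RandomizedSearchCV(RandomForestClassifier(),
                            {'max_depth': [3, 5, None]}, n_iter=3, cv=3)
search.fit(X_search, y_search)

# calibrate the refit best model on the held-out calibration set
calibrated = CalibratedClassifierCV(base_estimator=search.best_estimator_,
                                    cv='prefit', method='isotonic')
calibrated.fit(X_calib, y_calib)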

For anyone else reading this, I got it to work with model__base_estimator__max_depth, as the parameter is apparently called base_estimator here.
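Concretely, the working grid looks like this (step and parameter names taken from the example further down):

param_grid = {
    # step name 'model' (the CalibratedClassifierCV), then its base_estimator,
    # then the RandomForestClassifier parameter
    'model__base_estimator__max_depth': list(range(2, 10))
}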

It doesn’t seem to work with fit_params though.

fit_params = {
    'model__base_estimator__sample_weight': np.random.random(size=X.shape[0])
}

search = RandomizedSearchCV(my_pipeline, param_grid, cv=3, n_iter=4, iid=False)
search.fit(X, y, **fit_params)

TypeError: fit() got an unexpected keyword argument 'base_estimator__sample_weight'

I guess it would go in this section https://scikit-learn.org/stable/modules/compose.html#nested-parameters
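A hedged guess at what might work instead (untested here): Pipeline forwards fit params to the step named in the prefix, and CalibratedClassifierCV.fit accepts sample_weight directly, so the prefix would stop at the step name rather than reach into base_estimator:

import numpy as np

# same pipeline/search as in the snippets above; only the fit_params key changes
fit_params = {
    'model__sample_weight': np.random.random(size=X.shape[0])
}

search.fit(X, y, **fit_params)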

I think they are asking how to do this scenario.

from sklearn.datasets import make_moons
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.calibration import CalibratedClassifierCV

X, y = make_moons()

my_pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('model', CalibratedClassifierCV(RandomForestClassifier(n_estimators=10), cv=5))
])

param_grid = {
    'model__max_depth': list(range(2, 10))
}

search = RandomizedSearchCV(my_pipeline, param_grid, cv=3, n_iter=4, iid=False)
search.fit(X, y)

How do I modify param_grid so max_depth gets passed to RandomForestClassifier?

This is similar to how VotingClassifier works, but there the estimators parameter takes a list of (name, estimator) tuples, so you can give RandomForestClassifier a name and reference it in param_grid.

rf = RandomForestClassifier(n_estimators=10)

my_pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('model', VotingClassifier([('rf', rf)]))  # we can name it 'rf'
])

param_grid = {
    'model__rf__max_depth': list(range(2, 10))
}

search = RandomizedSearchCV(my_pipeline, param_grid, cv=3, n_iter=4, iid=False)
search.fit(X, y)
search.best_params_

I think the stackoverflow you cite is mostly a misunderstanding of syntax. If you want us to properly understand your issue here, please provide code to explain what you’re doing / trying.

I’m currently running into the same issue.

Is this purely an input-validation issue, i.e. the error is thrown before the encoding of the categorical data is even attempted? Is there a way to disable this input validation locally?

(I am also not familiar with _CalibratedClassifier and how to use it)

So this wasn’t in my example above, but the pipeline gets put into GridSearchCV and then I want to calibrate the model chosen from GridSearchCV. How does GridSearchCV interact with CalibratedClassifierCV?

From my limited understanding, CalibratedClassifierCV produces K models, where each model is constructed from K-1 of the folds, and then averages the predictions from those K models to make a prediction. This seems semantically different from standard CV, where you select the model with the best performance over the K folds but then construct a single model using the entire training data. I’m not sure how to fit CalibratedClassifierCV into that procedure.
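That K-model behaviour can be seen directly on a fitted CalibratedClassifierCV (a small illustration on synthetic data):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=200, random_state=0)

cal = CalibratedClassifierCV(RandomForestClassifier(n_estimators=10), cv=5)
cal.fit(X, y)

print(len(cal.calibrated_classifiers_))  # 5 -- predict_proba averages over these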

The thing that made sense to me with CalibratedClassifierCV was to prefit the model from GridSearchCV and then calibrate it with a different set of data (as per the docs’ recommendation).

I should have noted that in my actual use case, I get a different error

ValueError: could not convert string to float:

which is caused by a column of strings in the DataFrame that I pass. In both cases, it is an error about the input data.