scikit-learn: check_is_fitted has false positives on custom subclasses with private attributes

Description

check_is_fitted has false positives on custom subclasses with private attributes.

I believe check_is_fitted should not look at variables with a leading underscore because 1) that is Python’s convention for private attributes, and 2) the scikit-learn API specifies fitted attributes as those with a trailing underscore (and there is no specification regarding leading underscores, correct?).

Backtracking from PR #14545 where the new check logic was introduced, I noticed that the check for leading underscore was added to cover these two modules:

  • sklearn/neighbors/base.py
  • sklearn/neighbors/lof.py

But perhaps these modules are actually not following the API specification?

Steps/Code to Reproduce

from sklearn.decomposition import PCA
from sklearn.utils.validation import check_is_fitted

class MyPCA(PCA):
    def __init__(self, n_components=None):  # other arguments omitted for brevity
        super().__init__(n_components=n_components)
        self._my_private_attr = 42

mypca = MyPCA()
check_is_fitted(mypca)  # does not raise NotFittedError

Expected Results

check_is_fitted raises NotFittedError even on custom subclasses that have private attributes following the Python convention of a leading underscore.

Actual Results

NotFittedError is not raised.
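For context, the 0.22 behavior can be illustrated with a simplified sketch of the attribute-scan heuristic introduced in PR #14545 (see sklearn/utils/validation.py for the real implementation): any non-dunder instance attribute that starts or ends with a single underscore counts as evidence of fitting. The stand-in classes below are hypothetical, used only to keep the sketch self-contained:

```python
# Hedged sketch of the heuristic: any non-dunder attribute that
# starts OR ends with a single underscore "looks fitted".
def looks_fitted(estimator):
    return any(
        (attr.endswith("_") or attr.startswith("_"))
        and not attr.startswith("__")
        for attr in vars(estimator)
    )

# Hypothetical stand-ins (no scikit-learn import needed):
class Plain:
    def __init__(self):
        self.n_components = None  # ordinary constructor parameter

class WithPrivate(Plain):
    def __init__(self):
        super().__init__()
        self._my_private_attr = 42  # Python-private, not a fitted attribute

print(looks_fitted(Plain()))        # False: nothing looks fitted
print(looks_fitted(WithPrivate()))  # True: leading underscore causes a false positive
```

The leading-underscore branch of the condition is exactly what makes private attributes set in __init__ look like fit state.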

Versions

>>> import sklearn; sklearn.show_versions()

System:
    python: 3.7.3 (default, Aug  8 2019, 19:40:58)  [GCC 5.4.0 20160609]
executable: /media/ale/data/education+research/code/baikal/venv/bin/python
   machine: Linux-4.4.0-170-generic-x86_64-with-debian-stretch-sid

Python dependencies:
       pip: 19.3.1
setuptools: 40.8.0
   sklearn: 0.22
     numpy: 1.17.4
     scipy: 1.3.3
    Cython: None
    pandas: None
matplotlib: None
    joblib: 0.14.0

Built with OpenMP: True

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 37 (28 by maintainers)

Most upvoted comments

Fwiw I am okay with removing the attr.startswith("_") condition. For the sake of the future I think removing it is the better option. But because the intention is for the estimator to use the function internally, it doesn’t matter too much.

Our convention with underscores is fairly implicit. I would prefer a more explicit list of attributes that define all the attributes that will be learnt during training, thus allowing private attributes to be defined in init.

class MyPCA:
    fitted_attributes = ['n_components_']

    def __init__(self, ...):
        self._private_attr = 42

    def fit(self, ...):
        self.n_components_ = 4
        self._learned_attr = 10
check_is_fitted will only check for fitted_attributes. (Yes this kind of goes back to the former version of check_is_fitted)
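A minimal sketch of what such an explicit-list check could look like (fitted_attributes and check_is_fitted_explicit are hypothetical names, not part of the scikit-learn API; NotFittedError is stubbed locally to keep the example self-contained):

```python
class NotFittedError(ValueError, AttributeError):
    """Local stand-in for sklearn.exceptions.NotFittedError."""

def check_is_fitted_explicit(estimator):
    # Only attributes the class declares count as fit state;
    # private attributes set in __init__ are simply ignored.
    attrs = getattr(estimator, "fitted_attributes", [])
    missing = [a for a in attrs if not hasattr(estimator, a)]
    if missing:
        raise NotFittedError(
            f"{type(estimator).__name__} is missing fitted attributes: {missing}"
        )

class MyPCA:
    fitted_attributes = ["n_components_"]

    def __init__(self):
        self._private_attr = 42  # ignored by the check

    def fit(self):
        self.n_components_ = 4
        self._learned_attr = 10
        return self

est = MyPCA()
try:
    check_is_fitted_explicit(est)
    raised = False
except NotFittedError:
    raised = True
print(raised)  # True: the private attribute alone does not count as fitted

check_is_fitted_explicit(est.fit())  # passes after fit
```

This makes the fitted-state contract explicit per class instead of relying on an underscore naming convention.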

I have seen this issue come up in skorch as well, i.e. this change in check_is_fitted is kind of breaking third party estimators.

Take TfidfVectorizer, for example. It sets a non-param private attribute in its init, and that is a reasonable thing to do.

Thanks a lot! Yes, this is a serious problem, i.e., check_is_fitted won’t work for TfidfVectorizer.

from sklearn.utils.validation import check_is_fitted
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
check_is_fitted(vectorizer)
# passes without raising NotFittedError, even though the vectorizer is not fitted

I can imagine reasons to set private attributes in __init__, and as long as they are not derived from parameters they should technically be okay with our APIs.

I have updated https://github.com/scikit-learn/scikit-learn/pull/15947 to remove the check for the leading _ in check_is_fitted. I also updated the TF-IDF classes to use it properly. I think this is ready for merge for 0.22.1.

Do you have a more precise description of specific use cases for this?

I needed it for one of the drafts of the feature_names implementation. But I can’t find it right now, so I’m not sure it actually had a good reason. I know that yellowbrick has a use case in that they want to call fit unless an estimator has been fit before.

The point of the change in check_is_fitted was to allow downstream libraries to check if an estimator was fitted without having to create data, as it might be hard to create the right kind of data to do a try: est.fit() to see if the estimator has been fitted.
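That downstream pattern ("fit unless already fitted, without having to construct data for a trial fit") can be sketched as follows; fit_if_needed is a hypothetical helper, but check_is_fitted with no attributes argument is the real 0.22+ API:

```python
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

def fit_if_needed(estimator, X, y=None):
    """Fit only if the estimator does not already appear fitted."""
    try:
        check_is_fitted(estimator)  # no-argument form, available since 0.22
    except NotFittedError:
        estimator.fit(X, y)
    return estimator
```

For example, calling fit_if_needed(PCA(n_components=2), X) fits the PCA on the first call, while a second call on the returned estimator is a no-op.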

I think generally having an attribute or a method is fine. I didn’t go this way because we generally don’t like extending the API. It would be much more explicit than a badly documented convention though.

I think one of the things we’re coming up against here is again the distinction of third-party use and within-sklearn-use, both of the estimator checks and the utils. We really need to think more about this.

It’s noteworthy that right now, check_estimator does not use check_is_fitted anywhere, so check_is_fitted should not be considered part of the API specification. So it does not enforce anything on third-party classes. I think your point is that it does enforce some restrictions on third-party estimators that inherit from existing estimators.

It might have been better to make the version that has no parameters private from the start, as it was geared towards internal use. If we want a consistent way to check whether something was fitted for anything that’s scikit-learn compatible, indeed check_is_fitted is not the right choice and an attribute or method would be better.

I’m happy with the short-term solution. The question for the medium-term solution is what the purpose of check_is_fitted is. There are two implicit purposes: the original purpose was to make it easy to raise the right exception if an estimator calls predict without being fitted. I abused / extended that for calling it outside the estimator itself, to check the state of an estimator (which was not possible before).

Both of these are useful, but distinct functions, for third party libraries. If we want the ability to check the state from the outside, a method is probably a good way to go. If we then also want to make it easy to raise the right exception, we either need to provide a _check_is_fitted private method, or need to bake it into _validate_data (https://github.com/scikit-learn/enhancement_proposals/pull/22/files), both of which are more “frameworky” than what we usually like.

@alegonz Thanks for the thoughtful reply.

Regarding the short-term solution, I think that it makes sense. Then for the long-term solution, up-to-now I think that scikit-learn avoided using the abstract mechanisms on the BaseEstimator. I would think that it was done on purpose, probably @amueller , @jnothman, @GaelVaroquaux, @ogrisel could give some thoughts. Instead, the check_estimator can be used to ensure that an estimator raises the NotFittedError when it is supposed to do so.

But in general I think your suggestions are good.

We tend to proscribe setting anything in init aside from parameter copies because we don’t want estimators to validate or set derived attributes of these parameters. Doing so can break set_params.

Just to be clear… is check_is_fitted used within the estimator or by meta-estimators? it was clearer in the 0.21 interface that it was for use within the estimator, not by externals like meta-estimators. If that remains the case (and should be more clearly documented), then the fact that some estimators must call check_is_fitted with attributes passed is no problem.

If check_is_fitted is intended to be used by meta-estimators, then I agree, it’s not working well enough. In the future, checking the presence of n_features_in_ should be sufficient.
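That future check would reduce to a simple attribute test. The sketch below assumes the proposed convention that fitted estimators expose n_features_in_ (seems_fitted and the Stub class are hypothetical):

```python
# Hedged sketch: a meta-estimator treating the presence of
# n_features_in_ as the fittedness signal, instead of scanning
# all underscore-named attributes.
def seems_fitted(estimator):
    return hasattr(estimator, "n_features_in_")

class Stub:
    """Hypothetical estimator stand-in that sets n_features_in_ on fit."""
    def fit(self, X):
        self.n_features_in_ = len(X[0])
        return self

print(seems_fitted(Stub()))                   # False before fit
print(seems_fitted(Stub().fit([[1, 2, 3]])))  # True after fit
```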

How about we make it continue to be automagic, except where attributes is specified? I.e., we stop deprecating attributes.

I prefer to revert before we come up with a better solution.

Again, we’ve always done deprecation cycles in utils. cf the dozen of PRs related to https://github.com/scikit-learn/scikit-learn/issues/6616#issuecomment-516479413