scikit-learn: check_is_fitted has false positives on custom subclasses with private attributes
Description
check_is_fitted has false positives on custom subclasses with private attributes.
I believe check_is_fitted should not look at variables with a leading underscore because (1) that is Python’s convention for private attributes, and (2) the scikit-learn API specifies fitted attributes as those with a trailing underscore (and there is no specification regarding leading underscores, correct?).
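For reference, a minimal sketch of that naming convention (class and attribute names are hypothetical): constructor parameters carry no underscore, private attributes get a leading underscore, and fitted attributes get a trailing underscore set during fit.

```python
from sklearn.base import BaseEstimator


class ExampleEstimator(BaseEstimator):
    def __init__(self, alpha=1.0):
        self.alpha = alpha    # constructor parameter: no underscore
        self._cache = None    # private attribute: leading underscore

    def fit(self, X, y=None):
        self.coef_ = 0.0      # fitted attribute: trailing underscore
        return self
```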
Backtracking from PR #14545 where the new check logic was introduced, I noticed that the check for leading underscore was added to cover these two modules:
sklearn/neighbors/base.py
sklearn/neighbors/lof.py
But perhaps these modules are actually not following the API specification?
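For context, the parameterless check introduced in 0.22 boils down to an existence test over the instance’s attributes; a rough sketch of that heuristic (not the verbatim library code) is:

```python
# Rough sketch of the 0.22 heuristic: any non-dunder attribute with a
# trailing OR leading underscore counts as evidence that fit() has run.
def looks_fitted(estimator):
    attrs = [
        v for v in vars(estimator)
        if (v.endswith("_") or v.startswith("_")) and not v.startswith("__")
    ]
    return len(attrs) > 0
```

Under this test, a private attribute such as _my_private_attr set in __init__ is enough to make an unfitted estimator look fitted.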
Steps/Code to Reproduce
from sklearn.decomposition import PCA
from sklearn.utils.validation import check_is_fitted

class MyPCA(PCA):
    def __init__(self):  # constructor arguments omitted for brevity
        super().__init__()
        self._my_private_attr = 42

mypca = MyPCA()
check_is_fitted(mypca)  # does not raise NotFittedError
Expected Results
check_is_fitted raises NotFittedError even on custom subclasses that have private attributes following the Python convention of a leading underscore.
Actual Results
NotFittedError is not raised.
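For contrast, a plain unfitted PCA does raise, since its __init__ stores only constructor parameters (no underscore-prefixed attributes):

```python
from sklearn.decomposition import PCA
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

try:
    check_is_fitted(PCA())
except NotFittedError:
    print("NotFittedError raised, as expected")  # this branch runs
```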
Versions
>>> import sklearn; sklearn.show_versions()
System:
python: 3.7.3 (default, Aug 8 2019, 19:40:58) [GCC 5.4.0 20160609]
executable: /media/ale/data/education+research/code/baikal/venv/bin/python
machine: Linux-4.4.0-170-generic-x86_64-with-debian-stretch-sid
Python dependencies:
pip: 19.3.1
setuptools: 40.8.0
sklearn: 0.22
numpy: 1.17.4
scipy: 1.3.3
Cython: None
pandas: None
matplotlib: None
joblib: 0.14.0
Built with OpenMP: True
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 37 (28 by maintainers)
Fwiw I am okay with removing the attr.startswith("_") condition. For the sake of the future I think removing it is the better option. But because the intention is for the estimator to use the function internally, it doesn’t matter too much.
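With that condition removed, the heuristic would key only on trailing underscores, restoring the documented convention; a hedged sketch of the resulting test:

```python
# Only non-dunder attributes with a trailing underscore count as
# evidence of fitting once the leading-underscore condition is dropped.
def looks_fitted(estimator):
    attrs = [
        v for v in vars(estimator)
        if v.endswith("_") and not v.startswith("__")
    ]
    return len(attrs) > 0
```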
Our convention with underscores is fairly implicit. I would prefer a more explicit list of attributes that defines all the attributes that will be learnt during training, thus allowing private attributes to be defined in __init__. check_is_fitted will only check for fitted_attributes. (Yes, this kind of goes back to the former version of check_is_fitted.)
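Note that check_is_fitted already accepts an explicit attributes argument, which is the spirit of this proposal; for example, a caller can check the attributes PCA is documented to set during fit:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.utils.validation import check_is_fitted

pca = PCA(n_components=2).fit(np.random.rand(10, 4))
# Explicit style: raise NotFittedError unless these fitted attributes exist.
check_is_fitted(pca, attributes=["components_", "mean_"])
```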
I have seen this issue come up in skorch as well, i.e. this change in check_is_fitted is kind of breaking third-party estimators.

Thanks a lot! Yes, this is a serious problem, i.e., check_is_fitted won’t work for TfidfVectorizer.
I can imagine reasons to set private attributes in __init__, and as long as they are not derived from parameters they should technically be okay with our APIs.

I have updated https://github.com/scikit-learn/scikit-learn/pull/15947 to remove the check for the leading _ in check_is_fitted. I also updated the TF-IDF classes to use it properly. I think this is ready for merge for 0.22.1.

I needed it for one of the drafts of the feature_names implementation. But I can’t find it right now, so I’m not sure it actually had a good reason. I know that yellowbrick has a use case in that they want to call fit unless an estimator has been fit before.
The point of the change in check_is_fitted was to allow downstream libraries to check whether an estimator was fitted without having to create data, as it might be hard to create the right kind of data just to do a try: est.fit() and see whether the estimator has been fitted.

I think generally having an attribute or a method is fine. I didn’t go this way because we generally don’t like extending the API. It would be much more explicit than a badly documented convention, though.
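To illustrate the first point above (why a data-free check is attractive), the probe-based alternative might look like the sketch below; everything here is hypothetical:

```python
import numpy as np
from sklearn.exceptions import NotFittedError


def seems_fitted_via_predict(estimator):
    X_probe = np.zeros((1, 3))  # guessing an input shape; often wrong
    try:
        estimator.predict(X_probe)
    except NotFittedError:
        return False
    # Fragile: a fitted estimator can still raise (uncaught) if the
    # probe has the wrong shape or dtype for this particular estimator.
    return True
```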
I think one of the things we’re coming up against here is again the distinction of third-party use and within-sklearn-use, both of the estimator checks and the utils. We really need to think more about this.
It’s noteworthy that right now, check_estimator does not use check_is_fitted anywhere, so check_is_fitted should not be considered part of the API specification. So it does not enforce anything on third-party classes. I think your point is that it does enforce some restrictions on third-party estimators that inherit from existing estimators.

It might have been better to make the version that has no parameters private from the start, as it was geared towards internal use. If we want a consistent way to check whether something was fitted for anything that’s scikit-learn compatible, indeed check_is_fitted is not the right choice and an attribute or method would be better.
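As a postscript on the attribute-or-method idea: scikit-learn did eventually, in later releases, adopt a __sklearn_is_fitted__ method that check_is_fitted consults; a minimal sketch (class name hypothetical):

```python
from sklearn.base import BaseEstimator


class FlagFitted(BaseEstimator):
    def fit(self, X, y=None):
        self._is_fitted = True  # private flag, invisible to name heuristics
        return self

    def __sklearn_is_fitted__(self):
        # In later scikit-learn versions, check_is_fitted calls this
        # instead of scanning attribute names.
        return getattr(self, "_is_fitted", False)
```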
I’m happy with the short-term solution. The question for the medium-term solution is what the purpose of check_is_fitted is. There are two implicit purposes: the original purpose was to make it easy to raise the right exception if an estimator calls predict without being fitted. I abused / extended that for calling it outside the estimator itself, to check the state of an estimator (which was not possible before). Both of these are useful, but distinct, functions for third-party libraries. If we want the ability to check the state from the outside, a method is probably a good way to go. If we then also want to make it easy to raise the right exception, we either need to provide a _check_is_fitted private method, or need to bake it into _validate_data (https://github.com/scikit-learn/enhancement_proposals/pull/22/files), both of which are more “frameworky” than what we usually like.
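A sketch of the first, internal purpose (estimator name hypothetical): the estimator guards its own predict with check_is_fitted so the user gets a NotFittedError rather than an AttributeError.

```python
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_is_fitted


class MyRegressor(BaseEstimator):
    def fit(self, X, y):
        self.coef_ = 0.0  # stand-in for learned state
        return self

    def predict(self, X):
        check_is_fitted(self)  # raises NotFittedError before touching X
        return [self.coef_] * len(X)
```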
@alegonz Thanks for the thoughtful reply. Regarding the short-term solution, I think that it makes sense. As for the long-term solution, up to now I think scikit-learn has avoided using the abstract mechanisms on the BaseEstimator. I would think that it was done on purpose; @amueller, @jnothman, @GaelVaroquaux, @ogrisel could probably give some thoughts. Instead, check_estimator can be used to ensure that an estimator raises the NotFittedError when it is supposed to do so.

But in general I think your suggestions are good.
We tend to proscribe setting anything in __init__ aside from parameter copies because we don’t want estimators to validate or set derived attributes of these parameters. Doing so can break set_params.
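A minimal sketch of how a derived attribute set in __init__ goes stale (class name hypothetical):

```python
import numpy as np
from sklearn.base import BaseEstimator


class BadEstimator(BaseEstimator):
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self._log_alpha = np.log(alpha)  # derived from a parameter: trouble


est = BadEstimator(alpha=1.0)
est.set_params(alpha=2.0)  # updates est.alpha only
# est._log_alpha still reflects alpha=1.0 -- the two are now inconsistent
```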
Just to be clear… is check_is_fitted used within the estimator or by meta-estimators? It was clearer in the 0.21 interface that it was for use within the estimator, not by externals like meta-estimators. If that remains the case (and should be more clearly documented), then the fact that some estimators must call check_is_fitted with attributes passed is no problem. If check_is_fitted is intended to be used by meta-estimators, then I agree, it’s not working well enough. In the future, checking the presence of n_features_in_ should be sufficient.
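A hedged sketch of that forward-looking check (helper name hypothetical): n_features_in_ is set during fit, so its presence can serve as the fittedness signal for meta-estimators.

```python
def seems_fitted(estimator):
    # n_features_in_ is only set once fit() has validated input data.
    return hasattr(estimator, "n_features_in_")
```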
How about we make it continue to be automagic, except where attributes is specified? I.e. we stop deprecating attributes.
I prefer to revert before we come up with a better solution.
Again, we’ve always done deprecation cycles in utils. Cf. the dozen PRs related to https://github.com/scikit-learn/scikit-learn/issues/6616#issuecomment-516479413