scikit-learn: Duplicate check_finite when calling scipy.linalg functions
Most functions in scipy.linalg
functions (e.g. svd
, qr
, eig
, eigh
, pinv
, pinv2
…) have a default kwarg check_finite=True
that we typically leave to the default value in scikit-learn.
As we already validate the input data for most estimators in scikit-learn, this check is redundant and can cause significant overhead, especially at predict / transform time. We should probably always call those method with an explicit check_finite=False
in scikit-learn.
This issue shall probably be addressed in many PRs, probably one per module that imports scipy.linalg
.
We should still make sure that the estimators raise a ValueError
with the expected error message when fed with numpy arrays with infinite some values (-np.inf
, np.inf
or np.nan
). This can be done manually by calling sklearn.utils.estimator_checks.check_estimators_nan_inf
on the estimator, which should be automatically be called by sklearn.tests.test_common
but we need to check that it’s actually the case when reviewing such PRs.
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 2
- Comments: 34 (18 by maintainers)
Please do not discuss who is working on what in the comments of this issue. Instead try to read the source code of the functions I mentioned in the description, consider first working on a module that has a few occurrences of the
scipy.linalg
function and open a first pull request that mentions this issue number in the description (#18837 in this case) to automatically link your PR to this issue and the module name in the title, e.g. “Use check_finite=False in sklearn.module_name”. Then automatically we will see who is working on what in this issue and can help individual contributors in the review of their own PR.If this is your first time contribution, don’t forget to follow the guide (including the video tutorials):
https://scikit-learn.org/dev/developers/contributing.html#contributing-code
In general, I think we should only set
check_finite=False
only if we are completely sure that the input is checked before.@sadakmed actually even for the case of
np.dot
I think it may be non finite. e.g. on my machine I have:(e.g. https://github.com/scikit-learn/scikit-learn/pull/18848/files#r531103450)
@manjitullal The check_finite=True are implicit (they are default arguments in scipy), they are not specified in sklearn code. So I think this issue should be open as of now.
Hi @LuiGiovanni 👋 If you are still willing to work on this, the
gaussian_process
module andkernel_approximation.py
need inspection. You can open a dedicated PR for those modules as done previously for other modules (for instance #18845).@ogrisel @thomasjpfan It would be great if we could have a maintainer confirmation (or infirmation?) that we need the default check_finite=True for cases like this: https://github.com/scikit-learn/scikit-learn/pull/18850/files#r526344155 https://github.com/scikit-learn/scikit-learn/pull/18848/files#r531103450 https://github.com/scikit-learn/scikit-learn/pull/18909/files#r531115673
I wrote duplicated comments telling that we need the default check_finite=True but I do not know if my reasons are valid.
i see that none of the " git grep linalg – sklearn | grep import | grep scipy | grep -v cython_blas " files have check_finite=True anymore, looks like its fixed. Let me know if this still requires work, else this issue could be closed.