scikit-learn: Duplicate check_finite when calling scipy.linalg functions

Most functions in scipy.linalg functions (e.g. svd, qr, eig, eigh, pinv, pinv2 …) have a default kwarg check_finite=True that we typically leave to the default value in scikit-learn.

As we already validate the input data for most estimators in scikit-learn, this check is redundant and can cause significant overhead, especially at predict / transform time. We should probably always call those method with an explicit check_finite=False in scikit-learn.

This issue shall probably be addressed in many PRs, probably one per module that imports scipy.linalg.

We should still make sure that the estimators raise a ValueError with the expected error message when fed with numpy arrays with infinite some values (-np.inf, np.inf or np.nan). This can be done manually by calling sklearn.utils.estimator_checks.check_estimators_nan_inf on the estimator, which should be automatically be called by sklearn.tests.test_common but we need to check that it’s actually the case when reviewing such PRs.

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 2
  • Comments: 34 (18 by maintainers)

Most upvoted comments

Please do not discuss who is working on what in the comments of this issue. Instead try to read the source code of the functions I mentioned in the description, consider first working on a module that has a few occurrences of the scipy.linalg function and open a first pull request that mentions this issue number in the description (#18837 in this case) to automatically link your PR to this issue and the module name in the title, e.g. “Use check_finite=False in sklearn.module_name”. Then automatically we will see who is working on what in this issue and can help individual contributors in the review of their own PR.

If this is your first time contribution, don’t forget to follow the guide (including the video tutorials):

https://scikit-learn.org/dev/developers/contributing.html#contributing-code

In general, I think we should only set check_finite=False only if we are completely sure that the input is checked before.

@sadakmed actually even for the case of np.dot I think it may be non finite. e.g. on my machine I have:

>>> import numpy as np
>>> X = np.array([1, 1e300])
>>> np.dot(X.T, X)
inf

(e.g. https://github.com/scikit-learn/scikit-learn/pull/18848/files#r531103450)

@manjitullal The check_finite=True are implicit (they are default arguments in scipy), they are not specified in sklearn code. So I think this issue should be open as of now.

Hi @LuiGiovanni 👋 If you are still willing to work on this, the gaussian_process module and kernel_approximation.py need inspection. You can open a dedicated PR for those modules as done previously for other modules (for instance #18845).

@ogrisel @thomasjpfan It would be great if we could have a maintainer confirmation (or infirmation?) that we need the default check_finite=True for cases like this: https://github.com/scikit-learn/scikit-learn/pull/18850/files#r526344155 https://github.com/scikit-learn/scikit-learn/pull/18848/files#r531103450 https://github.com/scikit-learn/scikit-learn/pull/18909/files#r531115673

I wrote duplicated comments telling that we need the default check_finite=True but I do not know if my reasons are valid.

i see that none of the " git grep linalg – sklearn | grep import | grep scipy | grep -v cython_blas " files have check_finite=True anymore, looks like its fixed. Let me know if this still requires work, else this issue could be closed.