scipy: BUG: scipy.linalg.svd produces spurious results for dtype float32 when using sklearn.decomposition.PCA

Describe your issue.

I have identified a bug when using sklearn.decomposition.PCA, which I reported here: https://github.com/scikit-learn/scikit-learn/discussions/27946

I feel it is appropriate to report the issue here as well, since the svd_solver='full' path calls scipy.linalg.svd to perform the SVD. The sklearn discussion has gone quiet, and I feel the documentation needs to be updated to reflect this issue. Essentially, a dtype float32 input array produces different (worse) results than the exact same array with dtype float64; that is, the maximum discrepancy (np.max(np.abs(X_diff))) changes from one dtype to the other. I have synthetic datasets that can be used to test this (X_t_test_noise_float32.npy and X_t_test_noise_float64.npy), but I don’t know how to supply them.

Reproducing Code Example

import numpy as np
from sklearn.decomposition import PCA

X = np.load('X_t_test_noise_float32.npy')
# X = np.load('X_t_test_noise_float64.npy')

# Percentage of variance to retain in PCA
variance = 0.999

# PCA model fitting and transformations
pca_model = PCA(n_components=variance, svd_solver='auto')
X_compressed = pca_model.fit_transform(X)
X_reconstructed = pca_model.inverse_transform(X_compressed)
print('Original shape:', X.shape)
print('Compressed shape:', X_compressed.shape)
print('Compression factor:', X.shape[1]//X_compressed.shape[1])

print('X_reconstructed mean:', np.mean(X_reconstructed))
print('X_reconstructed max:', np.max(X_reconstructed))
print('X_reconstructed min:', np.min(X_reconstructed))

# Calculate root-mean-square error with respect to the original data
X_diff = X - X_reconstructed
rmse = np.sqrt(np.mean(X_diff**2))
print('RMSE:', rmse)
print('Max discrepancy:', np.max(np.abs(X_diff)))

Error message

N/A

SciPy/NumPy/Python version and system information

SciPy 1.7.3, NumPy 1.21.6, Python 3.7.12
lapack_mkl_info:
    libraries = ['blas', 'cblas', 'lapack', 'pthread', 'blas', 'cblas', 'lapack']
    library_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/include']
lapack_opt_info:
    libraries = ['blas', 'cblas', 'lapack', 'pthread', 'blas', 'cblas', 'lapack', 'blas', 'cblas', 'lapack']
    library_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/include']
blas_mkl_info:
    libraries = ['blas', 'cblas', 'lapack', 'pthread', 'blas', 'cblas', 'lapack']
    library_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/include']
blas_opt_info:
    libraries = ['blas', 'cblas', 'lapack', 'pthread', 'blas', 'cblas', 'lapack', 'blas', 'cblas', 'lapack']
    library_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/include']

About this issue

  • State: closed
  • Created 5 months ago
  • Comments: 21 (13 by maintainers)

Most upvoted comments

Ah, the test code you posted used np.linalg.svd() instead of scipy.linalg.svd(), which is what PCA uses. When I change it to use scipy.linalg.svd(), I do see more of a discrepancy in the float32 branch, and that would explain the differences with PCA. It turns out that np.linalg.svd() will use the double-precision DGESDD LAPACK driver regardless of whether the dtype of the input is float64 or not, whereas scipy.linalg.svd() will use the single-precision SGESDD driver when given a float32 input.
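For illustration, here is a minimal sketch (not part of the original thread) that runs both SVD implementations on the same data in float32 and float64 and compares the observed reconstruction error; the random matrix and its shape are stand-ins for the reporter's X_t_test_noise arrays, which are not attached to the issue.

import numpy as np
import scipy.linalg

rng = np.random.default_rng(0)
X64 = rng.standard_normal((2000, 500))   # stand-in for the reporter's data
X32 = X64.astype(np.float32)

for label, svd in (('np.linalg.svd', np.linalg.svd), ('scipy.linalg.svd', scipy.linalg.svd)):
    for X in (X32, X64):
        U, s, Vh = svd(X, full_matrices=False)
        X_rec = (U * s) @ Vh             # reconstruct without forming diag(s)
        print(label, X.dtype, 'max |X - U S Vh|:', np.max(np.abs(X - X_rec)))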

The norm(X_diff) that I get with scipy.linalg.svd() is about 2500 times S.max()*eps; whether that is accounted for by the “slowly growing function of m and n” multiplier, I’m not sure. Probably some of it also comes from the extra numerical error accumulated when reconstructing the array from U @ S @ Vh in float32 arithmetic, rather than the infinite-precision arithmetic that the error analysis implicitly assumes for that operation (that analysis is not quite talking about your scenario, where you reconstruct the input with finite-precision operations). Given the large n, this amount of error certainly does not surprise me.
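As a rough sketch of the comparison described above (assuming the reporter's float32 file is available locally), the reconstruction error can be put next to the s.max()*eps scale that SVD backward-error statements are usually expressed against:

import numpy as np
import scipy.linalg

X = np.load('X_t_test_noise_float32.npy')    # the reporter's float32 array
U, s, Vh = scipy.linalg.svd(X, full_matrices=False)
X_diff = X - (U * s) @ Vh                    # reconstruction done in float32 arithmetic
eps = np.finfo(X.dtype).eps
print('norm(X_diff):', np.linalg.norm(X_diff))
print('s.max() * eps:', s.max() * eps)
print('ratio:', np.linalg.norm(X_diff) / (s.max() * eps))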

In general, I’m not sure that there is anything we could have said that would have saved you much time. The statements we can make are fairly weak and somewhat distant from what you actually want. AFAIK, no rigorous bound on the numerical reconstruction error has been published for your exact scenario. As general background knowledge, you do need to know that float32 will involve significantly more error than float64, but frequently you will just have to try it and see to find out how much.
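One concrete way to “try it and see”, not prescribed in the thread but consistent with it, is to cast the float32 data to float64 before fitting and compare the resulting reconstruction error. The file name and n_components below are taken from the report; svd_solver='full' is used so the scipy.linalg.svd path mentioned in the report is exercised.

import numpy as np
from sklearn.decomposition import PCA

X32 = np.load('X_t_test_noise_float32.npy')

def max_discrepancy(X):
    # Fit PCA, reconstruct, and report the maximum absolute reconstruction error.
    pca = PCA(n_components=0.999, svd_solver='full')
    X_rec = pca.inverse_transform(pca.fit_transform(X))
    return np.max(np.abs(X - X_rec))

print('float32 input:', max_discrepancy(X32))
print('cast to float64 first:', max_discrepancy(X32.astype(np.float64)))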

Hi @matthew-brett, thank you for responding to my question about how to provide the data. I will have a go at your suggestion; it might take me a while, though.