scipy: BUG: scipy.linalg.svd produces spurious results for dtype float32 when using sklearn.decomposition.PCA
Describe your issue.
I have identified a bug when using sklearn.decomposition.PCA, which is reported here: https://github.com/scikit-learn/scikit-learn/discussions/27946
I feel it is appropriate to report the issue here as well, since the svd_solver='full' argument calls scipy.linalg.svd to perform the SVD. The sklearn discussion has gone quiet, and I feel the documentation needs to be updated to reflect this issue. Essentially, dtype float32 input arrays produce different (worse) results than the exact same array with dtype float64, i.e. the maximum discrepancy (np.max(np.abs(X_diff))) changes from one dtype to the other. I have synthetic datasets that can be used to test this (X_t_test_noise_float32.npy and X_t_test_noise_float64.npy), but I don’t know how to supply them.
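Since I am not able to attach the .npy files here, a minimal sketch along these lines (using a synthetic stand-in array; the shape, random seed, and values are placeholders rather than the actual test data) can be used to check the dtype dependence of scipy.linalg.svd directly, without going through sklearn:

import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
# Synthetic placeholder for the real dataset; the shape is chosen arbitrarily.
X64 = rng.standard_normal((2000, 500))
X32 = X64.astype(np.float32)

for X in (X64, X32):
    # Thin SVD, then reconstruct in the same dtype as the input
    U, s, Vh = linalg.svd(X, full_matrices=False)
    X_reconstructed = (U * s) @ Vh
    X_diff = X - X_reconstructed
    print(X.dtype,
          'max discrepancy:', np.max(np.abs(X_diff)),
          'RMSE:', np.sqrt(np.mean(X_diff**2)))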
Reproducing Code Example
import numpy as np
from sklearn.decomposition import PCA
X = np.load('X_t_test_noise_float32.npy')
# X = np.load('X_t_test_noise_float64.npy')
# Percentage of variance to retain in PCA
variance = 0.999
# PCA model fitting and transformations
# (with a fractional n_components, svd_solver='auto' should resolve to the
#  'full' solver, which performs the SVD via scipy.linalg.svd)
pca_model = PCA(n_components=variance, svd_solver='auto')
X_compressed = pca_model.fit_transform(X)
X_reconstructed = pca_model.inverse_transform(X_compressed)
print('Original shape:', X.shape)
print('Compressed shape:', X_compressed.shape)
print('Compression factor:', X.shape[1]//X_compressed.shape[1])
print('X_reconstructed mean:', np.mean(X_reconstructed))
print('X_reconstructed max:', np.max(X_reconstructed))
print('X_reconstructed min:', np.min(X_reconstructed))
# Calculate root-mean-square error with respect to the original data
X_diff = X - X_reconstructed
rmse = np.sqrt(np.mean(X_diff**2))
print('RMSE:', rmse)
print('Max discrepancy:', np.max(np.abs(X_diff)))
Error message
N/A
SciPy/NumPy/Python version and system information
SciPy 1.7.3, NumPy 1.21.6, Python sys.version_info(major=3, minor=7, micro=12, releaselevel='final', serial=0)
lapack_mkl_info:
libraries = ['blas', 'cblas', 'lapack', 'pthread', 'blas', 'cblas', 'lapack']
library_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/include']
lapack_opt_info:
libraries = ['blas', 'cblas', 'lapack', 'pthread', 'blas', 'cblas', 'lapack', 'blas', 'cblas', 'lapack']
library_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/include']
blas_mkl_info:
libraries = ['blas', 'cblas', 'lapack', 'pthread', 'blas', 'cblas', 'lapack']
library_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/include']
blas_opt_info:
libraries = ['blas', 'cblas', 'lapack', 'pthread', 'blas', 'cblas', 'lapack', 'blas', 'cblas', 'lapack']
library_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/data/users/jbowyer/conda/envs/machine_learning/include']
About this issue
- Original URL
- State: closed
- Created 5 months ago
- Comments: 21 (13 by maintainers)
Ah, the test code you posted used np.linalg.svd() instead of scipy.linalg.svd() like PCA uses. When I change it to use scipy.linalg.svd(), I do see more of a discrepancy in the float32 branch, and that would explain the differences with PCA. It turns out that np.linalg.svd() will use the double-precision DGESDD LAPACK driver regardless of whether the dtype of the input is float64 or not. scipy.linalg.svd() will use the single-precision SGESDD driver when given a float32 input.

The norm(X_diff) that I get with scipy.linalg.svd() is about 2500 times S.max()*eps; whether that is accounted for by the “slowly growing function of m and n” multiplier, I’m not sure. Probably some of it also comes from the extra numerical error accumulated when the array is reconstructed from U @ S @ Vh in float32 arithmetic, rather than the infinite-precision arithmetic implicitly assumed for that operation by the error analysis (which is not quite talking about your scenario, where you numerically reconstruct the input with finite-precision operations). Given the large n, this amount of error certainly does not surprise me.

In general, I’m not sure there is anything we could have said that would have saved you much time. The statements we can make are fairly weak and somewhat distant from what you actually want. AFAIK, there is no rigorous published bound on the numerical reconstruction error for your exact scenario. As general background knowledge, you do need to know that float32 will involve significantly more error than float64, but frequently you will just have to try it and see how much.

Hi @matthew-brett, thank you for acknowledging my request to provide the data. I will have a go at your suggestion; it might take me a while, though.
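A minimal sketch of the comparison described above, using a synthetic float32 array (the shape, seed, and data are assumptions, not the reporter’s dataset): it runs the same input through np.linalg.svd and scipy.linalg.svd and compares each reconstruction discrepancy with the S.max()*eps scale mentioned in the comment.

import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
# Synthetic stand-in data; the shape is an arbitrary assumption.
X = rng.standard_normal((2000, 500)).astype(np.float32)
eps = np.finfo(np.float32).eps

# NumPy's SVD (reported above to route through the double-precision driver)
U, s, Vh = np.linalg.svd(X, full_matrices=False)
print('np.linalg.svd    discrepancy:', np.linalg.norm(X - (U * s) @ Vh),
      ' S.max()*eps:', s.max() * eps)

# SciPy's SVD (single-precision driver when given float32 input)
U, s, Vh = linalg.svd(X, full_matrices=False)
print('scipy.linalg.svd discrepancy:', np.linalg.norm(X - (U * s) @ Vh),
      ' S.max()*eps:', s.max() * eps)

If the extra float32 error matters for the application, one option (not tested here) is to cast the input array to float64 before the SVD or PCA and cast the results back afterwards.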