scikit-learn: Incremental PCA - ValueError: array must not contain infs or NaNs

I’m trying to use IncrementalPCA from sklearn.decomposition. My code couldn’t really be simpler:

from sklearn.decomposition import IncrementalPCA
import pandas as pd

with open('C:/My/File/Path/file.csv', 'r') as fp:
    data = pd.read_csv(fp)

ipca = IncrementalPCA(n_components=4)
ipca.fit(data)

but this is how it finishes when launched:

C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py:293: RuntimeWarning: overflow encountered in long_scalars
  np.sqrt((self.n_samples_seen_ * n_samples) /
C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py:293: RuntimeWarning: invalid value encountered in sqrt
  np.sqrt((self.n_samples_seen_ * n_samples) /
Traceback (most recent call last):
File "C:/Users/myuser/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/scratch_9.py", line 6, in <module>
  ipca.fit(data)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py", line 215, in fit
  self.partial_fit(X_batch, check_input=False)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py", line 298, in partial_fit
  U, S, V = linalg.svd(X, full_matrices=False)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\scipy\linalg\decomp_svd.py", line 106, in svd
  a1 = _asarray_validated(a, check_finite=check_finite)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\scipy\_lib\_util.py", line 263, in _asarray_validated
  a = toarray(a)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\numpy\lib\function_base.py", line 498, in asarray_chkfinite
  raise ValueError(
ValueError: array must not contain infs or NaNs

Process finished with exit code 1

I already checked:

There are no NaN, infinite or negative values anywhere in my data.
I had scikit-learn v0.22.2.post1 and updated to 0.23.1; no difference.
If I use PCA instead of IncrementalPCA and leave everything else the same, everything works fine: no warnings, no errors.
I tried both data = pd.read_csv(fp, dtype='Int64') and data = pd.read_csv(fp, dtype=np.float64), with no difference in results.
There were similar issues in previous versions, but they refer to versions around 0.16/0.17. Most involved more complex code and, as far as I know, all were fixed around those versions.
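For anyone wanting to reproduce the first check, here is a minimal sketch of how the values could be verified. The helper name summarize_values and the tiny sample array are made up for illustration; on the real data you would pass the DataFrame (or data.to_numpy()) instead.

```python
import numpy as np

def summarize_values(X):
    """Return basic sanity facts about a numeric array-like (hypothetical helper)."""
    arr = np.asarray(X, dtype=np.float64)
    return {
        # False if any entry is NaN, +inf or -inf
        "all_finite": bool(np.isfinite(arr).all()),
        # True if any entry is strictly below zero
        "any_negative": bool((arr < 0).any()),
        # Distinct values present; for 0/1 data this should be [0.0, 1.0]
        "unique_values": np.unique(arr).tolist(),
    }

# Tiny stand-in for the real 0/1 matrix
sample = np.array([[0, 1, 1], [1, 0, 0]])
report = summarize_values(sample)
print(report)
```

If all_finite is True here and the error still appears, the NaNs are being produced inside the fit rather than coming from the input.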

My data, exactly as I feed it to the above code. It is really just 243 columns x 2,000,000 rows of 0s and 1s. https://drive.google.com/file/d/1JBIliADt9TViTk8qjnmIS3RFEO934dY6/view?usp=sharing

Update: It seems the issue is related to the dataset size. Fitting on a smaller portion works fine up to around 1,800,000 rows; beyond that, the error starts showing.
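The threshold of roughly 1,800,000 rows is consistent with a 32-bit integer overflow in the n_samples_seen_ * n_samples product that the warning points at. Below is a rough sketch of the arithmetic. It rests on two assumptions that are not confirmed in this thread: that the batch size is the documented default of 5 * n_features, and that the NumPy build uses a 32-bit default integer (as was typical on Windows at the time). The explicit int32 arrays are only there to simulate that behaviour on any platform.

```python
import numpy as np

n_features = 243
batch_size = 5 * n_features          # assumed default batch size: 1215 rows
int32_max = np.iinfo(np.int32).max   # 2147483647

# Smallest row count whose product with batch_size no longer fits in int32
threshold_rows = int32_max // batch_size + 1
print(threshold_rows)  # 1767477

# Simulate the product with explicit 32-bit integers (wraps around silently)
seen = np.array([1_800_000], dtype=np.int32)
batch = np.array([batch_size], dtype=np.int32)
wrapped = seen * batch

# The same product with Python's arbitrary-precision integers
exact = int(seen[0]) * int(batch[0])
print(exact, wrapped[0])  # 2187000000 -2107967296

# A negative value under the square root yields NaN
with np.errstate(invalid="ignore"):
    root = np.sqrt(wrapped.astype(np.float64))
print(root)  # [nan]
```

Under these assumptions the product first exceeds the int32 range at about 1.77 million rows seen, which lines up with the point where the error begins, and a NaN produced at that step would then trip the finiteness check inside the SVD call.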

About this issue

State: closed
Created 4 years ago
Comments: 17 (7 by maintainers)

Most upvoted comments

Glad the change worked. I’m still working on writing some tests before submitting another push.

allanbutler on Jun 29, 2020

Now that I have uninstalled and reinstalled everything, including the change from the pull request, the error is no longer present. Thanks for the feedback.

johnny-mueller on Jun 29, 2020