scipy: Sparse COO.toarray fails if nnz > 2**31
While handling of sparse arrays with more than 2**31 non-zero elements was enabled in #442, it looks like the conversion from CSR to a dense array still fails in this case with an error in coo_todense.
In [1]: import scipy.sparse
   ...: import numpy as np
   ...: X = scipy.sparse.csr_matrix(np.ones((2**16, 1)))
   ...: X
Out[1]:
<65536x1 sparse matrix of type '<class 'numpy.float64'>'
        with 65536 stored elements in Compressed Sparse Row format>
In [2]: (X*X[:2**15-1].T).toarray()
Out[2]:
array([[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
...,
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.]])
In [3]: (X*X[:2**15].T).toarray()
ValueError Traceback (most recent call last)
<ipython-input-6-ee8192b524ea> in <module>()
----> 1 (X*X[:2**15].T).toarray()
/home/ec2-user/miniconda3/envs/test-env/lib/python3.6/site-packages/scipy/sparse/compressed.py in toarray(self, order, out)
918 def toarray(self, order=None, out=None):
919 """See the docstring for `spmatrix.toarray`."""
--> 920 return self.tocoo(copy=False).toarray(order=order, out=out)
921
922 ##############################################################
/home/ec2-user/miniconda3/envs/test-env/lib/python3.6/site-packages/scipy/sparse/coo.py in toarray(self, order, out)
256 M,N = self.shape
257 coo_todense(M, N, self.nnz, self.row, self.col, self.data,
--> 258 B.ravel('A'), fortran)
259 return B
260
ValueError: could not convert integer scalar
Note: approximately 64 GB RAM is necessary to run this example.
Using numpy 1.12.1 and scipy 0.18.1 on Linux.
What would be the advised workaround for converting a CSR array to a dense array in this case? Something along the lines of the function below?
def csr_to_array_fallback(X):
    """Convert a CSR matrix to a dense array without going through COO."""
    if X.format != 'csr':
        raise ValueError("expected a matrix in CSR format, got %r" % X.format)
    # Expand indptr into an explicit row index for every stored element
    row_indices = np.zeros(len(X.indices), dtype='int32')
    indptr = X.indptr
    for idx in range(len(indptr) - 1):
        row_indices[indptr[idx]:indptr[idx + 1]] = idx
    # Scatter the stored values into a freshly allocated dense array
    Y = np.zeros(X.shape, dtype=X.dtype)
    Y[row_indices, X.indices] = X.data
    return Y
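If that helper is acceptable, a small sanity check (my own sketch, not part of the reproduction above, which needs ~64 GB of RAM) would be:

import numpy as np
import scipy.sparse

# Small-scale check that the fallback matches toarray() where toarray() still works
X_small = scipy.sparse.csr_matrix(np.ones((4, 1)))
P = X_small * X_small[:2].T                     # 4x2 CSR result, all ones
np.testing.assert_array_equal(csr_to_array_fallback(P), P.toarray())

# For the failing case above, the call would simply be:
#     dense = csr_to_array_fallback(X * X[:2**15].T)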
I believe this is, in particular, causing the following issues in scikit-learn:
- http://stackoverflow.com/questions/42543125/could-not-convert-integer-scalar-error-when-using-dbscan
- https://github.com/FreeDiscovery/FreeDiscovery/issues/126
cc @eugene-yang
About this issue
- State: closed
- Created 7 years ago
- Comments: 15 (14 by maintainers)
Unfortunately that fix is still unverified (see the comments starting here), so we can’t merge it yet.
No, as only the nnz integer would be int64. Basically, the task for someone interested would be to replace all integer scalars I by npy_int64 in the sparsetools/*.h, scrap the now-unnecessary integer scalar type detection in sparsetools.cxx, and fix up the integer scalar type generation in generate_sparsetools.py. This would have some impact on 32-bit platform performance, but I think those are not relevant any more.
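Until sparsetools is fixed along those lines, a caller-side guard could serve as a stopgap; below is a rough sketch assuming the csr_to_array_fallback helper proposed above (to_dense_safe is just an illustrative name, and the threshold is the int32 maximum that coo_todense currently trips over):

import numpy as np

def to_dense_safe(X):
    # Fall back to the pure-Python conversion when nnz no longer fits in int32
    if X.format == 'csr' and X.nnz > np.iinfo(np.int32).max:
        return csr_to_array_fallback(X)
    return X.toarray()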