scipy: Sparse COO.toarray fails if nnz > 2**31

While handling of sparse arrays with more than 2**31 non-zero elements was enabled in #442, the conversion from CSR to a dense array still fails in this case with an error in coo_todense.

In [1]: import scipy.sparse
   ...: import numpy as np
   ...: X = scipy.sparse.csr_matrix(np.ones((2**16, 1)))
   ...: X
Out[1]:
<65536x1 sparse matrix of type '<class 'numpy.float64'>'
        with 65536 stored elements in Compressed Sparse Row format>

In [2]: (X*X[:2**15-1].T).toarray()
Out[2]:
array([[ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       ...,
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.]])

In [3]: (X*X[:2**15].T).toarray()
ValueError                                Traceback (most recent call last)
<ipython-input-6-ee8192b524ea> in <module>()
----> 1 (X*X[:2**15].T).toarray()

/home/ec2-user/miniconda3/envs/test-env/lib/python3.6/site-packages/scipy/sparse/compressed.py in toarray(self, order, out)
    918     def toarray(self, order=None, out=None):
    919         """See the docstring for `spmatrix.toarray`."""
--> 920         return self.tocoo(copy=False).toarray(order=order, out=out)
    921
    922     ##############################################################

/home/ec2-user/miniconda3/envs/test-env/lib/python3.6/site-packages/scipy/sparse/coo.py in toarray(self, order, out)
    256         M,N = self.shape
    257         coo_todense(M, N, self.nnz, self.row, self.col, self.data,
--> 258                     B.ravel('A'), fortran)
    259         return B
    260

ValueError: could not convert integer scalar

Note: approximately 64 GB RAM is necessary to run this example.

Using numpy 1.12.1 and scipy 0.18.1 on Linux.
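
For context, a minimal pre-flight check, as a sketch only (check_todense_safe is a hypothetical helper name, and the 2**31 threshold is the one described in this report for scipy 0.18.1), to detect whether a conversion is likely to hit this failure before attempting it:

import numpy as np

def check_todense_safe(X):
    # coo_todense fails once the number of stored elements no longer fits
    # in a 32-bit signed integer (see the traceback above), so refuse to
    # densify anything with nnz above that limit.
    return X.nnz <= np.iinfo(np.int32).max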

What would be the advised workaround in this case to convert a CSR array to a dense array? Something along the lines of the function below?

import numpy as np

def csr_to_array_fallback(X):
    if X.format != 'csr':
        raise ValueError("expected a CSR matrix, got format %r" % X.format)
    # Expand indptr into an explicit row index for every stored element;
    # int32 is enough as long as the matrix has fewer than 2**31 rows.
    row_indices = np.zeros(len(X.indices), dtype='int32')
    indptr = X.indptr
    for idx in range(len(indptr) - 1):
        row_indices[indptr[idx]:indptr[idx+1]] = idx
    # Scatter the stored values into a dense array via fancy indexing,
    # which bypasses coo_todense entirely.
    Y = np.zeros(X.shape, dtype=X.dtype)
    Y[row_indices, X.indices] = X.data
    return Y
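
For what it's worth, the loop over indptr can also be vectorized with np.repeat; below is a minimal sketch of an equivalent variant (csr_to_array_vectorized is just an illustrative name). Like the loop version, it assumes a canonical CSR matrix with no duplicate entries, since fancy-index assignment keeps only the last value written to a position:

import numpy as np

def csr_to_array_vectorized(X):
    if X.format != 'csr':
        raise ValueError("expected a CSR matrix, got format %r" % X.format)
    # Number of stored elements in each row, derived from indptr.
    counts = np.diff(X.indptr)
    # Expand to one row index per stored element without a Python-level loop.
    row_indices = np.repeat(np.arange(X.shape[0]), counts)
    # Scatter the stored values into a dense array, bypassing coo_todense.
    Y = np.zeros(X.shape, dtype=X.dtype)
    Y[row_indices, X.indices] = X.data
    return Y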

I believe this is, in particular, causing related issues in scikit-learn.

cc @eugene-yang

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments

Unfortunately that fix is still unverified (see the comments starting here), so we can’t merge it yet.

No, as only the nnz integer would be int64.

Basically, the task for someone interested would be to replace all uses of the integer scalar type I by npy_int64 in sparsetools/*.h, scrap the now-unnecessary integer scalar type detection in sparsetools.cxx, and fix up the integer scalar type generation in generate_sparsetools.py.

This would have some impact on 32-bit platform performance, but I think those platforms are no longer relevant.