scipy: Sparse COO.toarray fails if nnz > 2**31

While handling of sparse arrays with more than 2**31 non-zero elements was enabled in #442, the conversion from CSR to a dense array still fails in this case with an error in coo_todense.

In [1]: import scipy.sparse
   ...: import numpy as np
   ...: X = scipy.sparse.csr_matrix(np.ones((2**16, 1)))
   ...: X
Out[1]:
<65536x1 sparse matrix of type '<class 'numpy.float64'>'
        with 65536 stored elements in Compressed Sparse Row format>

In [2]: (X*X[:2**15-1].T).toarray()
Out[2]:
array([[ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       ...,
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.]])

In [3]: (X*X[:2**15].T).toarray()
ValueError                                Traceback (most recent call last)
<ipython-input-6-ee8192b524ea> in <module>()
----> 1 (X*X[:2**15].T).toarray()

/home/ec2-user/miniconda3/envs/test-env/lib/python3.6/site-packages/scipy/sparse/compressed.py in toarray(self, order, out)
    918     def toarray(self, order=None, out=None):
    919         """See the docstring for `spmatrix.toarray`."""
--> 920         return self.tocoo(copy=False).toarray(order=order, out=out)
    921
    922     ##############################################################

/home/ec2-user/miniconda3/envs/test-env/lib/python3.6/site-packages/scipy/sparse/coo.py in toarray(self, order, out)
    256         M,N = self.shape
    257         coo_todense(M, N, self.nnz, self.row, self.col, self.data,
--> 258                     B.ravel('A'), fortran)
    259         return B
    260

ValueError: could not convert integer scalar

Note: approximately 64 GB RAM is necessary to run this example.

Using numpy 1.12.1 and scipy 0.18.1 on Linux.
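
For context, a minimal pre-flight check, as a sketch only (check_todense_safe is a hypothetical helper name, and the 2**31 threshold is the one described in this report for scipy 0.18.1), to detect whether a conversion is likely to hit this failure before attempting it:

import numpy as np

def check_todense_safe(X):
    # coo_todense fails once the number of stored elements no longer fits
    # in a 32-bit signed integer (see the traceback above), so refuse to
    # densify anything with nnz above that limit.
    return X.nnz <= np.iinfo(np.int32).max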

What would be the advised workaround in this case to convert a CSR array to a dense array? Something along the lines of the function below?

import numpy as np

def csr_to_array_fallback(X):
    if X.format != 'csr':
        raise ValueError("expected a CSR matrix, got format %r" % X.format)
    # Expand indptr into an explicit row index for every stored element;
    # int32 is enough as long as the matrix has fewer than 2**31 rows.
    row_indices = np.zeros(len(X.indices), dtype='int32')
    indptr = X.indptr
    for idx in range(len(indptr) - 1):
        row_indices[indptr[idx]:indptr[idx+1]] = idx
    # Scatter the stored values into a dense array via fancy indexing,
    # which bypasses coo_todense entirely.
    Y = np.zeros(X.shape, dtype=X.dtype)
    Y[row_indices, X.indices] = X.data
    return Y
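
For what it's worth, the loop over indptr can also be vectorized with np.repeat; below is a minimal sketch of an equivalent variant (csr_to_array_vectorized is just an illustrative name). Like the loop version, it assumes a canonical CSR matrix with no duplicate entries, since fancy-index assignment keeps only the last value written to a position:

import numpy as np

def csr_to_array_vectorized(X):
    if X.format != 'csr':
        raise ValueError("expected a CSR matrix, got format %r" % X.format)
    # Number of stored elements in each row, derived from indptr.
    counts = np.diff(X.indptr)
    # Expand to one row index per stored element without a Python-level loop.
    row_indices = np.repeat(np.arange(X.shape[0]), counts)
    # Scatter the stored values into a dense array, bypassing coo_todense.
    Y = np.zeros(X.shape, dtype=X.dtype)
    Y[row_indices, X.indices] = X.data
    return Y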

I believe this is, in particular, causing related issues in scikit-learn.

cc @eugene-yang

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments

Unfortunately that fix is still unverified (see the comments starting here), so we can’t merge it yet.

No, as only the nnz integer would be int64.

Basically, the task for someone interested would be to replace all uses of the integer scalar type I by npy_int64 in sparsetools/*.h, scrap the now-unnecessary integer scalar type detection in sparsetools.cxx, and fix up the integer scalar type generation in generate_sparsetools.py.

This would have some impact on 32-bit platform performance, but I think those platforms are no longer relevant.