cupy: cub.device_csrmv is corrupting sparse arrays

Here’s a reproducer of the behavior we are seeing on cuML. What’s really strange is that this doesn’t seem to be happening consistently across Linux versions & GPUs.

>>> a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
>>> b = a.tocsr()
>>> cupy.testing.assert_array_equal(a.col, b.indices)
>>> cupy.testing.assert_array_equal(a.data, b.data)
>>> a.sum(axis=0)
array([[0.00000000e+000, 0.00000000e+000, 4.24399158e-314, ...,
        1.10383165e-307, 1.10512862e-307, 1.10636362e-307]])
>>> b.sum(axis=0)
array([[3056.30047519, 3058.09608272, 3009.4495842 , ..., 1505.96878088,
        1460.47145535, 1505.26624145]])

We know the CSR (b) is correct because the COO that’s yielding incorrect results in one of our primitives.

Original cuml issue: https://github.com/rapidsai/cuml/issues/2724

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 1
  • Comments: 50 (49 by maintainers)

Most upvoted comments

@leofang, the interesting thing is that the only code path that seems to use cub’s csrmv is when the dot product is taken between a coo matrix and a dense array. A different code path is used when a dot product is taken between a CSR and a dense vector. Is it possible that exact code path is not tested? We wouldn’t have caught this in our pytests on cuml had cupy.sparse.random not returned a coo by default (and the temporary fix for us was to just have cupy.sparse.random output a CSR).

I’m on my phone right now but I can look through the cupy pytests when I get in front of a computer next.

@leofang, removing just the cub.cpp files was a HUGE savings. Thank you for mentioning that trick!

It looks like reverting the commit from my previous comment fixes the issue!

>>> import cupy
>>> a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
>>> a.T.dot(cupy.ones(a.shape[0]))
array([3067.87828171, 3042.30450824, 3076.94547971, ..., 1484.1127642 ,
       1494.23297932, 1455.11828512])

Of course, I don’t know why the change was made and I don’t know if reverting it breaks anything else in cub or cupy, but at least we have a handle on the problem now.

I am able to reproduce the bug on the master branch of cub. I guess I could start a git bisect to try to isolate the exact commit

>>> import cupy
>>> a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
>>> a.T.dot(cupy.ones(a.shape[0]))
array([0.00000000e+000, 1.03977794e-312, 7.12141787e-311, ...,
       5.57236095e-311, 1.97748788e-310, 6.18986172e-311])

Cupy 7.8.0 CUDA11 works w/ CUB 1.8.0!

Type "help", "copyright", "credits" or "license" for more information.
>>> import cupy
>>> a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
^[[Aa.T.dot(cupy.ones(a.shape[0]))
array([3069.12814446, 3016.85650629, 3040.1645186 , ..., 1482.05807122,
       1470.98337495, 1467.97333949])
>>> cupy.cuda.cub_enabled
True

That supports your theory that it’s a bug in CUDA11’s CUB.

@leofang, ah ha! I was able to reproduce the bug w/ v8.0.0 using CUPY_ACCELERATORS=cub:

(cuml_015_cuda11_081920) cjnolet@deeplearn ~ $ CUPY_ACCELERATORS=cub python
Python 3.8.5 | packaged by conda-forge | (default, Aug 19 2020, 12:22:27) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cupy
>>> a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
>>> a.T.dot(cupy.ones(a.shape[0]))
array([1.0949384e-317, 3.4308892e-317, 3.4229080e-317, ...,
       3.5623151e-317, 2.3161165e-317, 3.3354051e-317])

@kmaehashi @emcastillo So I looked into this today and found it’s accidentally fixed upstream (NVIDIA/cub#249), see https://github.com/NVIDIA/cub/pull/196#issuecomment-782293149. But, the fix won’t land in CUDA 11.3, meaning all CUDA 11.x users are impacted.

Given that Thrust and CUB are coupled tighter since CUDA 11.0, I propose two solutions for CUDA 11.x:

  1. Use CuPy’s older CUB bundle
  2. Apply upstream patch to CuPy’s CUB bundle at build time

Both solutions require passing the flag -DTHRUST_IGNORE_CUB_VERSION_CHECK=1 to the compiler to ignore incompatible Thrust version, otherwise we’ll also have to bundle Thrust.

@leofang, I just found out the different code path I was seeing is from the a.T, which converts the csr into a csc.

>>> a = cupy.sparse.random(10000, 1500, format='csr', density=0.5)
>>> a.T.dot(cupy.ones(a.shape[0]))
array([3123.54679989, 3073.97259718, 3052.88266851, ..., 1461.38287974,
       1504.42127986, 1475.40433775])
>>> a.dot(cupy.ones(a.shape[0]))
array([1.50661701e-312, 1.66152270e-311, 1.43022516e-311, ...,
       6.02646805e-312, 5.07156994e-312, 7.04502603e-312])
>>> a.T
<cupyx.scipy.sparse.csc.csc_matrix object at 0x7f896be2f730>

I admit I hadn’t verified whether CSR & COO were still taking different code paths in 8.0.0. It’s good to know those have been consolidated!

Okay, it does appear that in CuPy 8.0.0, both CSR and COO are using the same csrmv code path. This was not the case for Cupy 7.8.0. Here’s my results for cupy 8.0.0 with CUB with a smaller matrix:


>>> import cupy
>>> a = cupy.sparse.random(100, 100, format='coo', density=0.9)
^[[Aa.dot(cupy.ones(a.shape[1]))
INSIDE CUB CSRMV
array([-1.09843973e-080,  4.91008549e+046, -1.17988717e-104,
        6.99932043e+295,  2.42403271e-127, -3.26581677e-277,
        3.47191605e-129,  6.62818767e-021,  3.06974713e-149,
       -2.11214480e-057, -7.77939700e+159, -5.40782771e-103,
        2.03859938e-282, -9.32777368e+027,  1.88102132e-108,
        1.29794968e+120, -2.09877893e-199,  3.44640527e-070,
       -1.38748257e-052,  3.84909889e+261, -9.18282810e+236,
        2.47000872e-091,  1.58638379e+195, -5.36167171e+115,
       -3.35945186e-227, -5.62999738e-141,  9.77542923e+127,
        1.76174398e+118, -5.44874182e-268,  1.93601629e+098,
       -9.29514754e+191,  1.31356814e+046, -1.27918464e+204,
       -2.49376511e+241,  6.96890771e+300,  3.11723076e-242,
        4.33077442e-169, -5.67430417e-182,  1.90971840e+236,
        2.71666082e-305,  2.19547250e+173, -1.36475442e+249,
       -9.35892812e-181,  2.12159250e-075, -3.19393319e-029,
       -8.18874645e+218,  1.12550972e-182, -1.60964135e-184,
        4.06294253e-085, -3.47320529e+078,  2.34100711e+255,
       -1.61038977e-228,  1.37554378e-151, -4.32621245e-122,
       -1.10940516e-132, -9.64650402e+259, -5.76779187e+143,
        5.65291584e-218, -2.00431360e-230,  1.88941856e+297,
       -1.51501755e+044,  6.00606503e+091,  6.00132871e-100,
       -1.87984317e-083,  1.76310444e+176, -1.05254743e+162,
        1.82569015e-289,  2.65043541e+306,  7.62356251e+287,
        7.71462371e-204,  4.60326697e+125, -6.40480387e-128,
        1.22730282e-117,  2.65811104e-036, -1.91813167e+191,
        4.77836538e+303,  1.31087216e+029,  2.50214181e+225,
       -1.97948461e-025,  2.06768007e+260, -3.51315475e-064,
       -2.51079189e-017,  2.35410016e+226, -5.89173520e+268,
       -9.09507957e-037, -1.44544941e+119, -1.78004770e+083,
       -1.07267670e-294, -8.34627301e+091, -2.20502303e-203,
       -1.42694819e+252,  2.43210978e-291,  7.45285350e-034,
       -3.98635272e-180,  2.55662269e-075, -7.78645011e+193,
        7.49659229e+195,  2.15807841e-100,  1.00274159e+192,
        7.46077815e+145])
>>> a = cupy.sparse.random(100, 100, format='csr', density=0.9)
>>> a.dot(cupy.ones(a.shape[1]))
INSIDE CUB CSRMV
array([-1.93835463e-004,  2.16301178e-087, -1.27529377e-248,
       -5.91318847e+256,  6.12721527e+048,  3.54993243e+221,
       -2.45782167e-217, -1.13194865e-270,  1.19685764e-307,
       -1.70723501e+048,  3.30074267e+158, -1.06484668e+303,
        1.06033238e-222,  2.44691967e-045,  3.35712358e+018,
        3.68890212e-005, -1.68353056e-237, -5.48242687e-199,
       -3.55293075e+170,  3.06103573e-063,  7.71081785e+115,
        1.34187650e-131,  4.13998181e+201,  2.51162148e-300,
        1.08663275e-083, -5.83086283e-240,  1.71759863e+217,
       -3.27619099e+124,  4.46342889e+232, -2.21296752e-198,
       -1.30422434e+130, -5.10990069e-101, -1.91302098e-222,
       -3.98240656e+124,  1.44387526e+276, -2.88128888e-081,
       -7.41149511e+175, -2.23917292e-093,  3.17862269e-082,
       -1.64747643e+094,  3.17246431e+139,  5.97285105e+051,
        2.86046073e+126, -1.87682192e+213,  8.18379821e+018,
        6.11198604e-268, -2.92639926e-106,  3.91598542e+245,
        4.43163723e+138, -1.95233717e+301,  3.38444604e+072,
       -2.25374508e+227, -2.90772594e+083, -2.50408034e+298,
       -1.21599827e+185, -1.06936930e+029, -1.58090080e+141,
       -1.29898423e+299, -1.63729333e-082, -1.73716602e+068,
       -2.70437597e+066, -1.45541519e-093, -8.97278764e+113,
        1.10328388e-134, -9.52743196e-118,  1.26758494e-247,
       -7.54001140e-084,  4.70757436e+048,  2.44793185e-023,
       -4.20990335e-166,  6.28462675e+245,  2.47417325e-052,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000])
>>> import cupy as cp
>>> a = cp.sparse.random(10000, 1500, format='coo', density=0.5)
>>> x = cp.random.random(1500)
>>> a * x
CUB is used
array([1.3996321e-317, 5.4763550e-317, 4.1680465e-318, ...,
       1.7190224e-318, 5.3377049e-317, 4.6850723e-317])
>>> a = cp.sparse.random(10000, 1500, format='csr', density=0.5)
>>> a * x
CUB is used
array([6.5346501e-317, 7.1236238e-317, 2.3220685e-317, ...,
       5.0230666e-317, 5.1362121e-317, 6.7755397e-317])

@leofang, there does seem to be a threshold at play here, but it seems super small. This might also explain why CI isn’t catching it in the pytests.

>>> a = cupy.sparse.csr_matrix(cupy.ones((20, 20)))
>>> a.dot(cupy.ones(a.shape[1]))
Debug: INSIDE CUB CSRMV
array([7.74860419e-304, 7.74860419e-304, 1.00000000e+000, 1.00000000e+000,
       1.00000000e+000, 1.00000000e+000, 1.00000000e+000, 1.00000000e+000,
       1.00000000e+000, 1.00000000e+000, 1.00000000e+000, 1.00000000e+000,
       1.00000000e+000, 1.00000000e+000, 1.00000000e+000, 1.00000000e+000,
       1.00000000e+000, 1.00000000e+000, 1.00000000e+000, 1.00000000e+000])
>>> a = cupy.sparse.csr_matrix(cupy.ones((10, 10)))
>>> a.dot(cupy.ones(a.shape[1]))
Debug: INSIDE CUB CSRMV
array([10., 10., 10., 10., 10., 10., 10., 10., 10., 10.])

Here’s with CUPY_ACCELERATORS=:

>>> import cupy
>>> a = cupy.sparse.csr_matrix(cupy.ones((20, 20)))
>>> a.dot(cupy.ones(a.shape[1]))
array([20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20.,
       20., 20., 20., 20., 20., 20., 20.])

I can’t think of any recent changes to CUB that might have caused this, but if you can put together a pure small Thrust/CUB-only reproducer I’ll take a look.

Tagging @allisonvacanti for awareness.

Maybe we could set CUB_PATH to somewhere that doesn’t have CUB?

@leofang, we deployed our own build to the RAPIDS conda so that we could get the CUDA 11 support out before v8.0.0 is released. It sounds like maybe we need another one built without CUB support.

This line is causing the failure:

>>> a.T.dot(cupy.ones(a.shape[0])).reshape(1, a.shape[1])
array([[0.00000000e+000, 0.00000000e+000, 4.24399158e-314, ...,
        1.10383165e-307, 1.10512862e-307, 1.10636362e-307]])
>>> b.T.dot(cupy.ones(b.shape[0])).reshape(1, b.shape[1])
array([[3056.30047519, 3058.09608272, 3009.4495842 , ..., 1505.96878088,
        1460.47145535, 1505.26624145]])