cupy: cub.device_csrmv is corrupting sparse arrays

Here’s a reproducer of the behavior we are seeing on cuML. What’s really strange is that this doesn’t seem to be happening consistently across Linux versions & GPUs.

>>> a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
>>> b = a.tocsr()
>>> cupy.testing.assert_array_equal(a.col, b.indices)
>>> cupy.testing.assert_array_equal(a.data, b.data)
>>> a.sum(axis=0)
array([[0.00000000e+000, 0.00000000e+000, 4.24399158e-314, ...,
        1.10383165e-307, 1.10512862e-307, 1.10636362e-307]])
>>> b.sum(axis=0)
array([[3056.30047519, 3058.09608272, 3009.4495842 , ..., 1505.96878088,
        1460.47145535, 1505.26624145]])

We know the CSR (b) is correct because the COO that’s yielding incorrect results in one of our primitives.

Original cuml issue: https://github.com/rapidsai/cuml/issues/2724

About this issue

Original URL
State: open
Created 4 years ago
Reactions: 1
Comments: 50 (49 by maintainers)

Most upvoted comments

This seems to be the commit where it broke: https://github.com/NVlabs/cub/commit/b2e64cf0fb4ea7ace6c86ca6765ca7c1087ef82e

cjnolet on Aug 21, 2020

@leofang, the interesting thing is that the only code path that seems to use cub’s csrmv is when the dot product is taken between a coo matrix and a dense array. A different code path is used when a dot product is taken between a CSR and a dense vector. Is it possible that exact code path is not tested? We wouldn’t have caught this in our pytests on cuml had cupy.sparse.random not returned a coo by default (and the temporary fix for us was to just have cupy.sparse.random output a CSR).

I’m on my phone right now but I can look through the cupy pytests when I get in front of a computer next.

cjnolet on Aug 25, 2020

@leofang, removing just the cub.cpp files was a HUGE savings. Thank you for mentioning that trick!

It looks like reverting the commit from my previous comment fixes the issue!

>>> import cupy
>>> a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
>>> a.T.dot(cupy.ones(a.shape[0]))
array([3067.87828171, 3042.30450824, 3076.94547971, ..., 1484.1127642 ,
       1494.23297932, 1455.11828512])

Of course, I don’t know why the change was made and I don’t know if reverting it breaks anything else in cub or cupy, but at least we have a handle on the problem now.

cjnolet on Aug 21, 2020

I am able to reproduce the bug on the master branch of cub. I guess I could start a git bisect to try to isolate the exact commit

>>> import cupy
>>> a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
>>> a.T.dot(cupy.ones(a.shape[0]))
array([0.00000000e+000, 1.03977794e-312, 7.12141787e-311, ...,
       5.57236095e-311, 1.97748788e-310, 6.18986172e-311])

cjnolet on Aug 21, 2020

Cupy 7.8.0 CUDA11 works w/ CUB 1.8.0!

Type "help", "copyright", "credits" or "license" for more information.
>>> import cupy
>>> a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
^[[Aa.T.dot(cupy.ones(a.shape[0]))
array([3069.12814446, 3016.85650629, 3040.1645186 , ..., 1482.05807122,
       1470.98337495, 1467.97333949])
>>> cupy.cuda.cub_enabled
True

That supports your theory that it’s a bug in CUDA11’s CUB.

cjnolet on Aug 21, 2020

@leofang, ah ha! I was able to reproduce the bug w/ v8.0.0 using CUPY_ACCELERATORS=cub:

(cuml_015_cuda11_081920) cjnolet@deeplearn ~ $ CUPY_ACCELERATORS=cub python
Python 3.8.5 | packaged by conda-forge | (default, Aug 19 2020, 12:22:27) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cupy
>>> a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
>>> a.T.dot(cupy.ones(a.shape[0]))
array([1.0949384e-317, 3.4308892e-317, 3.4229080e-317, ...,
       3.5623151e-317, 2.3161165e-317, 3.3354051e-317])

cjnolet on Aug 20, 2020

@kmaehashi @emcastillo So I looked into this today and found it’s accidentally fixed upstream (NVIDIA/cub#249), see https://github.com/NVIDIA/cub/pull/196#issuecomment-782293149. But, the fix won’t land in CUDA 11.3, meaning all CUDA 11.x users are impacted.

Given that Thrust and CUB are coupled tighter since CUDA 11.0, I propose two solutions for CUDA 11.x:

Use CuPy’s older CUB bundle
Apply upstream patch to CuPy’s CUB bundle at build time

Both solutions require passing the flag -DTHRUST_IGNORE_CUB_VERSION_CHECK=1 to the compiler to ignore incompatible Thrust version, otherwise we’ll also have to bundle Thrust.

leofang on Feb 20, 2021

@leofang, I just found out the different code path I was seeing is from the a.T, which converts the csr into a csc.

>>> a = cupy.sparse.random(10000, 1500, format='csr', density=0.5)
>>> a.T.dot(cupy.ones(a.shape[0]))
array([3123.54679989, 3073.97259718, 3052.88266851, ..., 1461.38287974,
       1504.42127986, 1475.40433775])
>>> a.dot(cupy.ones(a.shape[0]))
array([1.50661701e-312, 1.66152270e-311, 1.43022516e-311, ...,
       6.02646805e-312, 5.07156994e-312, 7.04502603e-312])
>>> a.T
<cupyx.scipy.sparse.csc.csc_matrix object at 0x7f896be2f730>

cjnolet on Aug 25, 2020

I admit I hadn’t verified whether CSR & COO were still taking different code paths in 8.0.0. It’s good to know those have been consolidated!

cjnolet on Aug 25, 2020

Okay, it does appear that in CuPy 8.0.0, both CSR and COO are using the same csrmv code path. This was not the case for Cupy 7.8.0. Here’s my results for cupy 8.0.0 with CUB with a smaller matrix:


>>> import cupy
>>> a = cupy.sparse.random(100, 100, format='coo', density=0.9)
^[[Aa.dot(cupy.ones(a.shape[1]))
INSIDE CUB CSRMV
array([-1.09843973e-080,  4.91008549e+046, -1.17988717e-104,
        6.99932043e+295,  2.42403271e-127, -3.26581677e-277,
        3.47191605e-129,  6.62818767e-021,  3.06974713e-149,
       -2.11214480e-057, -7.77939700e+159, -5.40782771e-103,
        2.03859938e-282, -9.32777368e+027,  1.88102132e-108,
        1.29794968e+120, -2.09877893e-199,  3.44640527e-070,
       -1.38748257e-052,  3.84909889e+261, -9.18282810e+236,
        2.47000872e-091,  1.58638379e+195, -5.36167171e+115,
       -3.35945186e-227, -5.62999738e-141,  9.77542923e+127,
        1.76174398e+118, -5.44874182e-268,  1.93601629e+098,
       -9.29514754e+191,  1.31356814e+046, -1.27918464e+204,
       -2.49376511e+241,  6.96890771e+300,  3.11723076e-242,
        4.33077442e-169, -5.67430417e-182,  1.90971840e+236,
        2.71666082e-305,  2.19547250e+173, -1.36475442e+249,
       -9.35892812e-181,  2.12159250e-075, -3.19393319e-029,
       -8.18874645e+218,  1.12550972e-182, -1.60964135e-184,
        4.06294253e-085, -3.47320529e+078,  2.34100711e+255,
       -1.61038977e-228,  1.37554378e-151, -4.32621245e-122,
       -1.10940516e-132, -9.64650402e+259, -5.76779187e+143,
        5.65291584e-218, -2.00431360e-230,  1.88941856e+297,
       -1.51501755e+044,  6.00606503e+091,  6.00132871e-100,
       -1.87984317e-083,  1.76310444e+176, -1.05254743e+162,
        1.82569015e-289,  2.65043541e+306,  7.62356251e+287,
        7.71462371e-204,  4.60326697e+125, -6.40480387e-128,
        1.22730282e-117,  2.65811104e-036, -1.91813167e+191,
        4.77836538e+303,  1.31087216e+029,  2.50214181e+225,
       -1.97948461e-025,  2.06768007e+260, -3.51315475e-064,
       -2.51079189e-017,  2.35410016e+226, -5.89173520e+268,
       -9.09507957e-037, -1.44544941e+119, -1.78004770e+083,
       -1.07267670e-294, -8.34627301e+091, -2.20502303e-203,
       -1.42694819e+252,  2.43210978e-291,  7.45285350e-034,
       -3.98635272e-180,  2.55662269e-075, -7.78645011e+193,
        7.49659229e+195,  2.15807841e-100,  1.00274159e+192,
        7.46077815e+145])
>>> a = cupy.sparse.random(100, 100, format='csr', density=0.9)
>>> a.dot(cupy.ones(a.shape[1]))
INSIDE CUB CSRMV
array([-1.93835463e-004,  2.16301178e-087, -1.27529377e-248,
       -5.91318847e+256,  6.12721527e+048,  3.54993243e+221,
       -2.45782167e-217, -1.13194865e-270,  1.19685764e-307,
       -1.70723501e+048,  3.30074267e+158, -1.06484668e+303,
        1.06033238e-222,  2.44691967e-045,  3.35712358e+018,
        3.68890212e-005, -1.68353056e-237, -5.48242687e-199,
       -3.55293075e+170,  3.06103573e-063,  7.71081785e+115,
        1.34187650e-131,  4.13998181e+201,  2.51162148e-300,
        1.08663275e-083, -5.83086283e-240,  1.71759863e+217,
       -3.27619099e+124,  4.46342889e+232, -2.21296752e-198,
       -1.30422434e+130, -5.10990069e-101, -1.91302098e-222,
       -3.98240656e+124,  1.44387526e+276, -2.88128888e-081,
       -7.41149511e+175, -2.23917292e-093,  3.17862269e-082,
       -1.64747643e+094,  3.17246431e+139,  5.97285105e+051,
        2.86046073e+126, -1.87682192e+213,  8.18379821e+018,
        6.11198604e-268, -2.92639926e-106,  3.91598542e+245,
        4.43163723e+138, -1.95233717e+301,  3.38444604e+072,
       -2.25374508e+227, -2.90772594e+083, -2.50408034e+298,
       -1.21599827e+185, -1.06936930e+029, -1.58090080e+141,
       -1.29898423e+299, -1.63729333e-082, -1.73716602e+068,
       -2.70437597e+066, -1.45541519e-093, -8.97278764e+113,
        1.10328388e-134, -9.52743196e-118,  1.26758494e-247,
       -7.54001140e-084,  4.70757436e+048,  2.44793185e-023,
       -4.20990335e-166,  6.28462675e+245,  2.47417325e-052,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000])

cjnolet on Aug 25, 2020

>>> import cupy as cp
>>> a = cp.sparse.random(10000, 1500, format='coo', density=0.5)
>>> x = cp.random.random(1500)
>>> a * x
CUB is used
array([1.3996321e-317, 5.4763550e-317, 4.1680465e-318, ...,
       1.7190224e-318, 5.3377049e-317, 4.6850723e-317])
>>> a = cp.sparse.random(10000, 1500, format='csr', density=0.5)
>>> a * x
CUB is used
array([6.5346501e-317, 7.1236238e-317, 2.3220685e-317, ...,
       5.0230666e-317, 5.1362121e-317, 6.7755397e-317])

leofang on Aug 25, 2020

@leofang, there does seem to be a threshold at play here, but it seems super small. This might also explain why CI isn’t catching it in the pytests.

>>> a = cupy.sparse.csr_matrix(cupy.ones((20, 20)))
>>> a.dot(cupy.ones(a.shape[1]))
Debug: INSIDE CUB CSRMV
array([7.74860419e-304, 7.74860419e-304, 1.00000000e+000, 1.00000000e+000,
       1.00000000e+000, 1.00000000e+000, 1.00000000e+000, 1.00000000e+000,
       1.00000000e+000, 1.00000000e+000, 1.00000000e+000, 1.00000000e+000,
       1.00000000e+000, 1.00000000e+000, 1.00000000e+000, 1.00000000e+000,
       1.00000000e+000, 1.00000000e+000, 1.00000000e+000, 1.00000000e+000])
>>> a = cupy.sparse.csr_matrix(cupy.ones((10, 10)))
>>> a.dot(cupy.ones(a.shape[1]))
Debug: INSIDE CUB CSRMV
array([10., 10., 10., 10., 10., 10., 10., 10., 10., 10.])

Here’s with CUPY_ACCELERATORS=:

>>> import cupy
>>> a = cupy.sparse.csr_matrix(cupy.ones((20, 20)))
>>> a.dot(cupy.ones(a.shape[1]))
array([20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20.,
       20., 20., 20., 20., 20., 20., 20.])

cjnolet on Aug 25, 2020

I can’t think of any recent changes to CUB that might have caused this, but if you can put together a pure small Thrust/CUB-only reproducer I’ll take a look.

alliepiper on Aug 21, 2020

Tagging @allisonvacanti for awareness.

leofang on Aug 21, 2020

Maybe we could set CUB_PATH to somewhere that doesn’t have CUB?

jakirkham on Aug 20, 2020

@leofang, we deployed our own build to the RAPIDS conda so that we could get the CUDA 11 support out before v8.0.0 is released. It sounds like maybe we need another one built without CUB support.

cjnolet on Aug 20, 2020

This line is causing the failure:

>>> a.T.dot(cupy.ones(a.shape[0])).reshape(1, a.shape[1])
array([[0.00000000e+000, 0.00000000e+000, 4.24399158e-314, ...,
        1.10383165e-307, 1.10512862e-307, 1.10636362e-307]])
>>> b.T.dot(cupy.ones(b.shape[0])).reshape(1, b.shape[1])
array([[3056.30047519, 3058.09608272, 3009.4495842 , ..., 1505.96878088,
        1460.47145535, 1505.26624145]])

cjnolet on Aug 20, 2020