cupy: cub.device_csrmv is corrupting sparse arrays
Here’s a reproducer of the behavior we are seeing on cuML. What’s really strange is that this doesn’t seem to be happening consistently across Linux versions & GPUs.
>>> a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
>>> b = a.tocsr()
>>> cupy.testing.assert_array_equal(a.col, b.indices)
>>> cupy.testing.assert_array_equal(a.data, b.data)
>>> a.sum(axis=0)
array([[0.00000000e+000, 0.00000000e+000, 4.24399158e-314, ...,
1.10383165e-307, 1.10512862e-307, 1.10636362e-307]])
>>> b.sum(axis=0)
array([[3056.30047519, 3058.09608272, 3009.4495842 , ..., 1505.96878088,
1460.47145535, 1505.26624145]])
We know the CSR (b) is correct because it’s the COO that’s yielding incorrect results in one of our primitives.
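As a sanity check (an illustration, not part of the original report), the matrix can be copied back to the host with `.get()` and the column sums compared against SciPy; SciPy should agree with the CSR result:

```python
# Host-side sanity check (illustration): SciPy's column sums should match the
# CSR result on the GPU, while the COO result is corrupted on affected setups.
import numpy
import cupy

a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
expected = numpy.asarray(a.get().sum(axis=0))   # SciPy on the CPU
csr_sum = cupy.asnumpy(a.tocsr().sum(axis=0))   # CSR path on the GPU
coo_sum = cupy.asnumpy(a.sum(axis=0))           # COO path on the GPU

numpy.testing.assert_allclose(expected, csr_sum)  # passes
numpy.testing.assert_allclose(expected, coo_sum)  # fails where the bug occurs
```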
Original cuml issue: https://github.com/rapidsai/cuml/issues/2724
This seems to be the commit where it broke: https://github.com/NVlabs/cub/commit/b2e64cf0fb4ea7ace6c86ca6765ca7c1087ef82e
@leofang, the interesting thing is that the only code path that seems to use cub’s csrmv is when the dot product is taken between a COO matrix and a dense array. A different code path is used when a dot product is taken between a CSR and a dense vector. Is it possible that exact code path is not tested? We wouldn’t have caught this in our pytests on cuML had `cupy.sparse.random` not returned a COO by default (and the temporary fix for us was to just have `cupy.sparse.random` output a CSR). I’m on my phone right now, but I can look through the cupy pytests when I get in front of a computer next.
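For reference, the temporary workaround amounts to something like this sketch (not the actual cuML change):

```python
# Workaround sketch: convert the COO matrix to CSR before the affected
# reduction so the COO-specific cub.device_csrmv path is never hit.
import cupy

a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
column_sums = a.tocsr().sum(axis=0)   # CSR path returns the expected values
```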
@leofang, removing just the `cub.cpp` files was a HUGE savings. Thank you for mentioning that trick!

It looks like reverting the commit from my previous comment fixes the issue!
Of course, I don’t know why the change was made and I don’t know if reverting it breaks anything else in cub or cupy, but at least we have a handle on the problem now.
I am able to reproduce the bug on the `master` branch of cub. I guess I could start a git bisect to try to isolate the exact commit.

CuPy 7.8.0 on CUDA 11 works w/ CUB 1.8.0! That supports your theory that it’s a bug in CUDA 11’s CUB.
@leofang, ah ha! I was able to reproduce the bug w/ v8.0.0 using `CUPY_ACCELERATORS=cub`:
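(The original output isn’t shown here. As an illustration only, a reproduction along those lines looks roughly like the following; note that `CUPY_ACCELERATORS` is read when CuPy is imported, so it has to be set beforehand.)

```python
# Illustration: enable the CUB accelerator before importing CuPy, then run the
# COO reduction that triggers the corrupted sums.
import os
os.environ['CUPY_ACCELERATORS'] = 'cub'   # must be set before importing cupy

import cupy

a = cupy.sparse.random(10000, 1500, format='coo', density=0.5)
print(a.sum(axis=0))           # corrupted values when the buggy CUB path is hit
print(a.tocsr().sum(axis=0))   # expected column sums
```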
@kmaehashi @emcastillo So I looked into this today and found it’s accidentally fixed upstream (NVIDIA/cub#249); see https://github.com/NVIDIA/cub/pull/196#issuecomment-782293149. But the fix won’t land in CUDA 11.3, meaning all CUDA 11.x users are impacted.
Given that Thrust and CUB have been more tightly coupled since CUDA 11.0, I propose two solutions for CUDA 11.x:
Both solutions require passing the flag `-DTHRUST_IGNORE_CUB_VERSION_CHECK=1` to the compiler to ignore the incompatible Thrust version; otherwise we’ll also have to bundle Thrust.
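For context, a minimal sketch (an illustration, not CuPy’s actual build configuration) of how such a define can be passed through a setuptools extension:

```python
# Illustration only: pass the Thrust/CUB version-check override as a
# preprocessor define in a setuptools build. The extension name, sources, and
# include path are placeholders, not CuPy's real build setup.
from setuptools import setup, Extension

ext = Extension(
    'example_cub_ext',                    # hypothetical extension name
    sources=['example_cub_ext.cpp'],      # hypothetical source file
    define_macros=[('THRUST_IGNORE_CUB_VERSION_CHECK', '1')],
    include_dirs=['third_party/cub'],     # hypothetical bundled CUB location
)

setup(name='example-cub-ext', ext_modules=[ext])
```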
@leofang, I just found out the different code path I was seeing is from `a.T`, which converts the CSR into a CSC.

I admit I hadn’t verified whether CSR & COO were still taking different code paths in 8.0.0. It’s good to know those have been consolidated!
Okay, it does appear that in CuPy 8.0.0, both CSR and COO are using the same csrmv code path. This was not the case for CuPy 7.8.0. Here are my results for CuPy 8.0.0 with CUB with a smaller matrix:
@leofang, there does seem to be a threshold at play here, but it seems super small. This might also explain why CI isn’t catching it in the pytests.
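As an illustration of how one might probe for that threshold (not taken from the issue), a quick sweep over matrix sizes can compare the COO and CSR sums:

```python
# Illustration: sweep matrix sizes and flag where the COO column sums start
# diverging from the CSR column sums.
import cupy

for n_rows in (100, 1000, 5000, 10000):
    a = cupy.sparse.random(n_rows, 1500, format='coo', density=0.5)
    coo_sum = a.sum(axis=0)
    csr_sum = a.tocsr().sum(axis=0)
    ok = bool(cupy.allclose(coo_sum, csr_sum))
    print(f'{n_rows} rows: {"OK" if ok else "MISMATCH"}')
```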
Here’s with `CUPY_ACCELERATORS=`:

I can’t think of any recent changes to CUB that might have caused this, but if you can put together a small, pure Thrust/CUB-only reproducer I’ll take a look.
Tagging @allisonvacanti for awareness.
Maybe we could set `CUB_PATH` to somewhere that doesn’t have CUB?

@leofang, we deployed our own build to the RAPIDS conda channel so that we could get the CUDA 11 support out before v8.0.0 is released. It sounds like maybe we need another one built without CUB support.
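As a rough sketch (assuming the build honors `CUB_PATH` as suggested above; this is not a verified recipe), pointing `CUB_PATH` at an empty directory during a source build could look like:

```python
# Illustration only: build CuPy from source with CUB_PATH pointed at an empty
# directory, on the assumption that this keeps the build from picking up CUB.
import os
import subprocess
import tempfile

empty_dir = tempfile.mkdtemp()             # contains no CUB headers
env = dict(os.environ, CUB_PATH=empty_dir)
subprocess.check_call(
    ['pip', 'install', '--no-binary', 'cupy', 'cupy'],
    env=env,
)
```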
This line is causing the failure: