cupy: Cupy function doesn't utilize pinned memory inside stream

  • Conditions
      CuPy Version          : 7.2.0
      CUDA Root             : /usr/common/software/cuda/10.1.243
      CUDA Build Version    : 10010
      CUDA Driver Version   : 10020
      CUDA Runtime Version  : 10010
      cuBLAS Version        : 10202
      cuFFT Version         : 10102
      cuRAND Version        : 10102
      cuSOLVER Version      : (10, 3, 0)
      cuSPARSE Version      : 10301
      NVRTC Version         : (10, 1)
      cuDNN Build Version   : 7605
      cuDNN Version         : 7605
      NCCL Build Version    : 2506
      NCCL Runtime Version  : 2506

  • Code to reproduce

import cupy as cp
import cupyx as cpx
import cupyx.scipy.sparse  # needed for cpx.scipy.sparse.csr_matrix

stream_1 = cp.cuda.stream.Stream()
with stream_1:
    cp.random.seed(1)
    A = cp.random.rand(10000, 10000)
    # round-trip through CSR, then eigendecompose the dense matrix
    w, v = cp.linalg.eigh(cpx.scipy.sparse.csr_matrix(A).todense())
  • Error messages, stack traces, or logs
    Profiling the code above, I observe many small bursts of cudaMemcpy2DAsync inside eigh, even though I never explicitly ask CuPy to transfer data back and the call runs inside a stream. How do I force CuPy to use pinned memory efficiently?
    (Attachments: screenshot from 2020-03-04 18-41-52, eigh_profile5.qdrep.zip)

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 16 (11 by maintainers)

Most upvoted comments

FYI this was opened as a bug internally in NVIDIA.

Looks like those data transfers are made outside of CuPy (likely in cuSPARSE or cuSOLVER). IIUC almost all CuPy internal kernels are prefixed with cupy_ (or cupyx_), but I don’t see any in those transfers.

I’m not entirely sure here, but the issue might be cuSOLVER doing these data transfers. CuPy has a pinned memory pool used for its own data transfers, but we can’t guarantee what happens inside the CUDA libraries.

cc. @pentschev @anaruse

Reference : https://docs-cupy.chainer.org/en/stable/reference/memory.html

Thank you, we appreciate it!

As this is related to CUDA libraries more than CuPy, we will close this issue.

@jakirkham V100 with 16GB of memory. A more detailed description of the configuration can be found here: https://docs-dev.nersc.gov/cgpu/hardware/

OK, good to see that performance improvement at least.

Not sure that’s needed yet.

Am talking to someone who knows this a bit better to get some more insight into what is going on here.