cupy: Performance measurements - `cp.matmul` slower than `torch.matmul`
I just installed CuPy and ran some simple performance benchmarks for comparison. I chose matrix multiplication since it's the simplest problem to start with.
I measure CuPy time with the following code snippet:
```python
from contextlib import contextmanager

import cupy

@contextmanager
def timing():
    class Foo:  # dummy class to pass results out of the context manager
        pass
    res = Foo()
    start = cupy.cuda.Event(disable_timing=False)
    end = cupy.cuda.Event(disable_timing=False)
    start.record()
    yield res
    end.record()
    # I'm not sure about this line, just guessed by analogy from torch.
    # Without it, a DeviceNotReady error is raised.
    end.synchronize()
    res.result = cupy.cuda.get_elapsed_time(start, end) / 1000  # ms -> s
```
I have a similar script for measuring time in torch, based on this thread: https://discuss.pytorch.org/t/how-to-measure-time-in-pytorch/26964. The above context manager is used in the following way:
```python
N = 3000  # example size
x = cupy.random.rand(N, N, dtype=cupy.float32)
y = cupy.random.rand(N, N, dtype=cupy.float32)
with timing() as t:
    z = cupy.matmul(x, y)
print('time', t.result)
```
Link to the full measurement script: https://github.com/danlkv/QTensor/tree/merged_ix/scratchpad/bench/matmul
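For reference, the torch-side measurement follows the CUDA-event pattern from the linked thread; a minimal sketch (names and the size `N` are illustrative, tensors are created directly on the GPU):

```python
import torch

N = 3000
x = torch.rand(N, N, dtype=torch.float32, device='cuda')
y = torch.rand(N, N, dtype=torch.float32, device='cuda')

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
z = torch.matmul(x, y)
end.record()
torch.cuda.synchronize()  # wait until the recorded events have completed
print('time', start.elapsed_time(end) / 1000)  # elapsed_time returns ms
```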
I know that torch also uses CUDA, so I would expect the times for torch and CuPy to be similar, since most of the work is done by the same (?) CUDA backend function. I do 10 runs of a square N×N matrix multiplication; here are some of the results I get:
| library | matrix size | FLOP/s | average time of 10 runs (s) |
|---|---|---|---|
| torch | 2000 | 879.23G | 0.00909883852005005 |
| torch | 3000 | 1.06T | 0.02553908138275146 |
| torch | 3001 | 1.22T | 0.02214127025604248 |
| cupy | 2000 | 690.78G | 0.011581078433990479 |
| cupy | 3000 | 741.55G | 0.036410070610046384 |
| cupy | 3001 | 720.97G | 0.037487194061279296 |
So it looks like torch is somehow ~50% faster… Torch also gets ~15% faster for size 3001 than for 3000, which is strange, but probably not related to CuPy.
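(As a sanity check on the FLOP/s column: it appears to count N³ operations, i.e. one fused multiply-add per inner-product step; e.g. 3000³ / 0.02554 s ≈ 1.06 × 10¹², matching the torch row for size 3000.)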
My guess would be that some time is spent on data transfer to the GPU: while I don't include the `.to('cuda')` call in the torch measurements, CuPy might do the tensor movement inside `cupy.matmul`.
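One way to check that guess, as a sketch reusing the `timing()` context manager above, is to time an explicit host-to-device copy separately from the matmul:

```python
import numpy
import cupy

N = 3000
x_cpu = numpy.random.rand(N, N).astype(numpy.float32)

# Time just the host-to-device transfer.
with timing() as t:
    x_gpu = cupy.asarray(x_cpu)
print('transfer time', t.result)
```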
System specs
- System: WSL Ubuntu 20.04
- CuPy:

```
» python -c 'import cupy; cupy.show_config()'
OS                          : Linux-4.19.128-microsoft-standard-x86_64-with-glibc2.29
CuPy Version                : 8.6.0
NumPy Version               : 1.19.4
SciPy Version               : 1.3.3
Cython Build Version        : 0.29.22
CUDA Root                   : /usr/local/cuda
CUDA Build Version          : 11020
CUDA Driver Version         : 11030
CUDA Runtime Version        : 11020
cuBLAS Version              : 11401
cuFFT Version               : 10401
cuRAND Version              : 10203
cuSOLVER Version            : (11, 1, 0)
cuSPARSE Version            : 11401
NVRTC Version               : (11, 2)
Thrust Version              : 101000
CUB Build Version           : 101000
cuDNN Build Version         : None
cuDNN Version               : None
NCCL Build Version          : 2804
NCCL Runtime Version        : 2804
cuTENSOR Version            : None
Device 0 Name               : NVIDIA GeForce GTX 1650 with Max-Q Design
Device 0 Compute Capability : 75
```
- CUDA 11.2, NVIDIA driver 470.14
I just traced both matmuls and they take the same amount of time.
I ran your script and saw the time discrepancy; however, it goes away if you create the cuBLAS handle beforehand. On the first iteration, CuPy creates the cuBLAS handle, and that takes a lot of time. We defer this creation because the handle eats up a significant amount of GPU memory, while PyTorch creates all these handles at import time (this is part of why PyTorch consumes a lot of GPU memory once you import it).
If you create it before calling the actual matmul:

```python
cupy.cuda.device.get_cublas_handle()
```

your script will get better timings.
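Concretely, a minimal sketch of pulling that one-time cost out of the timed region (reusing the `timing()` context manager from the original post; the extra untimed warm-up matmul is an assumption on my part):

```python
import cupy

N = 3000
x = cupy.random.rand(N, N, dtype=cupy.float32)
y = cupy.random.rand(N, N, dtype=cupy.float32)

# One-time setup outside the timed region: create the cuBLAS handle
# and do one untimed warm-up matmul.
cupy.cuda.device.get_cublas_handle()
cupy.matmul(x, y)
cupy.cuda.Device().synchronize()

with timing() as t:
    z = cupy.matmul(x, y)
print('time', t.result)
```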
Just a wild thought: Could it be possible that you have multiple CUDA installations on your system, and CuPy and PyTorch accidentally picked up different versions? How were CuPy and PyTorch installed?
btw @huaxuan250 your PyTorch and CuPy versions do not exist… maybe typos?
If you use `cupyx.time.repeat()` to do the timing, it does the warm-up runs for you, so the handle creation time will be correctly excluded.
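For reference, a minimal sketch of what that could look like (keyword names as in CuPy 8.x, where this utility lives in `cupyx.time`; in newer releases it moved to `cupyx.profiler.benchmark`):

```python
import cupy
from cupyx.time import repeat

N = 3000
x = cupy.random.rand(N, N, dtype=cupy.float32)
y = cupy.random.rand(N, N, dtype=cupy.float32)

# n_warmup untimed runs come first, so one-time costs such as cuBLAS
# handle creation are excluded; n_repeat timed runs are then averaged.
print(repeat(cupy.matmul, (x, y), n_repeat=10, n_warmup=3))
```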
@huaxuan250 Can you please run the benchmarks as Python scripts using Nsight? Like this:

```
nsys profile python bench.py
```

Hi! Getting the traces is the most important step, because right now we can't figure out what is going on 😭
I can't run `nsys-ui`, since WSL doesn't have an X server (or at least I don't have one set up). But what I did is install Nsight on the Windows host and open the file from WSL. It doesn't look like there's any GPU displayed there either… but when I run the code the GPU is clearly working: both `nvidia-smi` and the Windows task manager show load, and the fan makes noise.

From Help > About:
Version: 2021.2.1.58-642947b Windows-x64. Qt version: 5.14.1. Google Protocol Buffers version: 3.10.0. Boost version: 1.70.0.
Let me know if you want me to run any additional tests. Meanwhile, I ran tests for other datatypes, and it looks like CuPy is faster for complex64 and the same as torch for both float64 and complex128. So CuPy is slower only for float32. Full data: https://charts.mongodb.com/charts-project-0-logoj/public/dashboards/a2276780-242f-4915-8e43-9336374f2ef3
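For reference, a sketch of how such a per-dtype sweep could be driven (using `cupyx.time.repeat` as suggested above; the dtype list is taken from the comment):

```python
import cupy
from cupyx.time import repeat

N = 3000
for dtype in ('float32', 'float64', 'complex64', 'complex128'):
    # rand() produces float64; cast to the dtype under test.
    x = cupy.random.rand(N, N).astype(dtype)
    y = cupy.random.rand(N, N).astype(dtype)
    print(dtype, repeat(cupy.matmul, (x, y), n_repeat=10, n_warmup=3))
```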