cupy: Performance measurements - `cp.matmul` slower than `torch.matmul`
I just installed CuPy and ran some simple performance benchmarks for comparison. I chose matrix multiplication since it's the simplest problem to start with.
I measure CuPy time with the following code snippet:
```python
from contextlib import contextmanager

import cupy

@contextmanager
def timing():
    class Foo:  # dummy class to pass results out of the context manager
        pass
    res = Foo()
    start = cupy.cuda.Event(disable_timing=False)
    end = cupy.cuda.Event(disable_timing=False)
    start.record()
    yield res
    end.record()
    # I'm not sure about this line, just guessed by analogy from torch.
    # Without it, a DeviceNotReady error is raised.
    end.synchronize()
    res.result = cupy.cuda.get_elapsed_time(start, end) / 1000  # ms -> s
```
I have a similar script for measuring time in torch, based on this thread: https://discuss.pytorch.org/t/how-to-measure-time-in-pytorch/26964. The above context manager is used in the following way:
```python
N = 3000  # example size
x = cupy.random.rand(N, N, dtype=cupy.float32)
y = cupy.random.rand(N, N, dtype=cupy.float32)
with timing() as t:
    z = cupy.matmul(x, y)
print('time', t.result)
```
Link to the full measurement script: https://github.com/danlkv/QTensor/tree/merged_ix/scratchpad/bench/matmul
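For reference, the torch-side measurement follows the CUDA-event pattern from the linked thread; a minimal sketch (names and the size `N` are illustrative, tensors are created directly on the GPU):

```python
import torch

N = 3000
x = torch.rand(N, N, dtype=torch.float32, device='cuda')
y = torch.rand(N, N, dtype=torch.float32, device='cuda')

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
z = torch.matmul(x, y)
end.record()
torch.cuda.synchronize()  # wait until the recorded events have completed
print('time', start.elapsed_time(end) / 1000)  # elapsed_time returns ms
```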
I know that torch also uses CUDA, so I would expect the times for torch and CuPy to be similar, since most of the work is done by the same (?) CUDA backend function. I do 10 runs of a square N×N matrix multiplication; here are some of the results I get:
| library | matrix size | FLOP/s | average time of 10 runs (s) |
|---|---|---|---|
| torch | 2000 | 879.23G | 0.00909883852005005 |
| torch | 3000 | 1.06T | 0.02553908138275146 |
| torch | 3001 | 1.22T | 0.02214127025604248 |
| cupy | 2000 | 690.78G | 0.011581078433990479 |
| cupy | 3000 | 741.55G | 0.036410070610046384 |
| cupy | 3001 | 720.97G | 0.037487194061279296 |
So it looks like torch is somehow ~50% faster… Torch also gets ~15% faster for size 3001 than for 3000, which is strange, but probably not related to CuPy.
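(As a sanity check on the FLOP/s column: it appears to count N³ operations, i.e. one fused multiply-add per inner-product step; e.g. 3000³ / 0.02554 s ≈ 1.06 × 10¹², matching the torch row for size 3000.)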
My guess would be that some time is spent on data transfer to the GPU: while I don't include the `.to('cuda')` call in the torch measurements, CuPy might do the tensor movement inside `cupy.matmul`.
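One way to check that guess, as a sketch reusing the `timing()` context manager above, is to time an explicit host-to-device copy separately from the matmul:

```python
import numpy
import cupy

N = 3000
x_cpu = numpy.random.rand(N, N).astype(numpy.float32)

# Time just the host-to-device transfer.
with timing() as t:
    x_gpu = cupy.asarray(x_cpu)
print('transfer time', t.result)
```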
System specs
- System: WSL Ubuntu 20.04
- CuPy:

```
» python -c 'import cupy; cupy.show_config()'
OS                          : Linux-4.19.128-microsoft-standard-x86_64-with-glibc2.29
CuPy Version                : 8.6.0
NumPy Version               : 1.19.4
SciPy Version               : 1.3.3
Cython Build Version        : 0.29.22
CUDA Root                   : /usr/local/cuda
CUDA Build Version          : 11020
CUDA Driver Version         : 11030
CUDA Runtime Version        : 11020
cuBLAS Version              : 11401
cuFFT Version               : 10401
cuRAND Version              : 10203
cuSOLVER Version            : (11, 1, 0)
cuSPARSE Version            : 11401
NVRTC Version               : (11, 2)
Thrust Version              : 101000
CUB Build Version           : 101000
cuDNN Build Version         : None
cuDNN Version               : None
NCCL Build Version          : 2804
NCCL Runtime Version        : 2804
cuTENSOR Version            : None
Device 0 Name               : NVIDIA GeForce GTX 1650 with Max-Q Design
Device 0 Compute Capability : 75
```
- CUDA 11.2, NVIDIA driver 470.14
I just traced both matmuls and they take the same amount of time.
I ran your script and saw the time discrepancy; however, it goes away if you create the cuBLAS handle beforehand. On the first iteration, CuPy creates the cuBLAS handle, and that takes a lot of time. We defer this creation because the handle eats up a significant amount of GPU memory, while PyTorch creates all these handles at import time (this is part of why PyTorch consumes a lot of GPU memory once you import it).
If you create it before calling the actual matmul:

```python
cupy.cuda.device.get_cublas_handle()
```

your script will get better timings.
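Concretely, a minimal sketch of pulling that one-time cost out of the timed region (reusing the `timing()` context manager from the original post; the extra untimed warm-up matmul is an assumption on my part):

```python
import cupy

N = 3000
x = cupy.random.rand(N, N, dtype=cupy.float32)
y = cupy.random.rand(N, N, dtype=cupy.float32)

# One-time setup outside the timed region: create the cuBLAS handle
# and do one untimed warm-up matmul.
cupy.cuda.device.get_cublas_handle()
cupy.matmul(x, y)
cupy.cuda.Device().synchronize()

with timing() as t:
    z = cupy.matmul(x, y)
print('time', t.result)
```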
Just a wild thought: Could it be possible that you have multiple CUDA installations on your system, and CuPy and PyTorch accidentally picked up different versions? How were CuPy and PyTorch installed?
btw @huaxuan250 your PyTorch and CuPy versions do not exist… maybe typos?
If you use `cupyx.time.repeat()` to do the timing, it does the warm-up runs for you, so the handle creation time will be correctly excluded.
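For reference, a minimal sketch of what that could look like (keyword names as in CuPy 8.x, where this utility lives in `cupyx.time`; in newer releases it moved to `cupyx.profiler.benchmark`):

```python
import cupy
from cupyx.time import repeat

N = 3000
x = cupy.random.rand(N, N, dtype=cupy.float32)
y = cupy.random.rand(N, N, dtype=cupy.float32)

# n_warmup untimed runs come first, so one-time costs such as cuBLAS
# handle creation are excluded; n_repeat timed runs are then averaged.
print(repeat(cupy.matmul, (x, y), n_repeat=10, n_warmup=3))
```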
@huaxuan250 Can you please run the benchmarks as Python scripts using Nsight? Like this:

```
nsys profile python bench.py
```

Hi! Getting the traces is the most important step, because right now we can't figure out what is going on 😭
I can't run `nsys-ui`, since WSL doesn't have an X server (or at least I don't have one set up). But what I did is install Nsight on the Windows host and open the file from WSL. It doesn't look like there's any GPU displayed there either… but when I run the code the GPU is clearly working: both `nvidia-smi` and the Windows task manager show load, and the fan makes noise.

From Help > About:
Version: 2021.2.1.58-642947b Windows-x64. Qt version: 5.14.1. Google Protocol Buffers version: 3.10.0. Boost version: 1.70.0.
Let me know if you want me to run any additional tests. Meanwhile, I ran tests for other datatypes, and it looks like CuPy is faster for complex64 and the same as torch for both float64 and complex128. So CuPy is slower only for float32. Full data: https://charts.mongodb.com/charts-project-0-logoj/public/dashboards/a2276780-242f-4915-8e43-9336374f2ef3
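For reference, a sketch of how such a per-dtype sweep could be driven (using `cupyx.time.repeat` as suggested above; the dtype list is taken from the comment):

```python
import cupy
from cupyx.time import repeat

N = 3000
for dtype in ('float32', 'float64', 'complex64', 'complex128'):
    # rand() produces float64; cast to the dtype under test.
    x = cupy.random.rand(N, N).astype(dtype)
    y = cupy.random.rand(N, N).astype(dtype)
    print(dtype, repeat(cupy.matmul, (x, y), n_repeat=10, n_warmup=3))
```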