tensorflow: SVD on GPU is slower than SVD on CPU
OS:
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS release 7.4.1708
- TensorFlow installed from (source or binary): From source
- Python version: 2.7.13
- Bazel version: 0.6.1
- CUDA/cuDNN version: CUDA 8.0/cuDNN 6.0.21
- GPU model and memory: GeForce GTX 950M, memory 4GB
Output of tf_env_collect.sh
== cat /etc/issue ===============================================
Linux zhanghao 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
== are we in docker =============================================
No
== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright © 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty;
not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux zhanghao 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
== check for virtualenv =========================================
False
== tensorflow import ============================================
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: No module named tensorflow
== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Tue Oct 10 16:36:08 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 950M Off | 00000000:0A:00.0 Off | N/A |
| N/A 45C P0 N/A / N/A | 0MiB / 4044MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
== cat /etc/issue ===============================================
Linux zhanghao 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
== are we in docker =============================================
No
== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright © 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty;
not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux zhanghao 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy (1.12.1)
protobuf (3.4.0)
tensorflow (1.4.0rc0)
tensorflow-tensorboard (0.4.0rc1)
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.4.0-rc0
tf.GIT_VERSION = v1.3.0-rc1-3111-g4196d6d
tf.COMPILER_VERSION = v1.3.0-rc1-3111-g4196d6d
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH /usr/local/cuda/lib64/:/usr/local/cuda/lib64/stubs/:/usr/local/cuda/extras/CUPTI/lib64/:/usr/local/cuda/nvvm/lib64/:/usr/lib64/nvidia/:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/ipp/lib/intel64:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64/gcc4.7:/opt/intel/debugger_2017/iga/lib:/opt/intel/debugger_2017/libipt/intel64/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/../tbb/lib/intel64_lin/gcc4.4
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Tue Oct 10 16:36:37 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 950M Off | 00000000:0A:00.0 Off | N/A |
| N/A 45C P0 N/A / N/A | 0MiB / 4044MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.61
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart_static.a
Output of python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
('v1.3.0-rc1-3111-g4196d6d', '1.4.0-rc0')
Describe the problem
SVD on GPU is slower than SVD on CPU
Source code / logs
File: main.py

import tensorflow as tf
import numpy as np
import sys

D = 1024
dA = np.random.normal(size=(D, D))

# Run on the GPU by default; pass any extra argument to force the CPU instead.
dev = "/gpu:0" if len(sys.argv) == 1 else "/cpu:0"

with tf.device(dev):
    A = tf.placeholder(shape=(D, D), dtype=tf.float32)
    S, U, V = tf.svd(A)

config = tf.ConfigProto()
config.log_device_placement = True
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

sess = tf.Session(config=config)
for _ in xrange(10):
    dS, dU, dV = sess.run((S, U, V), feed_dict={A: dA})
Run on GPU
time python main.py
2017-10-10 16:28:49.047703: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-10 16:28:49.048176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 950M major: 5 minor: 0 memoryClockRate(GHz): 1.124
pciBusID: 0000:0a:00.0
totalMemory: 3.95GiB freeMemory: 3.91GiB
2017-10-10 16:28:49.048205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0
2017-10-10 16:28:49.064960: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0
Svd: (Svd): /job:localhost/replica:0/task:0/device:GPU:0
2017-10-10 16:28:49.067234: I tensorflow/core/common_runtime/placer.cc:874] Svd: (Svd)/job:localhost/replica:0/task:0/device:GPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
2017-10-10 16:28:49.067302: I tensorflow/core/common_runtime/placer.cc:874] Placeholder: (Placeholder)/job:localhost/replica:0/task:0/device:GPU:0
2017-10-10 16:28:49.074053: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x488e860
python main.py 27.50s user 2.30s system 100% cpu 29.658 total
Run on CPU
time python main.py -
2017-10-10 16:29:53.252138: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-10 16:29:53.252572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 950M major: 5 minor: 0 memoryClockRate(GHz): 1.124
pciBusID: 0000:0a:00.0
totalMemory: 3.95GiB freeMemory: 3.91GiB
2017-10-10 16:29:53.252600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0
2017-10-10 16:29:53.269242: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0
Svd: (Svd): /job:localhost/replica:0/task:0/device:CPU:0
2017-10-10 16:29:53.271505: I tensorflow/core/common_runtime/placer.cc:874] Svd: (Svd)/job:localhost/replica:0/task:0/device:CPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:CPU:0
2017-10-10 16:29:53.271544: I tensorflow/core/common_runtime/placer.cc:874] Placeholder: (Placeholder)/job:localhost/replica:0/task:0/device:CPU:0
python main.py - 34.33s user 10.68s system 621% cpu 7.241 total
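Note that the wall-clock totals above include Python startup, graph construction, and (on the GPU run) cuSolver handle creation. A rough sketch of a variant that warms up first and then times only the run loop (not part of the original report, same D=1024 setup):

import sys
import time
import numpy as np
import tensorflow as tf

D = 1024
dA = np.random.normal(size=(D, D)).astype(np.float32)
dev = "/gpu:0" if len(sys.argv) == 1 else "/cpu:0"

with tf.device(dev):
    A = tf.placeholder(shape=(D, D), dtype=tf.float32)
    S, U, V = tf.svd(A)

sess = tf.Session()
sess.run((S, U, V), feed_dict={A: dA})  # warm-up: excludes handle creation etc.

start = time.time()
for _ in range(10):
    sess.run((S, U, V), feed_dict={A: dA})
print("%s: %.3f s per SVD" % (dev, (time.time() - start) / 10))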
About this issue
- State: closed
- Created 7 years ago
- Comments: 33 (18 by maintainers)
The GPU version is so slow right now that I don't think it makes sense for anyone. Here are some numbers from a recent benchmark: https://github.com/yaroslavvb/stuff/tree/master/linalg-benchmark
Wow, this is a good point. According to this benchmark, http://www.netlib.org/lapack/lug/node71.html, GESDD is much faster than GESVD. Sadly, GESDD is not implemented in cuSolver, so we would have to use Magma instead. Magma also appears to be under more active development at the moment (judging by its publications), and it has some other highly interesting algorithm implementations. Maybe replace cuSolver completely with Magma? Or add a compilation switch to select which implementation is used? The details would then need to be hidden in the cuSolver wrapper (which should probably be renamed in that case).
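For reference, the GESDD/GESVD gap is easy to reproduce on the CPU with SciPy, which exposes both LAPACK drivers. This is a rough sketch, not part of the original discussion; it assumes a SciPy recent enough to have the lapack_driver argument, and the exact timings depend on the BLAS/LAPACK build:

import time
import numpy as np
from scipy import linalg

a = np.random.normal(size=(1024, 1024)).astype(np.float32)

for driver in ("gesdd", "gesvd"):
    start = time.time()
    linalg.svd(a, lapack_driver=driver)  # full SVD with the chosen LAPACK driver
    print("%s: %.3f s" % (driver, time.time() - start))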
Did you also benchmark against 2.0a?
Hello, I'm back from a long vacation and I'd like to add my two cents to the discussion:
I would vote for keeping the GPU version of the SVD, because for some CPU/GPU configurations and problem sizes it is indeed faster. That was the case for me when I proposed the pull request. Would it be possible to make the CPU version the default and only use the GPU version when it is explicitly requested? If I'm not mistaken, TF always uses the GPU version if one is available and nothing else is specified.
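Until such a default exists, a user who hits the slow GPU kernel can already pin just the SVD to the CPU while leaving the rest of the graph on the GPU. A minimal sketch of that workaround (illustrative, not code from this thread):

import numpy as np
import tensorflow as tf

dA = np.random.normal(size=(1024, 1024)).astype(np.float32)

with tf.device("/gpu:0"):
    A = tf.placeholder(shape=(1024, 1024), dtype=tf.float32)
    B = tf.matmul(A, A)          # stays on the GPU

with tf.device("/cpu:0"):
    S, U, V = tf.svd(B)          # only the SVD is forced onto the CPU

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
dS, dU, dV = sess.run((S, U, V), feed_dict={A: dA})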
@hzhangxyz your benchmark is confusing because it has both the CPU and the GPU SVD in a single session run. It's better to isolate a single op if the goal is to show that the op's kernel is slow.
I have a benchmark in https://github.com/tensorflow/tensorflow/issues/13222#issuecomment-331642490 which isolates just the GPU SVD and rules out memory transfers.
In that example (n=1534, float32), the TF CPU version runs about 4.6x slower than the corresponding call in MKL-enabled numpy, and the TF GPU version runs about 21x slower.
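The isolation described above can be sketched roughly as follows: keep the input resident on the GPU as a variable (so no host-to-device copy is timed) and fetch only a scalar (so almost nothing is copied back). This illustrates the idea, not the linked benchmark's actual code:

import time
import numpy as np
import tensorflow as tf

n = 1534
with tf.device("/gpu:0"):
    # GPU-resident input: no feed_dict, so no host-to-device copy inside the timed loop.
    A = tf.Variable(np.random.normal(size=(n, n)).astype(np.float32))
    s, u, v = tf.svd(A)
    loss = tf.reduce_sum(s)  # fetch a scalar so the copy back to the host is negligible

sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(loss)  # warm-up

start = time.time()
for _ in range(5):
    sess.run(loss)
print("GPU SVD (n=%d): %.3f s per call" % (n, (time.time() - start) / 5))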
Regarding commit https://github.com/tensorflow/tensorflow/commit/22a886b: it seems that no kind of decomposition is well suited to the GPU? For QR, the CPU is also faster than the GPU.
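The same style of measurement applies to QR. A minimal sketch comparing the two placements (assuming a 1024 x 1024 float32 input; illustrative only):

import time
import numpy as np
import tensorflow as tf

dA = np.random.normal(size=(1024, 1024)).astype(np.float32)

for dev in ("/cpu:0", "/gpu:0"):
    with tf.Graph().as_default(), tf.device(dev):
        A = tf.constant(dA)
        Q, R = tf.qr(A)
        with tf.Session() as sess:
            sess.run((Q, R))  # warm-up
            start = time.time()
            for _ in range(10):
                sess.run((Q, R))
            print("%s: %.3f s per QR" % (dev, (time.time() - start) / 10))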
Hello, I am having a different problem: eigendecomposition is extremely slow in TF on both CPU and GPU, even on the CPU for a moderately sized matrix (400 x 400). Any ideas?
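A minimal sketch of that kind of CPU benchmark, assuming tf.self_adjoint_eig on a random symmetric 400 x 400 float32 matrix and comparing against NumPy's eigh (an illustration, not the poster's original snippet):

import time
import numpy as np
import tensorflow as tf

n = 400
m = np.random.normal(size=(n, n)).astype(np.float32)
m = (m + m.T) / 2  # symmetrize so a self-adjoint eigensolver applies

with tf.device("/cpu:0"):
    A = tf.constant(m)
    e, v = tf.self_adjoint_eig(A)

with tf.Session() as sess:
    sess.run((e, v))  # warm-up
    start = time.time()
    for _ in range(10):
        sess.run((e, v))
    print("TF CPU self_adjoint_eig: %.4f s per call" % ((time.time() - start) / 10))

start = time.time()
for _ in range(10):
    np.linalg.eigh(m)
print("NumPy eigh: %.4f s per call" % ((time.time() - start) / 10))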
NVIDIA has been working (and continues to work) on improving the performance of their symmetric eigensolvers and SVD in CUDA 9, especially 9.1, which also adds batched interfaces. We will switch to the faster versions and/or the batched interfaces as those CUDA versions become supported by TensorFlow.
According to these release notes, PyTorch is using gesdd (via MAGMA?) to do SVD on the GPU, whereas TF appears to be using gesvd (via cuSOLVER, based on https://github.com/tensorflow/tensorflow/issues/13222#issuecomment-331621523).
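For a rough cross-check against a gesdd-backed GPU implementation, something along these lines can be timed (assumes a CUDA-enabled PyTorch install; torch.svd was the API of that era):

import time
import torch

a = torch.randn(1024, 1024).cuda()
torch.svd(a)                     # warm-up (loads the GPU solver kernels)
torch.cuda.synchronize()

start = time.time()
for _ in range(10):
    u, s, v = torch.svd(a)
torch.cuda.synchronize()         # wait for the asynchronous GPU work to finish
print("PyTorch GPU SVD: %.3f s per call" % ((time.time() - start) / 10))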