tensorflow: SVD on GPU is slower than SVD on CPU
OS:
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS release 7.4.1708
- TensorFlow installed from (source or binary): From source
- Python version: 2.7.13
- Bazel version: 0.6.1
- CUDA/cuDNN version: CUDA 8.0/cuDNN 6.0.21
- GPU model and memory: GeForce GTX 950M, memory 4GB
Output of tf_env_collect.sh
== cat /etc/issue ===============================================
Linux zhanghao 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
== are we in docker =============================================
No
== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright © 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty;
not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux zhanghao 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
== check for virtualenv =========================================
False
== tensorflow import ============================================
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: No module named tensorflow
== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Tue Oct 10 16:36:08 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 950M Off | 00000000:0A:00.0 Off | N/A |
| N/A 45C P0 N/A / N/A | 0MiB / 4044MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
== cat /etc/issue ===============================================
Linux zhanghao 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
== are we in docker =============================================
No
== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright © 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty;
not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux zhanghao 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy (1.12.1)
protobuf (3.4.0)
tensorflow (1.4.0rc0)
tensorflow-tensorboard (0.4.0rc1)
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.4.0-rc0
tf.GIT_VERSION = v1.3.0-rc1-3111-g4196d6d
tf.COMPILER_VERSION = v1.3.0-rc1-3111-g4196d6d
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH /usr/local/cuda/lib64/:/usr/local/cuda/lib64/stubs/:/usr/local/cuda/extras/CUPTI/lib64/:/usr/local/cuda/nvvm/lib64/:/usr/lib64/nvidia/:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/ipp/lib/intel64:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64/gcc4.7:/opt/intel/debugger_2017/iga/lib:/opt/intel/debugger_2017/libipt/intel64/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/../tbb/lib/intel64_lin/gcc4.4
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Tue Oct 10 16:36:37 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 950M Off | 00000000:0A:00.0 Off | N/A |
| N/A 45C P0 N/A / N/A | 0MiB / 4044MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.61
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart_static.a
Output of python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
('v1.3.0-rc1-3111-g4196d6d', '1.4.0-rc0')
Describe the problem
SVD on GPU is slower than SVD on CPU
Source code / logs
File: main.py

import tensorflow as tf
import numpy as np
import sys

D = 1024
dA = np.random.normal(size=(D, D))

# Run on the GPU by default; pass any extra argument to force the CPU instead.
dev = "/gpu:0" if len(sys.argv) == 1 else "/cpu:0"

with tf.device(dev):
    A = tf.placeholder(shape=(D, D), dtype=tf.float32)
    S, U, V = tf.svd(A)

config = tf.ConfigProto()
config.log_device_placement = True
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

sess = tf.Session(config=config)
for _ in xrange(10):
    dS, dU, dV = sess.run((S, U, V), feed_dict={A: dA})
Run on GPU
time python main.py
2017-10-10 16:28:49.047703: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-10 16:28:49.048176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 950M major: 5 minor: 0 memoryClockRate(GHz): 1.124
pciBusID: 0000:0a:00.0
totalMemory: 3.95GiB freeMemory: 3.91GiB
2017-10-10 16:28:49.048205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0
2017-10-10 16:28:49.064960: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0
Svd: (Svd): /job:localhost/replica:0/task:0/device:GPU:0
2017-10-10 16:28:49.067234: I tensorflow/core/common_runtime/placer.cc:874] Svd: (Svd)/job:localhost/replica:0/task:0/device:GPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
2017-10-10 16:28:49.067302: I tensorflow/core/common_runtime/placer.cc:874] Placeholder: (Placeholder)/job:localhost/replica:0/task:0/device:GPU:0
2017-10-10 16:28:49.074053: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x488e860
python main.py 27.50s user 2.30s system 100% cpu 29.658 total
Run on CPU
time python main.py -
2017-10-10 16:29:53.252138: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-10 16:29:53.252572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 950M major: 5 minor: 0 memoryClockRate(GHz): 1.124
pciBusID: 0000:0a:00.0
totalMemory: 3.95GiB freeMemory: 3.91GiB
2017-10-10 16:29:53.252600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0
2017-10-10 16:29:53.269242: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0
Svd: (Svd): /job:localhost/replica:0/task:0/device:CPU:0
2017-10-10 16:29:53.271505: I tensorflow/core/common_runtime/placer.cc:874] Svd: (Svd)/job:localhost/replica:0/task:0/device:CPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:CPU:0
2017-10-10 16:29:53.271544: I tensorflow/core/common_runtime/placer.cc:874] Placeholder: (Placeholder)/job:localhost/replica:0/task:0/device:CPU:0
python main.py - 34.33s user 10.68s system 621% cpu 7.241 total
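Note that the wall-clock totals above include Python startup, graph construction, and (on the GPU run) cuSolver handle creation. A rough sketch of a variant that warms up first and then times only the run loop (not part of the original report, same D=1024 setup):

import sys
import time
import numpy as np
import tensorflow as tf

D = 1024
dA = np.random.normal(size=(D, D)).astype(np.float32)
dev = "/gpu:0" if len(sys.argv) == 1 else "/cpu:0"

with tf.device(dev):
    A = tf.placeholder(shape=(D, D), dtype=tf.float32)
    S, U, V = tf.svd(A)

sess = tf.Session()
sess.run((S, U, V), feed_dict={A: dA})  # warm-up: excludes handle creation etc.

start = time.time()
for _ in range(10):
    sess.run((S, U, V), feed_dict={A: dA})
print("%s: %.3f s per SVD" % (dev, (time.time() - start) / 10))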
About this issue
- State: closed
- Created 7 years ago
- Comments: 33 (18 by maintainers)
The GPU version is so slow right now that I don't think it makes sense for anyone. Here are some numbers from a recent benchmark: https://github.com/yaroslavvb/stuff/tree/master/linalg-benchmark
Wow, this is a good point. According to this benchmark, http://www.netlib.org/lapack/lug/node71.html, GESDD is much faster than GESVD. Sadly, GESDD is not implemented in cuSolver, so we would have to use Magma instead. Magma also appears to be under more active development at the moment (judging by its publications), and it has some other highly interesting algorithm implementations. Maybe replace cuSolver completely with Magma? Or add a compilation switch to select which implementation is used? The details would then need to be hidden in the cuSolver wrapper (which should probably be renamed in that case).
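For reference, the GESDD/GESVD gap is easy to reproduce on the CPU with SciPy, which exposes both LAPACK drivers. This is a rough sketch, not part of the original discussion; it assumes a SciPy recent enough to have the lapack_driver argument, and the exact timings depend on the BLAS/LAPACK build:

import time
import numpy as np
from scipy import linalg

a = np.random.normal(size=(1024, 1024)).astype(np.float32)

for driver in ("gesdd", "gesvd"):
    start = time.time()
    linalg.svd(a, lapack_driver=driver)  # full SVD with the chosen LAPACK driver
    print("%s: %.3f s" % (driver, time.time() - start))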
Did you also benchmark against 2.0a?
Hello, I'm back from a long vacation and I'd like to add my two cents to the discussion:
I would vote for keeping the GPU version of the SVD, because for some CPU/GPU configurations and problem sizes it is indeed faster. That was the case for me when I proposed the pull request. Would it be possible to make the CPU version the default and only use the GPU version when it is explicitly requested? If I'm not mistaken, TF always uses the GPU version if one is available and nothing else is specified.
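Until such a default exists, a user who hits the slow GPU kernel can already pin just the SVD to the CPU while leaving the rest of the graph on the GPU. A minimal sketch of that workaround (illustrative, not code from this thread):

import numpy as np
import tensorflow as tf

dA = np.random.normal(size=(1024, 1024)).astype(np.float32)

with tf.device("/gpu:0"):
    A = tf.placeholder(shape=(1024, 1024), dtype=tf.float32)
    B = tf.matmul(A, A)          # stays on the GPU

with tf.device("/cpu:0"):
    S, U, V = tf.svd(B)          # only the SVD is forced onto the CPU

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
dS, dU, dV = sess.run((S, U, V), feed_dict={A: dA})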
@hzhangxyz your benchmark is confusing because it has both the CPU and the GPU SVD in a single session run. It's better to isolate a single op if the goal is to show that the op's kernel is slow.
I have a benchmark in https://github.com/tensorflow/tensorflow/issues/13222#issuecomment-331642490 which isolates just the GPU SVD and rules out memory transfers.
In that example (n=1534, float32), the TF CPU version runs about 4.6x slower than the corresponding call in MKL-enabled numpy, and the TF GPU version runs about 21x slower.
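The isolation described above can be sketched roughly as follows: keep the input resident on the GPU as a variable (so no host-to-device copy is timed) and fetch only a scalar (so almost nothing is copied back). This illustrates the idea, not the linked benchmark's actual code:

import time
import numpy as np
import tensorflow as tf

n = 1534
with tf.device("/gpu:0"):
    # GPU-resident input: no feed_dict, so no host-to-device copy inside the timed loop.
    A = tf.Variable(np.random.normal(size=(n, n)).astype(np.float32))
    s, u, v = tf.svd(A)
    loss = tf.reduce_sum(s)  # fetch a scalar so the copy back to the host is negligible

sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(loss)  # warm-up

start = time.time()
for _ in range(5):
    sess.run(loss)
print("GPU SVD (n=%d): %.3f s per call" % (n, (time.time() - start) / 5))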
Regarding commit https://github.com/tensorflow/tensorflow/commit/22a886b: it seems that no kind of decomposition is well suited to the GPU? For QR, the CPU is also faster than the GPU.
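The same style of measurement applies to QR. A minimal sketch comparing the two placements (assuming a 1024 x 1024 float32 input; illustrative only):

import time
import numpy as np
import tensorflow as tf

dA = np.random.normal(size=(1024, 1024)).astype(np.float32)

for dev in ("/cpu:0", "/gpu:0"):
    with tf.Graph().as_default(), tf.device(dev):
        A = tf.constant(dA)
        Q, R = tf.qr(A)
        with tf.Session() as sess:
            sess.run((Q, R))  # warm-up
            start = time.time()
            for _ in range(10):
                sess.run((Q, R))
            print("%s: %.3f s per QR" % (dev, (time.time() - start) / 10))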
Hello, I am having a different problem: eigendecomposition is extremely slow in TF on both CPU and GPU, even on the CPU for a moderately sized matrix (400 x 400). Any ideas?
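A minimal sketch of that kind of CPU benchmark, assuming tf.self_adjoint_eig on a random symmetric 400 x 400 float32 matrix and comparing against NumPy's eigh (an illustration, not the poster's original snippet):

import time
import numpy as np
import tensorflow as tf

n = 400
m = np.random.normal(size=(n, n)).astype(np.float32)
m = (m + m.T) / 2  # symmetrize so a self-adjoint eigensolver applies

with tf.device("/cpu:0"):
    A = tf.constant(m)
    e, v = tf.self_adjoint_eig(A)

with tf.Session() as sess:
    sess.run((e, v))  # warm-up
    start = time.time()
    for _ in range(10):
        sess.run((e, v))
    print("TF CPU self_adjoint_eig: %.4f s per call" % ((time.time() - start) / 10))

start = time.time()
for _ in range(10):
    np.linalg.eigh(m)
print("NumPy eigh: %.4f s per call" % ((time.time() - start) / 10))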
NVIDIA has been working (and continues to work) on improving the performance of their symmetric eigensolvers and SVD in CUDA 9, especially 9.1, which also adds batched interfaces. We will switch to the faster versions and/or the batched interfaces as those CUDA versions become supported by TensorFlow.
According to these release notes, PyTorch is using gesdd (via MAGMA?) to do SVD on the GPU, whereas TF appears to be using gesvd (via cuSOLVER, based on https://github.com/tensorflow/tensorflow/issues/13222#issuecomment-331621523).
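For a rough cross-check against a gesdd-backed GPU implementation, something along these lines can be timed (assumes a CUDA-enabled PyTorch install; torch.svd was the API of that era):

import time
import torch

a = torch.randn(1024, 1024).cuda()
torch.svd(a)                     # warm-up (loads the GPU solver kernels)
torch.cuda.synchronize()

start = time.time()
for _ in range(10):
    u, s, v = torch.svd(a)
torch.cuda.synchronize()         # wait for the asynchronous GPU work to finish
print("PyTorch GPU SVD: %.3f s per call" % ((time.time() - start) / 10))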