tensorflow: Bug: tensorflow-gpu takes long time before beginning to compute
I noticed that tensorflow always takes about 2 minutes before it actually starts to compute. I’ve been trying to find out why this happens, and nothing has worked so far.
The TensorFlow site says I should use CUDA® Toolkit 9.0 and cuDNN v7.0. I have CUDA 9.0, so I downloaded cuDNN 7.0.5 for CUDA 9.0 and pasted the files to *C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0*, overwriting the ones from cuDNN 7.1.2, which I tested earlier. To make sure, I pip-installed tensorflow-gpu into a fresh anaconda env. See install here. The issue is still the same.
CUDA works, since it prints ‘Hello, TensorFlow!’ when I use the official test example, but before that it takes about 2 minutes every time!
When I tested this with another wheel (linked in this tutorial; I did not compile it myself) on CUDA 9.1/cuDNN 7.0.5, I had the same issue. An NVIDIA employee on Stack Overflow suggested I may be hitting a lengthy JIT compile step, because the GTX 1080 has compute capability 6.1, which the wheel I used may not be compiled for.
So I tried to find wheels for tensorflow with compute capability 6.1 for Windows, but the only one I found and tested produced the same problem.
Am I doing something wrong here, or do I just have to accept the 2-minute delay every time I start my tensorflow/keras scripts?
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Code:
import time
start_time = time.time()
import tensorflow as tf
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))
timer = time.time()
print(timer - start_time)
Output:
(tf_clean) C:\python_code\test>C:/anaconda/envs/tf_clean/python.exe c:/python_code/test/tf_test.py
2018-04-18 14:36:04.376661: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-04-18 14:36:04.689661: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.60GiB
2018-04-18 14:36:04.699485: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-18 14:38:12.227561: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-18 14:38:12.234504: I T:\src\github\tens2018-04-18 14:38:12.237156: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N
2018-04-18 14:38:12.240997: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6379 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1
2018-04-18 14:38:12.548288: I T:\src\github\tensorflow\tensorflow\core\common_runtime\direct_session.cc:297] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-18 14:38:12.559262: I T:\src\github\tensorflow\tensorflow\core\common_runtime\placer.cc:884] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-18 14:38:12.564847: I T:\src\github\tensorflow\tensorflow\core\common_runtime\placer.cc:884] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-18 14:38:12.570545: I T:\src\github\tensorflow\tensorflow\core\common_runtime\placer.cc:884] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
[49. 64.]]
129.14624643325806
- OS Platform and Distribution: Windows 10 Education (Version 10.0.16299 Build 16299), Intel® Core™ i5-7500 CPU @ 3.40GHz, 3408 MHz, 4 Cores
- TensorFlow installed from (source or binary): binary
- TensorFlow version: tensorflow-gpu 1.5.0, 1.7.0
- Python version: 3.5.5 & 3.6 (via anaconda, conda 4.5.1)
- Bazel Version: N/A
- CUDA/cuDNN version: Tested combinations: CUDA 9.0 and cuDNN 7.1.2 (tested on tensorflow 1.5.0, 1.7.0 and 1.8.0-dev20180329); CUDA 9.1 and cuDNN 7.0.5 (tested on tensorflow 1.5.0 and 1.7.0)
- GPU model and memory: NVIDIA GeForce GTX 1080 (GP104-400) [Hewlett-Packard], 8192 MBytes of GDDR5X SDRAM [Micron]
- Exact command to reproduce: See: Have I written custom code…
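To see where the time goes, the timestamps in a log like the one above can be compared line by line. The following is a minimal sketch and not part of TensorFlow: `largest_gap` is a hypothetical helper name, and the sample lines are abridged from the output above (source paths shortened to `...`).

```python
import re
from datetime import datetime

# TensorFlow log lines start with a timestamp such as "2018-04-18 14:36:04.699485:".
_TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}):")


def largest_gap(log_lines):
    """Return (seconds, line_before, line_after) for the longest pause between
    consecutive timestamped lines, or None if fewer than two are found."""
    stamped = []
    for line in log_lines:
        match = _TIMESTAMP.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            stamped.append((when, line))
    best = None
    for (t0, line0), (t1, line1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, line0, line1)
    return best


# Three lines abridged from the log in the report above.
sample = [
    "2018-04-18 14:36:04.689661: I ...gpu_device.cc:1344] Found device 0 with properties:",
    "2018-04-18 14:36:04.699485: I ...gpu_device.cc:1423] Adding visible gpu devices: 0",
    "2018-04-18 14:38:12.227561: I ...gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:",
]
gap, before, after = largest_gap(sample)
print(round(gap, 3), "seconds of silence after:", before)
```

On these sample lines the pause is roughly 127.5 seconds and sits directly after `Adding visible gpu devices: 0`, which accounts for nearly all of the ~129 seconds the script reports.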
================================================================= EDIT:
Threadstarter here, hello.
Could you try with the latest nightly? https://files.pythonhosted.org/packages/67/c0/e68a4f0400340b54c887703baa8eee188042c3d65a0cf535dda71abffbc2/tf_nightly_gpu-1.13.0.dev20190205-cp37-cp37m-win_amd64.whl
This works! I checked with that wheel, and then with tf-nightly-gpu-2.0-preview on PyPI, which also worked.
I initially wanted to use the anaconda cudatoolkit and cudnn packages, but currently cudnn is only available up to version 7.3.1 on anaconda-cloud. TensorFlow 2.0, however, is compiled against 7.4.1, so I had to do this the old-school way and download the installers from NVIDIA.
Soon, though…soon.
For everyone, here’s what I did, as a guide:
How to install Tensorflow Nightly 2.0 GPU in Anaconda on Windows 10 x64
• I installed these CUDA/CuDnn Versions:
– cuda_10.0.130_win10_network (Nvidia CUDA Download: https://developer.nvidia.com/cuda-toolkit)
– cuDNN v7.4.1 (Nov 8, 2018), for CUDA 10.0 (Nvidia CuDnn Download: https://developer.nvidia.com/cudnn)
– Don’t forget to check whether the CUDA installer has correctly added itself to the PATH system variable.
– Reboot.
• Now make a new environment in Anaconda and activate it:
– conda create --name tf2-nightly-gpu python=3.6
– activate tf2-nightly-gpu
• Now, with the new env still activated, install the latest Tensorflow 2.0 nightly GPU build from PYPI:
– pip install tf-nightly-gpu-2.0-preview
• For machine learning in Jupyter Notebook (or JupyterLab), you need these as well:
– conda install nb_conda matplotlib scipy Pillow pandas scikit-learn
• Check whether your GPU is recognized by TensorFlow. Open the Anaconda prompt, activate the new environment, type python, then press Enter. Now type:
import tensorflow as tf
tf.test.is_gpu_available(cuda_only=False,min_cuda_compute_capability=None)
• Output should be something like this:
(tf2-nightly-gpu) C:\Users\___>python
>>> import tensorflow as tf
>>> tf.test.is_gpu_available(cuda_only=False,min_cuda_compute_capability=None)
2019-03-19 17:46:25.722209: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-03-19 17:46:25.729724: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library nvcuda.dll
2019-03-19 17:46:25.922934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1551] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.61GiB
2019-03-19 17:46:25.938231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1674] Adding visible gpu devices: 0
2019-03-19 17:46:26.539185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1082] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-19 17:46:26.546009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1088] 0
2019-03-19 17:46:26.550123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1101] 0: N
2019-03-19 17:46:26.554188: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1222] Created TensorFlow device (/device:GPU:0 with 6360 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
True
• Done.
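For the PATH check mentioned in the guide, something like the following can be run from the same Anaconda prompt. It is only a sketch: `cuda_path_entries` is a hypothetical helper, and the example string assumes the CUDA 10.0 installer used its default location under `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0`.

```python
import os


def cuda_path_entries(path_value, sep=os.pathsep):
    """Return the PATH entries that mention CUDA (case-insensitive match)."""
    return [entry for entry in path_value.split(sep)
            if entry and "cuda" in entry.lower()]


# Example PATH value; the CUDA directory is an assumed default install location.
example = (r"C:\Windows\system32;"
           r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin;"
           r"C:\anaconda")
print(cuda_path_entries(example, sep=";"))

# The same check against the PATH of the current process.
print(cuda_path_entries(os.environ.get("PATH", "")))
```

An empty list for the current process suggests the installer did not update PATH, or that the prompt was opened before the reboot.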
About this issue
- State: closed
- Created 6 years ago
- Reactions: 31
- Comments: 108 (32 by maintainers)
Commits related to this issue
- Build more cuda compute capabilities in cmake build. Fixes #18652 PiperOrigin-RevId: 205858348 — committed to av8ramit/tensorflow by gunan 6 years ago
My code stops at the following point:
It is stuck for over 20 minutes. Could anybody help me solve this, or does anyone know the reason for it?
Has anything been discovered yet? I have the same problem with ‘Adding visible gpu devices: 0’ taking about 2-3 minutes, even after reboot and multiple runs. I’m using CUDA 9.0 and cuDNN 7.1.2. System: Red Hat Linux. GPU: GTX 750Ti.
Facing the same issue with Cuda 9.0, tensorflow 1.12.0, cuDNN 7.4, windows 10, Two Nvidia RTX 2080 Tis
Same problem here: Ubuntu 16.04 / tensorflow-gpu-1.14 / CUDA 10.0 / cuDNN 7.4 / Python 3.7 / GTX 950M
But just like @steel3d, it happened ONLY on the very first run (stuck for around 3min here). After that, it becomes instant.
I have the same problem using tensorflow 1.13.1, CUDA 10.0, cuDNN 7.5, Windows 10, nVidia 960m.
I’m having the same issue now. TF 1.15, Cuda 10.0, Cudnn 7, TF was custom compiled with AVX2, XLA, TRT, CC 3.5/3.7/7.0/7.5
I tried to debug it with strace and found a futex that blocks the execution thread:
18:38:00.532805 futex(0x7f852818fa78, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff) = 0 <47.742524>
Another thread starts working heavily, with a huge batch of mprotect calls:
18:38:00.534906 mprotect(0x7f84f98ac000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000030>
During this process I see 2 messages in the main thread:
2019-12-03 18:38:16.642624: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-12-03 18:38:26.258193: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
The futex is then released and the call returns its result.
My test run was like this:
Seems like something is calculated once, then cached. The cache is preserved upon system restart, but is missing on the first run. It’s also not part of my AMI.
I also tried to rerun the test with official TF 1.15 wheel, but faced the same problem.
Could somebody clarify several things?
My ~/.nv/ComputeCache is empty.
Maybe it’s related to https://github.com/keras-team/keras/issues/11126, not sure.
Thanks in advance!
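One thing that may be worth checking here is the CUDA driver's JIT cache, which is controlled by the environment variables `CUDA_CACHE_PATH`, `CUDA_CACHE_MAXSIZE` and `CUDA_CACHE_DISABLE`. The sketch below is an assumption about what could help, not a confirmed fix for this issue. It sets these variables (they have to be set before the first CUDA call, so in practice before importing tensorflow) and measures how much is stored in a cache directory. The function names and the 2 GiB limit are made up for illustration.

```python
import os
from pathlib import Path


def configure_cuda_jit_cache(cache_dir, max_bytes=2 * 1024**3, env=os.environ):
    """Point the CUDA driver's JIT cache at cache_dir and set its size limit.

    Call this before anything initializes CUDA (e.g. before `import tensorflow`).
    Returns the settings that were written into `env`.
    """
    env["CUDA_CACHE_DISABLE"] = "0"           # keep the cache enabled
    env["CUDA_CACHE_PATH"] = str(cache_dir)   # where compiled kernels are stored
    env["CUDA_CACHE_MAXSIZE"] = str(max_bytes)
    return {key: env[key] for key in
            ("CUDA_CACHE_DISABLE", "CUDA_CACHE_PATH", "CUDA_CACHE_MAXSIZE")}


def cache_size_bytes(cache_dir):
    """Total size of all files under cache_dir, or 0 if it does not exist."""
    root = Path(cache_dir)
    if not root.is_dir():
        return 0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())


# Demo against a plain dict so the real environment is left untouched.
demo_env = {}
default_cache = Path.home() / ".nv" / "ComputeCache"
print(configure_cuda_jit_cache(default_cache, env=demo_env))
print("bytes currently cached:", cache_size_bytes(default_cache))
```

If the reported size stays at 0 after a slow first run, the delay is probably not being cached by the driver in that directory, which would fit the empty ~/.nv/ComputeCache mentioned above.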
similar issue, adding device takes a few minutes. GPU 840M, python 3.6, CUDA 9.0, CUDNN 7.4.2, tensorflow 1.12.0
I have the exact same problem. It takes around 5 minutes at: Adding visible gpu devices: 0
My environment is Win10, tensorflow-gpu-2.0-beta1, CUDA 10.0, cuDNN 7.6, Python 3.6, with a GTX 850M.
When will the problem be fixed?
This works for me as well.
Hi all, I’m having the same problem: a waiting time of about 2 minutes before running what I actually want to run. The text below is what I get and what I see for two minutes:
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.Session(config=tf.ConfigProto(log_device_placement=True))
2018-09-05 09:54:50.130623: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-09-05 09:54:50.374925: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-09-05 09:54:50.375571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce 940M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:01:00.0
totalMemory: 1.96GiB freeMemory: 1.93GiB
2018-09-05 09:54:50.375588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
After the waiting time is finally over I get the rest of the execution:
2018-09-05 09:58:35.611421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-05 09:58:35.611455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-09-05 09:58:35.611462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2018-09-05 09:58:35.611629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1687 MB memory) -> physical GPU (device: 0, name: GeForce 940M, pci bus id: 0000:01:00.0, compute capability: 5.0)
Device mapping: /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce 940M, pci bus id: 0000:01:00.0, compute capability: 5.0
2018-09-05 09:58:35.623962: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce 940M, pci bus id: 0000:01:00.0, compute capability: 5.0
<tensorflow.python.client.session.Session object at 0x7f0556917668>
I’m using Ubuntu 18.04, Nvidia Driver 396.54, and running the script under an anaconda environment with Python 3.6.6, CUDA 9.2 and tensorflow-gpu 1.10.0.
How do I solve this? Thanks, Boris
I just confirmed that this was exactly the problem. It was not about the TF version: the problem persisted across all build versions, including the latest nightly. I’m using Arch, and the latest upgrade installed Linux kernel 5.0 and the latest NVIDIA drivers (418*).
What I did was downgrade both the drivers, to nvidia 415.27-9, and the kernel/headers, to linux 4.20.11.
TF no longer hangs on tf.Session().
A list of driver versions compatible with cuda can be found in the CUDA release notes: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
Let me take another look at this, if all the issues are resolved, hopefully with a single change we can include all necessary compute capabilities.
The same problem GeForce RTX 2080, Tensorflow 1.13.0 RC, CUDA 10.0, CUDNN 7, Windows 8.1.
@steel3d Any updates/workarounds you managed to find? Running into a similar situation as you, with a Tesla T4, Ubuntu 18.04.5 LTS, on AWS. Built tensorflow from scratch for sm 7.5, which helped, but it still takes more time on the first run compared to the successive ones. If it’s any help, the process in question also involves loading some saved model protobufs.
The same issue. Did you have any effective solution? @RayerXie
@gunan Kindly asking for an update since #19198 is resolved. Is recompiling TF the official workaround? We are having the problem with a Tesla K80.
I have the same issue: a timeout of exactly 2 minutes before computation starts. Is it perhaps related to “Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA” ? I’m using
host: ubuntu 18.04 container: tensorflow/tensorflow:latest-gpu
the script I used to install nvidia-docker after a fresh installation of ubuntu 18.04:
@AnishKumarNayak I’m now on 2.6.0 - the issue is gone but I don’t know if it was the Tensorflow version or something else.
Whether the cache can be made persistent is, I think, more a question for NVIDIA. To make it part of your AMI, you can rebuild TF with all the compute capabilities that your AMI will potentially need. AWS P2 uses K80 GPUs, which need compute capability 3.7. You may need other compute capabilities based on other GPUs available to you.
I am closing this issue, as this is all known and documented. To sum up: if you see this problem, it means your GPU has a CUDA compute capability that the TF binary you are using does not have packaged in. To work around the problem, first check which compute capability your GPU needs at https://developer.nvidia.com/cuda-gpus and then rebuild TF from sources with that compute capability enabled (which you select during configure).
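As a rough illustration of the summary above: CUDA binary code (cubin) built for capability X.y runs natively on GPUs of capability X.z with z >= y; anything else has to be JIT-compiled from PTX at startup, if PTX was included at all. The sketch below applies that rule. `has_native_code` is a hypothetical helper, and the capability lists are invented examples, not the contents of any actual TensorFlow wheel.

```python
def parse_capability(text):
    """Turn a string such as '6.1' into the tuple (6, 1)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def has_native_code(device_capability, built_capabilities):
    """True if a binary built for `built_capabilities` (comma-separated, in the
    style of TF_CUDA_COMPUTE_CAPABILITIES) contains code that runs on the
    device without JIT compilation.

    Rule applied: a cubin for X.y runs on devices X.z with z >= y.
    """
    dev_major, dev_minor = parse_capability(device_capability)
    for item in built_capabilities.split(","):
        if not item.strip():
            continue
        major, minor = parse_capability(item)
        if major == dev_major and minor <= dev_minor:
            return True
    return False


# GTX 1080 (capability 6.1) against two hypothetical build configurations.
print(has_native_code("6.1", "3.5,5.2"))  # no 6.x code: JIT from PTX would be needed
print(has_native_code("6.1", "3.5,6.0"))  # 6.0 code runs natively on 6.1
```

The first call returns False and the second True. A False result is the case where a long first-run delay from JIT compilation would be expected.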
Sorry about that, I didn’t know about that nvprune restriction.
Could you please try to change these lines to:
nvccopts += r'-gencode=arch=compute_%s,\"code=sm_%s" ' % (capability, capability)
And try again with just TF_CUDA_COMPUTE_CAPABILITIES=7.0 (i.e. without 7.5)? You will need to use nvcc as the compiler for this to work (you can check that .bazelrc.user specifies --config=cuda and not --config=cuda_clang).
Thanks for your help.
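For reference, here is a sketch of how a capability list like the one in TF_CUDA_COMPUTE_CAPABILITIES maps to nvcc `-gencode` options. This is a standalone illustration and not TensorFlow's actual build script; `gencode_flags` and its `include_ptx` parameter are hypothetical names.

```python
def gencode_flags(capabilities, include_ptx=False):
    """Build nvcc -gencode options from a string such as '7.0' or '6.1,7.0'.

    Each capability gets an sm_XY target (native binary code). With
    include_ptx=True a compute_XY target is added as well, which embeds PTX
    that newer GPUs can JIT-compile.
    """
    flags = []
    for item in capabilities.split(","):
        item = item.strip()
        if not item:
            continue
        number = item.replace(".", "")  # '7.0' -> '70'
        flags.append("-gencode=arch=compute_%s,code=sm_%s" % (number, number))
        if include_ptx:
            flags.append("-gencode=arch=compute_%s,code=compute_%s" % (number, number))
    return flags


print(" ".join(gencode_flags("7.0")))
print(" ".join(gencode_flags("6.1,7.0", include_ptx=True)))
```

With only `7.0` and no PTX, the result is a single `sm_70` target, which is the reduced configuration the suggestion above asks to try.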
This should be fixed with 1.13.1
Having the exact same problem with TF 1.13.1 built from sources, which was working perfectly before. The only thing I changed was the NVIDIA drivers, from nvidia-415 to 418. Could it have something to do with this?
RTX 2070, CUDA 9.0, cuDNN 7.0, tensorflow-gpu 1.50: I ran into the same problem. Running it in a CMD window takes 1 minute.
I added the code below (see https://docs.google.com/presentation/d/1iO_bBL_5REuDQ7RJ2F35vH2BxAiGMocLC6t_N-6eXaE/edit#slide=id.g1df700e686_0_13), and this phenomenon seems to have disappeared. I do not know the reason.
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
The JIT cache does seem to hang around for the entire Python process scope.
For those going in through Jupyter: on reset you hit the delay again, but subsequent partial re-runs reuse the existing JIT results. Just don’t reset your notebook after init and it is pretty fast.
The issue is because windows build uses MSVC, and other builds use clang or gcc. The eigen bug surfaces with nvcc+msvc. You can build from sources and set compute capability, but you will run into #19198.
Why does it happen on my windows install but not my linux? (Same hardware)
EDIT: nevermind, because windows is built via cmake, and linux via bazel.
Is it possible to set our specific compute capability while building from source? Would that solve the problem?
A-ha, I think I may have an idea. In our bazel builds, we have all the cuda compute capabilities built into the binaries we distribute. However, it is possible we are not doing that with cmake! I will take another look.