tensorflow: Bug: tensorflow-gpu takes long time before beginning to compute

I noticed that TensorFlow always takes about 2 minutes before it actually starts to compute. I’ve been trying to find out why this happens, but nothing has worked so far.

The TensorFlow site says I should use CUDA® Toolkit 9.0 and cuDNN v7.0. I have CUDA 9.0, so I downloaded cuDNN 7.0.5 for CUDA 9.0 and pasted the files into *C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0*, overwriting the ones from cuDNN 7.1.2, which I had tested earlier. To make sure, I pip-installed tensorflow-gpu into a fresh Anaconda env. The issue is still the same.

CUDA works, since the official test example prints ‘Hello, TensorFlow!’, but before that it takes about 2 minutes every time!

When I tested this with another wheel (linked in this tutorial; I did not compile it myself) on CUDA 9.1/cuDNN 7.0.5, I had the same issue. An NVIDIA employee on Stack Overflow suggested that I may be hitting a lengthy JIT compile step, because the GTX 1080 has compute capability 6.1, which the wheel I used may not be compiled for.
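To make the JIT explanation concrete, here is a minimal sketch of how a binary's compiled targets relate to a device's compute capability. It assumes the usual CUDA compatibility rules as I understand them: precompiled GPU code built for X.y runs natively on devices X.z with z >= y, and otherwise the driver JIT-compiles embedded PTX built for an equal or lower capability. The function names and the capability lists are hypothetical; they are not the actual contents of any TensorFlow wheel.

```python
def parse_cc(text):
    """Turn a compute capability string such as "6.1" into (6, 1)."""
    major, minor = text.split(".")
    return int(major), int(minor)


def load_mode(device_cc, sass_ccs, ptx_ccs):
    """Classify how kernels would be loaded on a device.

    sass_ccs: capabilities with precompiled GPU code in the binary.
    ptx_ccs:  capabilities with embedded PTX in the binary.
    Returns "native", "jit" or "unsupported".
    """
    dev = parse_cc(device_cc)
    for cc in map(parse_cc, sass_ccs):
        # Same major version and an equal or lower minor version runs as-is.
        if cc[0] == dev[0] and cc[1] <= dev[1]:
            return "native"
    # Otherwise PTX for an equal or lower capability can be JIT-compiled.
    if any(parse_cc(cc) <= dev for cc in ptx_ccs):
        return "jit"
    return "unsupported"


# Hypothetical wheel built without any 6.x target, run on a GTX 1080 (6.1).
print(load_mode("6.1", ["3.5", "5.2"], ["5.2"]))
```

If the result is "jit", a long pause on the first GPU use would be consistent with the suggestion above, but this sketch alone cannot confirm what a given wheel contains.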

So I tried to find TensorFlow wheels built for compute capability 6.1 on Windows, but the only one I found and tested produced the same problem.

Am I doing something wrong here, or do I just have to accept the 2-minute delay every time I start my TensorFlow/Keras scripts?

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Code:

import time
start_time = time.time()
import tensorflow as tf
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))
timer = time.time()
print(timer - start_time)

Output:

(tf_clean) C:\python_code\test>C:/anaconda/envs/tf_clean/python.exe c:/python_code/test/tf_test.py
2018-04-18 14:36:04.376661: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-04-18 14:36:04.689661: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.60GiB
2018-04-18 14:36:04.699485: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-18 14:38:12.227561: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-18 14:38:12.234504: I T:\src\github\tens
2018-04-18 14:38:12.237156: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0:   N
2018-04-18 14:38:12.240997: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6379 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1
2018-04-18 14:38:12.548288: I T:\src\github\tensorflow\tensorflow\core\common_runtime\direct_session.cc:297] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-18 14:38:12.559262: I T:\src\github\tensorflow\tensorflow\core\common_runtime\placer.cc:884] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-18 14:38:12.564847: I T:\src\github\tensorflow\tensorflow\core\common_runtime\placer.cc:884] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-18 14:38:12.570545: I T:\src\github\tensorflow\tensorflow\core\common_runtime\placer.cc:884] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
 [49. 64.]]
129.14624643325806
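To see exactly where the time goes in a log like the one above, a small helper can scan the timestamps and report the largest pause between consecutive lines. This is only a sketch: the function name is my own, and the two sample lines are copied from the output above with the source paths shortened.

```python
import re
from datetime import datetime

# Matches the timestamp prefix TensorFlow puts on its C++ log lines.
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}):")


def largest_gap(log_lines):
    """Return (seconds, line_before, line_after) for the biggest pause."""
    stamped = []
    for line in log_lines:
        match = STAMP.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            stamped.append((when, line))
    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best


sample = [
    "2018-04-18 14:36:04.699485: I gpu_device.cc:1423] Adding visible gpu devices: 0",
    "2018-04-18 14:38:12.227561: I gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:",
]
gap, before, after = largest_gap(sample)
print(round(gap, 1), "seconds after:", before)
```

For the run above this points at the roughly 127.5 seconds spent right after "Adding visible gpu devices: 0", which accounts for almost all of the 129 seconds measured by the script.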

OS Platform and Distribution: Windows 10 Education (Version 10.0.16299 Build 16299) Intel® Core™ i5-7500 CPU @ 3.40GHz, 3408 MHz, 4 Cores
TensorFlow installed from (source or binary): binary
TensorFlow version: tensorflow-gpu 1.5.0, 1.7.0
Python version: 3.5.5 & 3.6 (via anaconda, conda 4.5.1.)
Bazel Version: N/A
CUDA/cuDNN version: Tested combinations: CUDA 9.0 and cuDNN 7.1.2 (tested on tensorflow 1.5.0, 1.7.0 and 1.8.0-dev20180329); CUDA 9.1 and cuDNN 7.0.5 (tested on tensorflow 1.5.0 and 1.7.0)
GPU model and memory: NVIDIA GeForce GTX 1080 (GP104-400) [Hewlett-Packard], 8192 MBytes of GDDR5X SDRAM [Micron]
Exact command to reproduce: See: Have I written custom code…

================================================================= EDIT:

Threadstarter here, hello.

Could you try with the latest nightly? https://files.pythonhosted.org/packages/67/c0/e68a4f0400340b54c887703baa8eee188042c3d65a0cf535dda71abffbc2/tf_nightly_gpu-1.13.0.dev20190205-cp37-cp37m-win_amd64.whl

This works! I checked with that wheel, and then with tf-nightly-gpu-2.0-preview on PyPI, which also worked. I initially wanted to use the Anaconda cudatoolkit and cudnn packages, but cudnn is currently only available up to version 7.3.1 on Anaconda Cloud. TensorFlow 2.0, however, is compiled against 7.4.1, so I had to do this the old-school way and download the installers from NVIDIA. Soon, though…soon.

For everyone, here’s what I did, as a guide:

How to install Tensorflow Nightly 2.0 GPU in Anaconda on Windows 10 x64

• I installed these CUDA/cuDNN versions:
  – cuda_10.0.130_win10_network (NVIDIA CUDA download: https://developer.nvidia.com/cuda-toolkit)
  – cuDNN v7.4.1 (Nov 8, 2018), for CUDA 10.0 (NVIDIA cuDNN download: https://developer.nvidia.com/cudnn)
  – Don’t forget to check whether the CUDA setup has correctly written itself to the PATH system variable.
  – Reboot.
• Now make a new environment in Anaconda and activate it:
  – conda create --name tf2-nightly-gpu python=3.6
  – activate tf2-nightly-gpu
• Now, with the new env still activated, install the latest TensorFlow 2.0 nightly GPU build from PyPI:
  – pip install tf-nightly-gpu-2.0-preview
• For machine learning in Jupyter Notebook (or JupyterLab), you need these as well:
  – conda install nb_conda matplotlib scipy Pillow pandas scikit-learn
• Check whether your GPU is recognized by TensorFlow. Open the Anaconda prompt, activate the new environment, type python, and press Enter. Now type:
  – import tensorflow as tf
  – tf.test.is_gpu_available(cuda_only=False,min_cuda_compute_capability=None)
• Output should be something like this:

(tf2-nightly-gpu) C:\Users\___>python
>>> import tensorflow as tf
>>> tf.test.is_gpu_available(cuda_only=False,min_cuda_compute_capability=None)
2019-03-19 17:46:25.722209: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-03-19 17:46:25.729724: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library nvcuda.dll
2019-03-19 17:46:25.922934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1551] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.61GiB
2019-03-19 17:46:25.938231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1674] Adding visible gpu devices: 0
2019-03-19 17:46:26.539185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1082] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-19 17:46:26.546009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1088]      0
2019-03-19 17:46:26.550123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1101] 0:   N
2019-03-19 17:46:26.554188: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1222] Created TensorFlow device (/device:GPU:0 with 6360 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
True

• Done.
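A note for anyone following this guide on a later stable TensorFlow 2.x release: there, tf.test.is_gpu_available is marked deprecated in favor of tf.config.list_physical_devices('GPU'). The sketch below wraps that call; the helper name is my own, and the fallback to tf.config.experimental is an assumption about older 2.0 builds such as the nightly used above.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Newer releases expose the function directly under tf.config.
    lister = getattr(tf.config, "list_physical_devices", None)
    if lister is None:
        # Assumed location in early 2.0 builds.
        lister = tf.config.experimental.list_physical_devices
    return [device.name for device in lister("GPU")]


print(visible_gpus())
```

An empty list means TensorFlow imported fine but found no usable GPU, which usually points back at the CUDA/cuDNN installation or the PATH check from the first step.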

About this issue

State: closed
Created 6 years ago
Reactions: 31
Comments: 108 (32 by maintainers)

Commits related to this issue

Build more cuda compute capabilities in cmake build. Fixes #18652 PiperOrigin-RevId: 205858348 — committed to av8ramit/tensorflow by gunan 6 years ago

Most upvoted comments

My code stops at the following point:

2020-08-10 22:16:25.588882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2020-08-10 22:16:26.719130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-10 22:16:26.719170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2020-08-10 22:16:26.719178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y N N
2020-08-10 22:16:26.719181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N N N
2020-08-10 22:16:26.719185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N N N Y
2020-08-10 22:16:26.719189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: N N Y N
2020-08-10 22:16:26.719882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10409 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
2020-08-10 22:16:26.720283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10409 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2020-08-10 22:16:26.720567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10409 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
2020-08-10 22:16:26.720884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10409 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0, compute capability: 6.1)

It gets stuck for over 20 minutes. Could anybody help me solve this, or does anyone know the reason for it?

+19

EdwardTse9944 on Aug 10, 2020

Has anything been discovered yet? I have the same problem with ‘Adding visible gpu devices: 0’ taking about 2-3 minutes, even after reboot and multiple runs. I’m using CUDA 9.0 and cuDNN 7.1.2. System: Red Hat Linux. GPU: GTX 750Ti.

+11

tapioho on Jul 10, 2018

Facing the same issue with Cuda 9.0, tensorflow 1.12.0, cuDNN 7.4, windows 10, Two Nvidia RTX 2080 Tis

+10

omsrisagar on Feb 19, 2019

Same problem here: Ubuntu 16.04 / tensorflow-gpu-1.14 / CUDA 10.0 / cuDNN 7.4 / Python 3.7 / GTX 950M

But just like @steel3d, it happened ONLY on the very first run (stuck for around 3min here). After that, it becomes instant.

damoluo on Jul 9, 2019

I have the same problem using tensorflow 1.13.1, CUDA 10.0, cuDNN 7.5, Windows 10, nVidia 960m.

>>> import tensorflow as tf
>>> tf.enable_eager_execution()
>>> tf.add(1, 2)
2019-03-05 17:36:49.631611: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-03-05 17:36:49.923538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:01:00.0
totalMemory: 4.00GiB freeMemory: 3.34GiB
2019-03-05 17:36:49.930602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-05 17:40:20.836194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-05 17:40:20.840822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2019-03-05 17:40:20.843478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2019-03-05 17:40:20.855747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3050 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
<tf.Tensor: id=2, shape=(), dtype=int32, numpy=3>

GregorKovalcik on Mar 5, 2019

I’m having the same issue now. TF 1.15, Cuda 10.0, Cudnn 7, TF was custom compiled with AVX2, XLA, TRT, CC 3.5/3.7/7.0/7.5

I tried to debug it with strace and found that a futex blocks the execution thread:
18:38:00.532805 futex(0x7f852818fa78, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff) = 0 <47.742524>

Meanwhile, another thread does heavy work with a huge batch of mprotect calls:
18:38:00.534906 mprotect(0x7f84f98ac000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000030>

During this process I see two messages in the main thread:
2019-12-03 18:38:16.642624: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-12-03 18:38:26.258193: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

The futex is then released and returns the result.

My test run was like this:

Run new aws p2 instance A from ami, trying to run the code - 1 minute delay
Trying to run the code on A once more - no delay
A is rebooted, trying to run the code - no delay
Run new aws p2 instance B from ami, trying to run the code - 1 minute delay

Seems like something is calculated once, then cached. The cache is preserved upon system restart, but is missing on the first run. It’s also not part of my AMI.

I also tried to rerun the test with official TF 1.15 wheel, but faced the same problem.

Could somebody clarify several things?

What is cached?
Where?
Is there any workaround to make this cache part of my ami?

My ~/.nv/ComputeCache is empty.
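For anyone else checking the same thing, here is a small sketch that locates the CUDA driver's JIT cache directory and measures its size, so you can compare it before and after a slow first run. It assumes the documented defaults (~/.nv/ComputeCache on Linux, %APPDATA%\NVIDIA\ComputeCache on Windows) and the CUDA_CACHE_PATH override; the function names are my own.

```python
import os
from pathlib import Path


def cuda_cache_dir(env=None, windows=None):
    """Guess where the CUDA driver keeps its JIT compilation cache."""
    env = os.environ if env is None else env
    if windows is None:
        windows = os.name == "nt"
    # An explicit CUDA_CACHE_PATH takes priority over the defaults.
    if env.get("CUDA_CACHE_PATH"):
        return Path(env["CUDA_CACHE_PATH"])
    if windows:
        return Path(env.get("APPDATA", "")) / "NVIDIA" / "ComputeCache"
    home = env.get("HOME") or str(Path.home())
    return Path(home) / ".nv" / "ComputeCache"


def cache_size_bytes(directory):
    """Total size of all files below directory, or 0 if it does not exist."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    return sum(p.stat().st_size for p in directory.rglob("*") if p.is_file())


location = cuda_cache_dir()
print(location, cache_size_bytes(location))
```

If the size stays at 0 after a slow first run, then either the cache lives somewhere else (for example when the process runs as a different user), caching is disabled via CUDA_CACHE_DISABLE, or the delay comes from something other than PTX JIT compilation. I can't tell which from this data alone.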

Maybe it’s related to https://github.com/keras-team/keras/issues/11126, not sure.

Thanks in advance!

AndreyOrb on Dec 3, 2019

similar issue, adding device takes a few minutes. GPU 840M, python 3.6, CUDA 9.0, CUDNN 7.4.2, tensorflow 1.12.0

maxiwu on Jan 3, 2019

I have the exact same problem. It takes around 5 minutes at: Adding visible gpu devices: 0

My environment is Win10, tensorflow-gpu-2.0-beta1, CUDA 10.0, cuDNN 7.6, Python 3.6, with a GTX 850M.

When will the problem be fixed?

LittleFatHero on Jun 30, 2019

os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

This works for me as well.

prabindh on Mar 31, 2019

Hi all, I’m having the same problem: a waiting time of about 2 minutes before running what I actually want to run. The text below is what I get and what I see for two minutes:

Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.Session(config=tf.ConfigProto(log_device_placement=True))
2018-09-05 09:54:50.130623: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-09-05 09:54:50.374925: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-09-05 09:54:50.375571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce 940M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:01:00.0
totalMemory: 1.96GiB freeMemory: 1.93GiB
2018-09-05 09:54:50.375588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0

After the waiting time is finally over I get the rest of the execution:

2018-09-05 09:58:35.611421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-05 09:58:35.611455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2018-09-05 09:58:35.611462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2018-09-05 09:58:35.611629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1687 MB memory) -> physical GPU (device: 0, name: GeForce 940M, pci bus id: 0000:01:00.0, compute capability: 5.0)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce 940M, pci bus id: 0000:01:00.0, compute capability: 5.0
2018-09-05 09:58:35.623962: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce 940M, pci bus id: 0000:01:00.0, compute capability: 5.0

<tensorflow.python.client.session.Session object at 0x7f0556917668>

I’m using Ubuntu 18.04, NVIDIA driver 396.54, and running the script in an Anaconda environment with Python 3.6.6, CUDA 9.2 and tensorflow-gpu 1.10.0.

How do I solve this? Thanks, Boris

apolo74 on Sep 5, 2018

Having the exact same problem with TF 1.13.1 built from sources, which was working perfectly before. The only thing I changed was the NVIDIA drivers, from nvidia-415 to 418. Could it have something to do with this?

I just confirmed that this was exactly the problem, it was not about the TF version, the problem persisted across all build versions including the latest nightly. I’m using arch, so the latest upgrade installed the linux kernel 5.0 and the latest nvidia drivers 418*.

What I did was downgrade both the drivers to nvidia 415.27-9 and the kernel/headers to linux 4.20.11

TF no longer hangs on tf.Session()

A list of driver versions compatible with cuda can be found in the CUDA release notes: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
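To compare your installed driver against that list, the version can be read from nvidia-smi. The sketch below assumes nvidia-smi is on the PATH and uses its --query-gpu CSV output; the function names are my own, and the sample row in the last line is made up for illustration.

```python
import shutil
import subprocess


def parse_smi(csv_text):
    """Parse 'driver_version, name' rows as printed by nvidia-smi."""
    rows = []
    for line in csv_text.strip().splitlines():
        driver, name = [field.strip() for field in line.split(",", 1)]
        rows.append({"driver": driver, "name": name})
    return rows


def installed_drivers():
    """Return one entry per GPU, or None when nvidia-smi is not available."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_smi(result.stdout)


# Hypothetical sample row, only to show the parsed shape.
print(parse_smi("418.43, GeForce GTX 1080\n"))
```

Calling installed_drivers() on the affected machine gives the real values to look up in the release notes.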

davidenunes on Mar 13, 2019

Let me take another look at this, if all the issues are resolved, hopefully with a single change we can include all necessary compute capabilities.

gunan on Feb 1, 2019

The same problem GeForce RTX 2080, Tensorflow 1.13.0 RC, CUDA 10.0, CUDNN 7, Windows 8.1.

BackT0TheFuture on Feb 1, 2019

@steel3d Any updates or workarounds you managed to find? I’m running into a similar situation, with a Tesla T4 on Ubuntu 18.04.5 LTS on AWS. I built TensorFlow from scratch for sm 7.5, which helped, but the first run still takes more time than the successive ones. If it’s any help, the process showing the time difference also loads some saved model protobufs.

RohanGautam on Feb 3, 2021

My code stops at the following point:

2020-08-10 22:16:25.588882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2020-08-10 22:16:26.719130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-10 22:16:26.719170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2020-08-10 22:16:26.719178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y N N
2020-08-10 22:16:26.719181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N N N
2020-08-10 22:16:26.719185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N N N Y
2020-08-10 22:16:26.719189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: N N Y N
2020-08-10 22:16:26.719882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10409 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
2020-08-10 22:16:26.720283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10409 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2020-08-10 22:16:26.720567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10409 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
2020-08-10 22:16:26.720884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10409 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0, compute capability: 6.1)

It gets stuck for over 20 minutes. Could anybody help me solve this, or does anyone know the reason for it?

The same issue. Did you have any effective solution? @RayerXie

mrluin on Oct 1, 2020

@gunan Kindly asking for an update, since #19198 is resolved. Is recompiling TF the official workaround? We are having the problem with a Tesla K80.

mordka on Jan 30, 2019

I have the same issue: a timeout of exactly 2 minutes before computation starts. Is it perhaps related to “Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA”? I’m using:
host: Ubuntu 18.04
container: tensorflow/tensorflow:latest-gpu

root@76611d5f5dd1:/notebooks# python /usr/local/bin/validate_installation.py 
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
2018-06-18 10:15:12.462431: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-06-18 10:15:12.672108: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-06-18 10:15:12.672988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.189
pciBusID: 0000:02:00.0
totalMemory: 1.96GiB freeMemory: 1.93GiB
2018-06-18 10:15:12.673024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-06-18 10:17:11.465729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-18 10:17:11.465769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-06-18 10:17:11.465778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-06-18 10:17:11.466050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1695 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:02:00.0, compute capability: 5.0)
Hello, TensorFlow!

the script I used to install nvidia-docker after a fresh installation of ubuntu 18.04:

# Install packages to allow apt to use a repository over HTTPS:
sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

sudo apt-get update

# docker-ce not yet ready -> docker.io
#
sudo apt -y install docker.io


# Appending to a root-owned file needs sudo on the write, hence tee -a
echo blacklist nouveau | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf

sudo update-initramfs -u
# reboot


sudo apt-get install dkms build-essential make

sudo dpkg --add-architecture i386
sudo apt update
sudo apt -y install libc6:i386

sudo bash NVIDIA-Linux-x86_64-390.67.run --dkms --install-libglvnd

# https://nvidia.github.io/nvidia-docker/

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

sudo apt -y install nvidia-docker2

sudo apt -y install nvidia-utils

ludwigprager on Jun 18, 2018

@AnishKumarNayak I’m now on 2.6.0 - the issue is gone but I don’t know if it was the Tensorflow version or something else.

stefan-falk on Feb 24, 2022

I think whether the cache can be made persistent is more of an NVIDIA question. To make it a part of your AMI, you can rebuild TF with all the compute capabilities that your AMI will potentially need. AWS P2 uses K80 GPUs, which need compute capability 3.7. You may need other compute capabilities based on other GPUs available to you.

I am closing this issue, as this is all known and documented. To sum up: if you see this behavior, it means your GPU has a CUDA compute capability that the TF binary you are using was not built for. To work around the problem, first check which compute capability your GPU needs here: https://developer.nvidia.com/cuda-gpus. Then rebuild TF from sources with that compute capability enabled (you select it during configure).
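As a rough illustration of the first step, the sketch below turns a list of GPU models into the comma-separated capability list that the TF_CUDA_COMPUTE_CAPABILITIES setting mentioned elsewhere in this thread expects. The lookup table only contains the models and capabilities reported in this thread; the table and function names are my own, and for any other GPU the NVIDIA page above is the source to check.

```python
# Compute capabilities as reported by users and logs in this thread.
KNOWN_CC = {
    "GeForce GTX 1080": "6.1",
    "GeForce GTX 1080 Ti": "6.1",
    "GeForce GTX 960M": "5.0",
    "GeForce 940M": "5.0",
    "GeForce 940MX": "5.0",
    "Tesla K80": "3.7",
    "Tesla T4": "7.5",
}


def capabilities_for(gpu_names):
    """Build a sorted, de-duplicated capability list for the given GPUs."""
    ccs = {KNOWN_CC[name] for name in gpu_names}
    ordered = sorted(ccs, key=lambda cc: tuple(int(part) for part in cc.split(".")))
    return ",".join(ordered)


print(capabilities_for(["Tesla K80", "GeForce GTX 1080", "Tesla T4"]))
# → 3.7,6.1,7.5
```

Each extra capability increases build time and binary size, so it is probably worth listing only the GPUs you actually deploy to.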

gunan on Apr 21, 2020

Sorry about that, I didn’t know about that nvprune restriction.

Could you please try to change these lines to:

nvccopts += r'-gencode=arch=compute_%s,\"code=sm_%s" ' % (capability, capability)

And try again with just TF_CUDA_COMPUTE_CAPABILITIES=7.0 (i.e. without 7.5)?

You will need to use nvcc for the compiler for this to work (you can check that .bazelrc.user specifies --config=cuda and not --config=cuda_clang).
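For readers unfamiliar with the -gencode syntax in that line, here is a small sketch of how such flags are formed from capability strings. This is my own illustration of the nvcc flag format, not the code of TensorFlow's compiler wrapper; the function name and the include_ptx switch are hypothetical.

```python
def gencode_flags(capabilities, include_ptx=False):
    """Build nvcc -gencode flags for capabilities such as "7.0"."""
    flags = []
    for cap in capabilities:
        arch = cap.replace(".", "")  # "7.0" -> "70"
        # code=sm_XX embeds precompiled GPU code for that architecture.
        flags.append("-gencode=arch=compute_%s,code=sm_%s" % (arch, arch))
        if include_ptx:
            # code=compute_XX embeds PTX that the driver can JIT-compile later.
            flags.append("-gencode=arch=compute_%s,code=compute_%s" % (arch, arch))
    return flags


print(" ".join(gencode_flags(["7.0"])))
# → -gencode=arch=compute_70,code=sm_70
```

With only the sm_ variant, no PTX is embedded for that capability, which is the effect the proposed change above appears to aim for.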

Thanks for your help.

chsigg on Dec 2, 2019

This should be fixed with 1.13.1

gunan on May 10, 2019

Having the exact same problem with TF 1.13.1 built from sources, which was working perfectly before. The only thing I changed was the NVIDIA drivers, from nvidia-415 to 418. Could it have something to do with this?

davidenunes on Mar 9, 2019

RTX 2070, CUDA 9.0, cuDNN 7.0, tensorflow-gpu 1.50: I ran into the same problem. Running in a CMD window takes 1 minute.

pcprinciple on Dec 18, 2018

I added the code below (see https://docs.google.com/presentation/d/1iO_bBL_5REuDQ7RJ2F35vH2BxAiGMocLC6t_N-6eXaE/edit#slide=id.g1df700e686_0_13) and the phenomenon seems to have disappeared. I do not know the reason.

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
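One detail that is easy to miss: the variable has to be set before TensorFlow is imported, otherwise the C++ logging has already been configured. A minimal sketch follows, with a hypothetical helper name. As far as I know the value only filters log messages (0 shows everything, 1 hides INFO, 2 also hides WARNING, 3 also hides ERROR), so why it would change startup time is not established in this thread.

```python
import os


def set_tf_log_level(level):
    """Set TF_CPP_MIN_LOG_LEVEL; call this before `import tensorflow`."""
    if level not in (0, 1, 2, 3):
        raise ValueError("level must be 0, 1, 2 or 3")
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = str(level)


set_tf_log_level(2)  # 2 hides INFO and WARNING messages
# import tensorflow as tf  # import only after the variable is set
```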

TomHJ on Dec 6, 2018

The JIT cache does seem to hang around for the entire Python process scope.

For those working in Jupyter: on reset you hit the delay again, but subsequent re-runs of individual parts reuse the existing JIT results. Just don’t reset your notebook after init and it is pretty fast.
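The pattern behind that advice is "pay the startup cost once per process, then reuse". Here is a sketch of it with a stand-in for the expensive GPU initialization; get_runtime and the 0.2 second sleep are placeholders of my own, not TensorFlow calls.

```python
import functools
import time


@functools.lru_cache(maxsize=None)
def get_runtime():
    """Stand-in for the expensive first GPU initialization (hypothetical)."""
    time.sleep(0.2)  # placeholder for the real startup cost
    return object()


def timed(fn):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


first, t_first = timed(get_runtime)
second, t_second = timed(get_runtime)
print(first is second, t_second < t_first)
```

The second call returns the cached object immediately. Restarting the kernel throws the cache away, which matches the behavior described for notebook resets.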

toddpi314 on Dec 4, 2018

The issue arises because the Windows build uses MSVC, while other builds use clang or gcc. The Eigen bug surfaces with nvcc+MSVC. You can build from sources and set the compute capability, but you will run into #19198.

gunan on Oct 11, 2018

Why does it happen on my windows install but not my linux? (Same hardware)

EDIT: never mind, it is because Windows is built via cmake, and Linux via bazel.

Is it possible to set our specific compute capability while building from source? Would that solve the problem?

dan-garvey on Oct 10, 2018

A-ha, I think I may have an idea. In our bazel builds, we have all the cuda compute capabilities built into the binaries we distribute. However, it is possible we are not doing that with cmake! I will take another look.

gunan on Jul 24, 2018