tensorflow: tf.signal.fft2d speed is slow and unstable in RTX2080Ti
System information
- OS Platform: Linux Ubuntu 18.04 (server)
- TensorFlow installed from: docker tensorflow/tensorflow 1.14.0-gpu-py3-jupyter
- TensorFlow version: 1.14.0
- Python version: 3.6.8
- CUDA: v10.0
- GPU model: RTX 2080Ti
Describe the current behavior
The speed of the tf.signal.fft2d operation is slow and very unstable across iterations. Here is an example of the time printed every 100 iterations (code is shown below):
2019-09-04 19:03:57.731947
2019-09-04 19:04:33.715335
2019-09-04 19:05:10.976109
2019-09-04 19:05:44.012072
2019-09-04 19:06:15.616308
2019-09-04 19:07:14.961716
2019-09-04 19:08:12.324199
2019-09-04 19:09:11.560423
2019-09-04 19:10:08.877960
2019-09-04 19:11:08.102977
During training, no other programs are running.
But when I run the same code on my local machine (GTX 1080Ti) with the same TensorFlow Docker image, the speed is fast and stable:
2019-09-04 14:12:00.387114
2019-09-04 14:12:07.363174
2019-09-04 14:12:14.355784
2019-09-04 14:12:21.384377
2019-09-04 14:12:28.378524
Describe the expected behavior
The speed should always be fast (about 7 s per 100 iterations).
Code to reproduce the issue
import os
import datetime

import numpy as np
import tensorflow as tf

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"


def main(_):
    dp_train = tf.placeholder(tf.float32, shape=(41, 300, 300, 1))
    dp = tf.complex(dp_train, 0.0)
    dp_fft = tf.abs(tf.signal.fft2d(dp))
    # Not important, just to make the training process run.
    W = tf.get_variable('W_conv', [1, 1, 1, 1])
    cost_train = tf.reduce_mean(tf.nn.conv2d(dp_fft, W, strides=[1, 1, 1, 1], padding='SAME'))

    opt = tf.train.AdamOptimizer(1e-4)
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = opt.minimize(cost_train)  # var_list=vars_digital

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        for i in range(0, 5000):
            data = np.random.rand(41, 300, 300, 1)
            sess.run(train_op, feed_dict={dp_train: data})
            if i % 100 == 0:
                old_time = datetime.datetime.now()
                print(old_time)
        coord.request_stop()
        coord.join(threads)


if __name__ == '__main__':
    tf.app.run()
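The script above only prints raw timestamps; the "7 s per 100 iterations" figure comes from differencing consecutive prints. As a minimal, hypothetical sketch (not part of the original report), the elapsed time per block could be printed directly like this; timed_loop and step_fn are illustrative names:

import time


def timed_loop(step_fn, iterations=5000, report_every=100):
    """Run step_fn repeatedly and print the wall-clock time per report block.

    step_fn is a stand-in for one training step, e.g. a lambda wrapping
    sess.run(train_op, feed_dict=...) from the script above.
    """
    last = time.perf_counter()
    for i in range(1, iterations + 1):
        step_fn()
        if i % report_every == 0:
            now = time.perf_counter()
            print('iterations %d-%d: %.2f s' % (i - report_every, i, now - last))
            last = now


# Example with a no-op step; in the script above you would pass
# lambda: sess.run(train_op, feed_dict={dp_train: data}).
timed_loop(lambda: None, iterations=300)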
@YichengWu we are discussing this internally and don’t really know what could be causing it. We may have seen similar issues on V100/P100 GPUs – in those cases, profiling suggests all the time is spent within the CUDA API (in the request to launch the kernel).
One hypothesis is that due to an architecture mismatch between the bundled kernels and your GPU, CUDA may be JIT’ing a specialized kernel on the fly. There is a cache for JIT’ed kernels, and we are wondering if something is causing CUDA to need to regularly re-compile the kernels needed for your program.
There are some configuration options for the JIT and cache here: https://devblogs.nvidia.com/cuda-pro-tip-understand-fat-binaries-jit-caching/
In particular, I’m wondering what happens when you set CUDA_CACHE_DISABLE=1 and CUDA_FORCE_PTX_JIT=1. If that reproduces the slow behavior, then we can probably conclude that something is causing your CUDA kernel cache to churn.
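A hedged sketch of one way to apply these settings: the variables must be in the environment before TensorFlow initializes CUDA, so they could be exported in the shell before launching the container, or set at the very top of the script before importing TensorFlow. Only the variable names come from the comment above; the placement is illustrative.

import os

# Set before TensorFlow creates a CUDA context (i.e. before tf.Session()).
os.environ["CUDA_CACHE_DISABLE"] = "1"   # do not read from or write to the JIT compute cache
os.environ["CUDA_FORCE_PTX_JIT"] = "1"   # ignore embedded binaries and JIT-compile from PTX

import tensorflow as tf  # imported only after the environment is configured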