tensorflow: tf.signal.fft2d speed is slow and unstable in RTX2080Ti
System information
- OS Platform: Linux Ubuntu 18.04 (server)
- TensorFlow installed from: docker tensorflow/tensorflow 1.14.0-gpu-py3-jupyter
- TensorFlow version: 1.14.0
- Python version: 3.6.8
- CUDA: v10.0
- GPU model: RTX 2080Ti
Describe the current behavior
The speed of the tf.signal.fft2d operation is slow and very unstable across iterations. Here is an example of the time printed every 100 iterations (code is shown below):
2019-09-04 19:03:57.731947
2019-09-04 19:04:33.715335
2019-09-04 19:05:10.976109
2019-09-04 19:05:44.012072
2019-09-04 19:06:15.616308
2019-09-04 19:07:14.961716
2019-09-04 19:08:12.324199
2019-09-04 19:09:11.560423
2019-09-04 19:10:08.877960
2019-09-04 19:11:08.102977
During training, no other programs are running.
But when I run the same code on my local machine (GTX 1080Ti) with the same TensorFlow Docker image, the speed is fast and stable:
2019-09-04 14:12:00.387114
2019-09-04 14:12:07.363174
2019-09-04 14:12:14.355784
2019-09-04 14:12:21.384377
2019-09-04 14:12:28.378524
Describe the expected behavior
The speed should always be fast (about 7 s per 100 iterations).
Code to reproduce the issue
import os
import datetime

import numpy as np
import tensorflow as tf

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"


def main(_):
    dp_train = tf.placeholder(tf.float32, shape=(41, 300, 300, 1))
    dp = tf.complex(dp_train, 0.0)
    dp_fft = tf.abs(tf.signal.fft2d(dp))
    # Not important, just to make the training process run.
    W = tf.get_variable('W_conv', [1, 1, 1, 1])
    cost_train = tf.reduce_mean(tf.nn.conv2d(dp_fft, W, strides=[1, 1, 1, 1], padding='SAME'))

    opt = tf.train.AdamOptimizer(1e-4)
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = opt.minimize(cost_train)  # var_list=vars_digital

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        for i in range(0, 5000):
            data = np.random.rand(41, 300, 300, 1)
            sess.run(train_op, feed_dict={dp_train: data})
            if i % 100 == 0:
                old_time = datetime.datetime.now()
                print(old_time)
        coord.request_stop()
        coord.join(threads)


if __name__ == '__main__':
    tf.app.run()
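The script above only prints raw timestamps; the "7 s per 100 iterations" figure comes from differencing consecutive prints. As a minimal, hypothetical sketch (not part of the original report), the elapsed time per block could be printed directly like this; timed_loop and step_fn are illustrative names:

import time


def timed_loop(step_fn, iterations=5000, report_every=100):
    """Run step_fn repeatedly and print the wall-clock time per report block.

    step_fn is a stand-in for one training step, e.g. a lambda wrapping
    sess.run(train_op, feed_dict=...) from the script above.
    """
    last = time.perf_counter()
    for i in range(1, iterations + 1):
        step_fn()
        if i % report_every == 0:
            now = time.perf_counter()
            print('iterations %d-%d: %.2f s' % (i - report_every, i, now - last))
            last = now


# Example with a no-op step; in the script above you would pass
# lambda: sess.run(train_op, feed_dict={dp_train: data}).
timed_loop(lambda: None, iterations=300)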
@YichengWu we are discussing this internally and don’t really know what could be causing it. We may have seen similar issues on V100/P100 GPUs – in those cases, profiling suggests all the time is spent within the CUDA API (in the request to launch the kernel).
One hypothesis is that due to an architecture mismatch between the bundled kernels and your GPU, CUDA may be JIT’ing a specialized kernel on the fly. There is a cache for JIT’ed kernels, and we are wondering if something is causing CUDA to need to regularly re-compile the kernels needed for your program.
There are some configuration options for the JIT and cache here: https://devblogs.nvidia.com/cuda-pro-tip-understand-fat-binaries-jit-caching/
In particular, I’m wondering what happens when you set CUDA_CACHE_DISABLE=1 and CUDA_FORCE_PTX_JIT=1. If that reproduces the slow behavior, then we can probably conclude that something is causing your CUDA kernel cache to churn.
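A hedged sketch of one way to apply these settings: the variables must be in the environment before TensorFlow initializes CUDA, so they could be exported in the shell before launching the container, or set at the very top of the script before importing TensorFlow. Only the variable names come from the comment above; the placement is illustrative.

import os

# Set before TensorFlow creates a CUDA context (i.e. before tf.Session()).
os.environ["CUDA_CACHE_DISABLE"] = "1"   # do not read from or write to the JIT compute cache
os.environ["CUDA_FORCE_PTX_JIT"] = "1"   # ignore embedded binaries and JIT-compile from PTX

import tensorflow as tf  # imported only after the environment is configured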