tensorflow: [TF2.0-nightly] GRU/LSTM layers don't use cuDNN properly
System information
- Have I written custom code: Yes
- OS Platform and Distribution:
  - Ubuntu 16.04 + Docker 18.09.6-ce
  - Arch Linux 5.1.5
- TensorFlow installed from: pip install tf-nightly-gpu-2.0-preview
- TensorFlow version: 2.0.0-dev20190606, but every nightly since 2.0.0-dev20190319 exhibits the same behaviour.
- Python version: 3.6.8
- CUDA/cuDNN version: CUDA V10.0.130 / cuDNN 7.5.0.56
- GPU model and memory:
  - Nvidia GTX 980Ti (6GB)
  - Nvidia GTX 1070 (8GB)
Describe the current behavior
GRU/LSTM layers don’t use the cuDNN implementation properly, resulting in much worse performance. Let’s take this toy network as an example:
# Imports
import numpy as np
import tensorflow as tf
tf.executing_eagerly()  # return value unused here; eager execution is on by default in TF2
print('TensorFlow version: ' + str(tf.__version__))
# Print checks
from tensorflow.python.eager import context
print('Executing eagerly? : ' + str(context.executing_eagerly()))
print('Number of GPUs: ' + str(context.num_gpus()))
# Generate random data
X = np.random.rand(6720,700,3)
y = X[:,1,1]
print('Shapes: ', X.shape, y.shape)
# Define toy network
input_shape = X.shape[2]
rnn_state_size = 1
timesteps = X.shape[1]
inputs = tf.keras.layers.Input(shape=[timesteps, input_shape], dtype=np.float32)
output = tf.keras.layers.LSTM(rnn_state_size)(inputs)
model = tf.keras.Model(inputs, output)
model.compile('rmsprop', 'mse')
print(model.summary())  # model.summary() prints the table itself and returns None (hence the "None" in the output below)
# Fit
model.fit(X,y)
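As a side note, the LSTM layer above uses its default arguments, which (per the tf.keras.layers.LSTM documentation) are the conditions required to dispatch to the fused cuDNN kernel; a minimal, illustrative sanity check, not part of the original repro:
# Illustrative check (assumption: the documented cuDNN requirements are the layer defaults used above)
lstm_layer = model.layers[-1]
assert lstm_layer.activation.__name__ == 'tanh'
assert lstm_layer.recurrent_activation.__name__ == 'sigmoid'
assert lstm_layer.recurrent_dropout == 0
assert lstm_layer.unroll is False
assert lstm_layer.use_bias is True
print('LSTM config is compatible with the cuDNN kernel')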
With the latest nightly, this is what we obtain:
TensorFlow version: 2.0.0-dev20190606
Executing eagerly? : True
2019-06-06 12:52:23.635654: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-06-06 12:52:23.660930: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1658] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.7845
pciBusID: 0000:42:00.0
2019-06-06 12:52:23.661142: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-06 12:52:23.661983: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-06 12:52:23.662749: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-06-06 12:52:23.662937: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-06-06 12:52:23.663896: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-06-06 12:52:23.664621: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-06-06 12:52:23.667023: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-06 12:52:23.667936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1781] Adding visible gpu devices: 0
2019-06-06 12:52:23.668222: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-06-06 12:52:23.756255: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b18abf73d0 executing computations on platform CUDA. Devices:
2019-06-06 12:52:23.756289: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 1070, Compute Capability 6.1
2019-06-06 12:52:23.758641: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3494060000 Hz
2019-06-06 12:52:23.759820: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b18aff2990 executing computations on platform Host. Devices:
2019-06-06 12:52:23.759845: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-06-06 12:52:23.760484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1658] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.7845
pciBusID: 0000:42:00.0
2019-06-06 12:52:23.760515: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-06 12:52:23.760527: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-06 12:52:23.760537: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-06-06 12:52:23.760547: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-06-06 12:52:23.760557: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-06-06 12:52:23.760567: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-06-06 12:52:23.760577: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-06 12:52:23.761521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1781] Adding visible gpu devices: 0
2019-06-06 12:52:23.761549: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-06 12:52:23.762256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-06 12:52:23.762272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1205] 0
2019-06-06 12:52:23.762280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1218] 0: N
2019-06-06 12:52:23.763253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6407 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:42:00.0, compute capability: 6.1)
Number of GPUs: 1
Shapes: (6720, 700, 3) (6720,)
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 700, 3)] 0
_________________________________________________________________
lstm (LSTM) (None, 1) 20
=================================================================
Total params: 20
Trainable params: 20
Non-trainable params: 0
_________________________________________________________________
None
Train on 6720 samples
2019-06-06 12:52:26.219667: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
6720/6720 [==============================] - 114s 17ms/sample - loss: 0.1441
This is much slower than what we obtained with version 2.0.0-dev20190319 and earlier (including version 2.0-alpha):
TensorFlow version: 2.0.0-dev20190319
Executing eagerly? : True
2019-06-06 13:23:14.360714: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-06-06 13:23:14.379231: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-06-06 13:23:14.500580: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558a01550ac0 executing computations on platform CUDA. Devices:
2019-06-06 13:23:14.500637: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 1070, Compute Capability 6.1
2019-06-06 13:23:14.525050: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3494060000 Hz
2019-06-06 13:23:14.526497: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558a01662bb0 executing computations on platform Host. Devices:
2019-06-06 13:23:14.526541: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-06-06 13:23:14.526816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1551] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.7845
pciBusID: 0000:42:00.0
totalMemory: 7.92GiB freeMemory: 6.59GiB
2019-06-06 13:23:14.526860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1674] Adding visible gpu devices: 0
2019-06-06 13:23:14.526931: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-06 13:23:14.527880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1082] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-06 13:23:14.527903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1088] 0
2019-06-06 13:23:14.527925: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1101] 0: N
2019-06-06 13:23:14.528098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1222] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6407 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:42:00.0, compute capability: 6.1)
Number of GPUs: 1
Shapes: (6720, 700, 3) (6720,)
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 700, 3)] 0
_________________________________________________________________
lstm (LSTM) (None, 1) 20
=================================================================
Total params: 20
Trainable params: 20
Non-trainable params: 0
_________________________________________________________________
None
2019-06-06 13:23:16.864613: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
6720/6720 [==============================] - 6s 884us/sample - loss: 0.1065
Other info / logs
I have tried on different computers and am able to reproduce the issue. With the modifications from this pull request, I obtain the same performance in the latest nightly as in 2.0.0-dev20190319, with the added advantage of being able to use cuDNN with masking, which was added by @qlzh727 in this commit.
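For reference, a minimal sketch of the masking case mentioned above (illustrative only; it assumes the mask-aware path accepts right-padded sequences fed through a Masking layer, and it reuses timesteps, input_shape and rnn_state_size from the repro):
# Illustrative sketch, reusing variables from the toy network above. Assumption:
# with the referenced changes, right-padded masked inputs can still use cuDNN.
masked_inputs = tf.keras.layers.Input(shape=[timesteps, input_shape], dtype=np.float32)
masked = tf.keras.layers.Masking(mask_value=0.0)(masked_inputs)
masked_output = tf.keras.layers.LSTM(rnn_state_size)(masked)
masked_model = tf.keras.Model(masked_inputs, masked_output)
masked_model.compile('rmsprop', 'mse')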
I am willing to contribute to solving this issue in a better way if you would like me to. Thanks!
About this issue
- State: closed
- Created 5 years ago
- Comments: 18 (9 by maintainers)
Commits related to this issue
- Partially fix the function inlining and performance regression for LSTM/GRU. 1. Force the defun graph to not inline, so that grappler can properly do the rewrite. This will fix the codelab performanc... — committed to tensorflow/tensorflow by qlzh727 5 years ago
- Partially fix the function inlining and performance regression for LSTM/GRU. 1. Force the defun graph to not inline, so that grappler can properly do the rewrite. This will fix the codelab performanc... — committed to sleighsoft/tensorflow by qlzh727 5 years ago
@mr-ubik, the beta release didn’t contain my latest fix yet, sorry for the breakage. We are cherry-picking my fix into beta1, which will probably be released within this week. If you need to access the latest fix now, you can use tf-nightly-gpu-2.0-preview.
Hi @dbuades, the change is currently blocked by another issue with tf.cond device placement. It is being actively worked on, and we should address it before the formal release.
Thanks for the fix! It is not yet included in today’s nightly, so I modified recurrent_v2.py manually to test it. The network is learning correctly and now takes just 4 seconds per epoch. However, I’m getting this error:
W tensorflow/core/grappler/optimizers/implementation_selector.cc:196] Skipping optimization due to error while loading function libraries: Invalid argument: Functions '__inference___backward_cudnn_lstm_357_535' and '__inference___backward_cudnn_lstm_357_535_specialized_for_RMSprop_gradients_lstm_StatefulPartitionedCall_grad_StatefulPartitionedCall_at___inference_keras_scratch_graph_1515' both implement 'lstm_54689970-be31-4336-9a58-a64ddb74d552' but their signatures do not match.
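For completeness, a minimal sketch of how the per-epoch time can be measured with a callback instead of reading it off the Keras progress bar (illustrative only, reusing the toy model above):
import time

# Illustrative callback: prints wall-clock time per epoch.
class EpochTimer(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.perf_counter()
    def on_epoch_end(self, epoch, logs=None):
        print('Epoch %d took %.1fs' % (epoch, time.perf_counter() - self._start))

model.fit(X, y, callbacks=[EpochTimer()])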
Thanks for reporting the issue. Let me check the details and see whether the kernel actually lands on the GPU or not. Will reply when I have more findings.
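A minimal sketch of how this can be checked from the user side, assuming the fused kernel shows up in the op placement log when the grappler rewrite succeeds:
# Illustrative sketch: log op device placement, then look for a CudnnRNN-style
# op on /device:GPU:0 in the output (assumption: the fused op is visible there).
tf.debugging.set_log_device_placement(True)
model.fit(X, y)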
I got the same problem today; what should I do with Keras?
I think I found the issue, submitting the fix now.