tensorflow: CuDNN LSTM failed with large batch size
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Pro x64
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: -
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.2, 2.1
- Python version: 3.7.6
- Bazel version (if compiling from source): -
- GCC/Compiler version (if compiling from source): -
- CUDA/cuDNN version: 10.2, 10.1
- GPU model and memory: 1 x 2080Ti 11GB, 2 x 2080 8GB
Describe the current behavior
The sample code below fails with the following error:
```
2020-05-12 12:59:37.956635: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-12 12:59:38.245540: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-12 12:59:39.426585: E tensorflow/stream_executor/dnn.cc:613] CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(1847): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-05-12 12:59:39.446537: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 32, 32, 1, 5, 12928, 32]
2020-05-12 12:59:39.457993: I tensorflow/stream_executor/stream.cc:1990] [stream=0000022BBB859000,impl=0000022BBF43D210] did not wait for [stream=0000022BBB859180,impl=0000022BBF43D930]
2020-05-12 12:59:39.462903: I tensorflow/stream_executor/stream.cc:4938] [stream=0000022BBB859000,impl=0000022BBF43D210] did not memcpy host-to-device; source: 0000022BB566AF00
2020-05-12 12:59:39.467874: F tensorflow/core/common_runtime/gpu/gpu_util.cc:340] CPU->GPU Memcpy failed
```
It works if I disable the GPU device, set mask_zero to False, or force the non-cuDNN LSTM implementation (via tf.keras.layers.RNN(tf.keras.layers.LSTMCell(32))(x)).
I think it is somehow related to the fact that on each batch the model processes not 128 (the batch size) items, but 128 × 101 items.
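A quick sanity check of that arithmetic: the cuDNN error above reports batch_size 12928, which is exactly 128 × 101:

```python
# Each dataset batch holds 128 rows of 101 indices; tf.gather followed by
# tf.reshape(cmp_words, (-1, 5)) flattens them into 128 * 101 sequences of
# length 5, so the LSTM sees an effective batch of 12928 -- matching the
# batch_size field in the ThenRnnForward error log.
rows_per_batch = 128
indices_per_row = 101
effective_batch = rows_per_batch * indices_per_row
print(effective_batch)  # 12928
```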
Describe the expected behavior
The code should run without this error.
Standalone code to reproduce the issue
```python
import tensorflow as tf
from tensorflow.keras.layers import Input, Activation, Embedding, LSTM, Dense, Dropout, Flatten, Concatenate, Dot
from tensorflow.keras.models import Model
from tensorflow.keras.utils import Sequence

#tf.config.set_visible_devices([], 'GPU')

strategy = tf.distribute.MirroredStrategy()

words = tf.constant(((1, 1, 1, 1, 1), (1, 1, 1, 1, 1)))
products = tf.ones((10000, 101), dtype=tf.int32)

test_dataset = tf.data.Dataset.from_tensor_slices(products)
test_dataset = test_dataset.batch(128, drop_remainder=True)
test_dataset = strategy.experimental_distribute_dataset(test_dataset)

def create_model(words_count):
    input_words = Input(shape=tf.TensorShape(5), dtype='int32', name='input_words')
    x = Embedding(output_dim=32, input_dim=words_count, input_length=5, mask_zero=True)(input_words)
    #x = Embedding(output_dim=32, input_dim=words_count, input_length=5, mask_zero=False)(input_words)
    x = LSTM(32)(x)
    #x = tf.keras.layers.RNN(tf.keras.layers.LSTMCell(32))(x)
    model = Model(inputs=input_words, outputs=x)
    return model

with strategy.scope():
    model = create_model(11111)
    optimizer = tf.keras.optimizers.SGD()

print(model.summary())

@tf.function
def test_step(b_cmp):
    cmp_words = tf.gather(words, b_cmp)
    tmp = tf.reshape(cmp_words, (-1, 5))
    tmp = model(tmp, training=False)
    # ...
    r = tf.reduce_sum(tmp)
    return r

for b_cmp in test_dataset:
    strategy.run(test_step, args=(b_cmp,))
```
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 17 (11 by maintainers)
I'm also seeing this same error, and I can confirm it is triggered by the combination of cuDNN + LSTM + masking.
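Worth noting: the words tensor in the repro contains no zeros, so the mask produced by mask_zero=True is all-True. A pure-Python sketch of that check (no TensorFlow required):

```python
# words from the repro above: two rows of five ones -- no padding zeros at all.
words = ((1, 1, 1, 1, 1), (1, 1, 1, 1, 1))

# mask_zero=True treats token id 0 as padding, so the computed mask is
# (token != 0) per timestep; here that is True everywhere.
mask = [[token != 0 for token in row] for row in words]
print(mask)  # [[True, True, True, True, True], [True, True, True, True, True]]
```

Which suggests the masked code path itself, rather than any particular mask values, is what triggers the failure.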
Bisection reveals that this regression was introduced in 2.2.0rc2.