tensorflow: CuDNN LSTM failed with large batch size

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Pro x64
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: -
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.2, 2.1
  • Python version: 3.7.6
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: 10.2, 10.1
  • GPU model and memory: 1 x 2080Ti 11GB, 2 x 2080 8GB

Describe the current behavior

The sample code below fails with the following error:

2020-05-12 12:59:37.956635: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-12 12:59:38.245540: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-12 12:59:39.426585: E tensorflow/stream_executor/dnn.cc:613] CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(1847): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-05-12 12:59:39.446537: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 32, 32, 1, 5, 12928, 32]
2020-05-12 12:59:39.457993: I tensorflow/stream_executor/stream.cc:1990] [stream=0000022BBB859000,impl=0000022BBF43D210] did not wait for [stream=0000022BBB859180,impl=0000022BBF43D930]
2020-05-12 12:59:39.462903: I tensorflow/stream_executor/stream.cc:4938] [stream=0000022BBB859000,impl=0000022BBF43D210] did not memcpy host-to-device; source: 0000022BB566AF00
2020-05-12 12:59:39.467874: F tensorflow/core/common_runtime/gpu/gpu_util.cc:340] CPU->GPU Memcpy failed

If I disable the GPU device, set mask_zero to False, or force the non-cuDNN LSTM implementation (via tf.keras.layers.RNN(tf.keras.layers.LSTMCell(32))(x)), the code works.
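For reference, here are the three workarounds in one place; they correspond to the commented-out lines in the repro below, and any one of them is enough to make the code run:

# 1. Hide the GPU entirely (runs on CPU):
tf.config.set_visible_devices([], 'GPU')

# 2. Disable masking in the Embedding layer:
x = Embedding(output_dim=32, input_dim=words_count, input_length=5, mask_zero=False)(input_words)

# 3. Force the generic (non-cuDNN) LSTM kernel:
x = tf.keras.layers.RNN(tf.keras.layers.LSTMCell(32))(x)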

I think it is somehow related to the fact that on each batch the model processes not 128 items (the batch size) but 128x101 = 12928 items, which matches the batch_size of 12928 reported in the cuDNN error above.
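A quick shape check, using the definitions from the repro below, shows the effective batch the cuDNN kernel receives:

import tensorflow as tf

words = tf.constant(((1, 1, 1, 1, 1), (1, 1, 1, 1, 1)))  # shape (2, 5), as in the repro
b_cmp = tf.ones((128, 101), dtype=tf.int32)              # one 128-item batch of `products`

cmp_words = tf.gather(words, b_cmp)                      # shape (128, 101, 5)
tmp = tf.reshape(cmp_words, (-1, 5))                     # shape (12928, 5)
print(tmp.shape[0])                                      # 12928 = 128 * 101, the batch_size in the error above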

Describe the expected behavior

The code should run without this error.

Standalone code to reproduce the issue

import tensorflow as tf

from tensorflow.keras.layers import Input, Activation, Embedding, LSTM, Dense, Dropout, Flatten, Concatenate, Dot
from tensorflow.keras.models import Model
from tensorflow.keras.utils import Sequence

#tf.config.set_visible_devices([], 'GPU')

strategy = tf.distribute.MirroredStrategy()

words = tf.constant(((1,1,1,1,1),(1,1,1,1,1)))
products = tf.ones((10000,101), dtype=tf.int32)

test_dataset = tf.data.Dataset.from_tensor_slices(products)
test_dataset = test_dataset.batch(128, drop_remainder=True)
test_dataset = strategy.experimental_distribute_dataset(test_dataset)

def create_model(words_count):
    input_words = Input(shape=tf.TensorShape(5), dtype='int32', name='input_words')

    x = Embedding(output_dim=32, input_dim=words_count, input_length=5, mask_zero=True)(input_words)
    #x = Embedding(output_dim=32, input_dim=words_count, input_length=5, mask_zero=False)(input_words)

    x = LSTM(32)(x)
    #x = tf.keras.layers.RNN(tf.keras.layers.LSTMCell(32))(x)

    model = Model(inputs=input_words, outputs=x)
    return model

with strategy.scope():
    model = create_model(11111)
    optimizer = tf.keras.optimizers.SGD()

model.summary()

@tf.function
def test_step(b_cmp):
    cmp_words = tf.gather(words, b_cmp)
    tmp = tf.reshape(cmp_words, (-1,5))
    tmp = model(tmp, training=False)
   
    # ...

    r = tf.reduce_sum(tmp)
    return r

for b_cmp in test_dataset:
    strategy.run(test_step, args=(b_cmp,))

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (11 by maintainers)

Most upvoted comments

I’m also seeing this same error, which I can confirm is triggered by the combination of cuDNN + LSTM + masking.

Bisection reveals that this regression was introduced in 2.2.0rc2.
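If it helps triage, here is a stripped-down sketch of that combination (a masked Embedding feeding a Keras LSTM on GPU with a large batch), reduced from the repro above; whether the MirroredStrategy / tf.function parts are also needed to trigger the crash is an open question:

import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM
from tensorflow.keras.models import Model

# Masked Embedding -> LSTM, which can take the cuDNN kernel on GPU.
inp = Input(shape=(5,), dtype='int32')
x = Embedding(output_dim=32, input_dim=11111, input_length=5, mask_zero=True)(inp)
x = LSTM(32)(x)
model = Model(inp, x)

# Large effective batch, comparable to the 12928 rows in the failing call above.
big_batch = tf.ones((12928, 5), dtype=tf.int32)
model(big_batch, training=False)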