tensorflow: CudnnLSTM variable sequence length sometimes fails with CUDNN_STATUS_EXECUTION_FAILED
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian/Sid (2020-07-01), Ubuntu 18.04
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
- TensorFlow installed from (source or binary): source and binary
- TensorFlow version (use command below): 1.15
- Python version: 3.6, 3.7.8
- Bazel version (if compiling from source): 0.26.1
- GCC/Compiler version (if compiling from source): 9.0
- CUDA/cuDNN version: 10.0/7.4.1; 10.0/7.4.2.1; 10.0/7.5.1.10; 10.0/7.6.5.32
- GPU model and memory: 2x RTX 2080 Ti; 4x GTX 1080 Ti
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)": v1.15.3-0-g4386a6640c
Describe the current behavior: Training with some datasets triggers:
2020-07-22 16:15:42.108252: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_de
sc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 16:15:42.108385: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1527 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048]
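For context, a minimal sketch of the kind of tf.contrib.cudnn_rnn.CudnnLSTM call with per-example sequence lengths that exercises this cudnnRNNForwardTrainingEx path (this is not the reporter's standalone reproducer; the shapes mirror the model config in the log, everything else is assumed):

```python
import tensorflow as tf  # TF 1.15

# Time-major inputs: [max_seq_length, batch_size, input_size]. The time
# dimension is left dynamic because the padded length varies from batch to
# batch; batch_size=2 and input_size=2048 mirror the model config above.
inputs = tf.placeholder(tf.float32, [None, 2, 2048])
seq_lengths = tf.placeholder(tf.int32, [2])

# Single-layer LSTM with 2048 units, as in the reported config.
lstm = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=1, num_units=2048)

# Passing sequence_lengths takes the variable-sequence-length cuDNN path,
# i.e. the cudnnRNNForwardTrainingEx call that fails in the log above.
outputs, _ = lstm(inputs, sequence_lengths=seq_lengths, training=True)
```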
Describe the expected behavior: Training should succeed, or TensorFlow/cuDNN should surface a more actionable error.
Standalone code to reproduce the issue: Will be provided later.
Other info / logs: Will be provided later. A noisy debugging session can be seen at https://github.com/mozilla/DeepSpeech/issues/3088
About this issue
- State: closed
- Created 4 years ago
- Comments: 35 (29 by maintainers)
Commits related to this issue
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Merge pull request #42634 from lissyx/update-r1.15-issue41630 Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to tensorflow/tensorflow by mihaimaruseac 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to chenyu-jiang/tensorflow by lissyx 4 years ago
- Revert "Fix #41630: include max_seq_length in cudnn descriptor cache key" This reverts commit cc3e5a02d4f623a9ad23f0bc330a984ddabfa728. — committed to chenyu-jiang/tensorflow by chenyu-jiang 4 years ago
Yep, it’s done: https://github.com/tensorflow/tensorflow/pull/41832
Thanks, no offense but I’ll claim victory once I get feedback from @kaixih 😃
@lissyx Looks very plausible as the root cause to me! It makes sense from a code point of view, and it fits with and explains all the patterns we saw while testing and debugging this issue. The same code also seems to still be present in TF 2.x and master, which correlates with the other reports you found about LSTM training with CUDA/cuDNN being unstable; those are likely related. Great catch!
Can you try to fetch the cuDNN logs with the following env vars and attach the somefile.log? (If it is too large, we may only need the last part, the one containing the cudnnRNNForwardTrainingEx call.) @lissyx
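The env vars in question are presumably cuDNN's API-logging switches, CUDNN_LOGINFO_DBG and CUDNN_LOGDEST_DBG; a minimal sketch of enabling them from Python before TensorFlow initializes cuDNN:

```python
import os

# cuDNN reads these when it is initialized, so set them before importing
# TensorFlow (or export them in the shell before launching training).
os.environ["CUDNN_LOGINFO_DBG"] = "1"             # turn on cuDNN API logging
os.environ["CUDNN_LOGDEST_DBG"] = "somefile.log"  # log destination file

import tensorflow as tf  # imported after the env vars on purpose
```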
More details: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#api-logging