tensorflow: CudnnLSTM variable sequence length sometimes fails with CUDNN_STATUS_EXECUTION_FAILED

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian/Sid (2020-07-01), Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): source and binary
  • TensorFlow version (use command below): 1.15
  • Python version: 3.6, 3.7.8
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): 9.0
  • CUDA/cuDNN version: 10.0/7.4.1; 10.0/7.4.2.1; 10.0/7.5.1.10; 10.0/7.6.5.32
  • GPU model and memory: 2x RTX 2080 Ti; 4x GTX 1080 Ti

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)": v1.15.3-0-g4386a6640c

Describe the current behavior

Training with certain datasets triggers:

2020-07-22 16:15:42.108252: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED                                                                                                          
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 16:15:42.108385: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1527 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048] 
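For reference, the enum triple in that model config can be decoded as follows (a convenience sketch; it assumes TF's enum ordering mirrors cuDNN's constants, i.e. CUDNN_LSTM = 2, CUDNN_LINEAR_INPUT = 0, CUDNN_UNIDIRECTIONAL = 0):

RNN_MODE = {0: "rnn_relu", 1: "rnn_tanh", 2: "lstm", 3: "gru"}
INPUT_MODE = {0: "linear_input", 1: "skip_input"}
DIRECTION = {0: "unidirectional", 1: "bidirectional"}

mode, input_mode, direction = 2, 0, 0  # values from the log above
print(RNN_MODE[mode], INPUT_MODE[input_mode], DIRECTION[direction])
# -> lstm linear_input unidirectional

So the failing op is a single-layer, unidirectional LSTM with linear input.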

Describe the expected behavior

Training should succeed, or TensorFlow/cuDNN should surface a more actionable error.

Standalone code to reproduce the issue

Will be provided later.
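In the meantime, a minimal sketch of the kind of usage that triggers it, on random data (this is not the actual DeepSpeech code; the shapes mirror the model config in the log, and the sequence_lengths argument is what exercises the variable-sequence-length path):

import numpy as np
import tensorflow as tf  # 1.15

# 1 layer, 2048 units, input_size 2048, batch 2, max_seq_length 75,
# matching the [num_layers, input_size, num_units, ...] line in the log.
lstm = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=1, num_units=2048)

inputs = tf.placeholder(tf.float32, [75, 2, 2048])  # time-major [T, B, F]
seq_len = tf.placeholder(tf.int32, [2])             # per-example lengths
outputs, _ = lstm(inputs, sequence_lengths=seq_len, training=True)
train_op = tf.train.AdamOptimizer(1e-4).minimize(tf.reduce_mean(outputs))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, feed_dict={
        inputs: np.random.randn(75, 2, 2048).astype(np.float32),
        seq_len: np.array([75, 40], np.int32),      # unequal lengths
    })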

Other info / logs

Will be provided later. A rather noisy debugging session can be seen at https://github.com/mozilla/DeepSpeech/issues/3088

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 35 (29 by maintainers)

Most upvoted comments

Right, I think Google only fixes major issues in 1.15. Can you create a PR against master first? They might then cherry-pick it into 1.15 if necessary. @sanjoy

Yep, it’s done: https://github.com/tensorflow/tensorflow/pull/41832

Great catch!

Thanks! No offense, but I'll only claim victory once I get feedback from @kaixih 😃

@lissyx Looks very plausible as the root cause to me! It makes sense from a code point of view, and it also fits with and explains all the patterns we saw while testing and debugging this issue. The same code still seems to be present in TF 2.x and master, which correlates with all the other reports you found about LSTM training with CUDA/cuDNN being unstable; those are likely related. Great catch!

Can you try to capture the cuDNN logs with the following env vars and attach somefile.log? (If it is too large, we may only need the last part, which contains the cudnnRNNForwardTrainingEx call.) @lissyx

export CUDNN_LOGINFO_DBG=1             # enable cuDNN API logging
export CUDNN_LOGDEST_DBG=somefile.log  # write the log to this file instead of stdout/stderr

More details: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#api-logging
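In practice that could look something like the following (train.py is a placeholder for the actual training invocation; only the tail of the log around the failing call tends to be interesting):

import os, subprocess

# Run the training script with cuDNN API logging enabled.
env = dict(os.environ, CUDNN_LOGINFO_DBG="1", CUDNN_LOGDEST_DBG="somefile.log")
subprocess.run(["python", "train.py"], env=env)

# Print only the context around the last cudnnRNNForwardTrainingEx entry.
lines = open("somefile.log").readlines()
hits = [i for i, l in enumerate(lines) if "cudnnRNNForwardTrainingEx" in l]
if hits:
    print("".join(lines[max(0, hits[-1] - 5):hits[-1] + 40]))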