tensorflow: CudnnLSTM variable sequence length sometimes fails with CUDNN_STATUS_EXECUTION_FAILED
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian/Sid (2020-07-01), Ubuntu 18.04
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
- TensorFlow installed from (source or binary): source and binary
- TensorFlow version (use command below): 1.15
- Python version: 3.6, 3.7.8
- Bazel version (if compiling from source): 0.26.1
- GCC/Compiler version (if compiling from source): 9.0
- CUDA/cuDNN version: 10.0/7.4.1; 10.0/7.4.2.1; 10.0/7.5.1.10; 10.0/7.6.5.32
- GPU model and memory: 2x RTX 2080 Ti; 4x GTX 1080 Ti
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)": v1.15.3-0-g4386a6640c
Describe the current behavior: Training with some datasets triggers:
2020-07-22 16:15:42.108252: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_de
sc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 16:15:42.108385: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1527 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048]
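For context, a minimal sketch of the kind of tf.contrib.cudnn_rnn.CudnnLSTM call with per-example sequence lengths that exercises this cudnnRNNForwardTrainingEx path (this is not the reporter's standalone reproducer; the shapes mirror the model config in the log, everything else is assumed):

```python
import tensorflow as tf  # TF 1.15

# Time-major inputs: [max_seq_length, batch_size, input_size]. The time
# dimension is left dynamic because the padded length varies from batch to
# batch; batch_size=2 and input_size=2048 mirror the model config above.
inputs = tf.placeholder(tf.float32, [None, 2, 2048])
seq_lengths = tf.placeholder(tf.int32, [2])

# Single-layer LSTM with 2048 units, as in the reported config.
lstm = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=1, num_units=2048)

# Passing sequence_lengths takes the variable-sequence-length cuDNN path,
# i.e. the cudnnRNNForwardTrainingEx call that fails in the log above.
outputs, _ = lstm(inputs, sequence_lengths=seq_lengths, training=True)
```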
Describe the expected behavior: Training should succeed, or TensorFlow/cuDNN should surface a more actionable error.
Standalone code to reproduce the issue: Will be provided later.
Other info / logs: Will be provided later. A noisy debugging session can be seen at https://github.com/mozilla/DeepSpeech/issues/3088
About this issue
- State: closed
- Created 4 years ago
- Comments: 35 (29 by maintainers)
Commits related to this issue
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to lissyx/tensorflow by lissyx 4 years ago
- Merge pull request #42634 from lissyx/update-r1.15-issue41630 Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to tensorflow/tensorflow by mihaimaruseac 4 years ago
- Fix #41630: include max_seq_length in cudnn descriptor cache key — committed to chenyu-jiang/tensorflow by lissyx 4 years ago
- Revert "Fix #41630: include max_seq_length in cudnn descriptor cache key" This reverts commit cc3e5a02d4f623a9ad23f0bc330a984ddabfa728. — committed to chenyu-jiang/tensorflow by chenyu-jiang 4 years ago
Yep, it’s done: https://github.com/tensorflow/tensorflow/pull/41832
Thanks, no offense but I’ll claim victory once I get feedback from @kaixih 😃
@lissyx Looks very plausible as the root cause to me! It makes sense from a code point of view, and it fits with and explains all the patterns we saw while testing and debugging this issue. The same code also seems to still be present in TF 2.x and master, which correlates with the other reports you found about LSTM training with CUDA/cuDNN being unstable; those are likely related. Great catch!
Can you try to fetch the cuDNN logs with the following env vars and attach the somefile.log? (If it is too large, we may only need the last part, the one containing the cudnnRNNForwardTrainingEx call.) @lissyx
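The env vars in question are presumably cuDNN's API-logging switches, CUDNN_LOGINFO_DBG and CUDNN_LOGDEST_DBG; a minimal sketch of enabling them from Python before TensorFlow initializes cuDNN:

```python
import os

# cuDNN reads these when it is initialized, so set them before importing
# TensorFlow (or export them in the shell before launching training).
os.environ["CUDNN_LOGINFO_DBG"] = "1"             # turn on cuDNN API logging
os.environ["CUDNN_LOGDEST_DBG"] = "somefile.log"  # log destination file

import tensorflow as tf  # imported after the env vars on purpose
```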
More details: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#api-logging