tensorflow: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure tensorflow/stream_executor/dnn.cc:613] CUDNN_STATUS_INTERNAL_ERROR
I am unable to train my model. I get the error below at a seemingly random point, usually around the 100th epoch, sometimes around the 500th, and occasionally the error does not appear at all.
- Python version: 3.7
- TensorFlow version: 2.3
- TensorFlow installed from: binary
- OS: Windows 10
- GPU card: NVIDIA Titan V
- CUDA: 10.1
- cuDNN: 7.6.5.32
failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-07-07 17:15:37.551112: E tensorflow/stream_executor/dnn.cc:613] CUDNN_STATUS_INTERNAL_ERROR in tensorflow/stream_executor/cuda/cuda_dnn.cc(1867): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-07 17:15:37.551176: F tensorflow/stream_executor/cuda/cuda_dnn.cc:189] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0) Failed to set cuDNN stream.
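For context, the failing call in the log, cudnnRNNForwardTraining, is the cuDNN routine used by GPU-accelerated LSTM/GRU layers. A minimal sketch of the kind of training loop that exercises this code path; the model shape, data, and epoch count are placeholders and not taken from the original report:

```python
import numpy as np
import tensorflow as tf

# Placeholder data; shapes and sizes are illustrative only.
x = np.random.rand(1024, 50, 32).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

model = tf.keras.Sequential([
    # With default settings on a GPU, this LSTM dispatches to the cuDNN
    # kernel (cudnnRNNForwardTraining), the call that fails in the log above.
    tf.keras.layers.LSTM(64, input_shape=(50, 32)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# The report only sees the failure after many epochs (~100-500), at random.
model.fit(x, y, batch_size=64, epochs=500)
```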
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary):
- TensorFlow version (use command below):
- Python version:
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version:
- GPU model and memory:
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:
- TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
- TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
Describe the current behavior
Describe the expected behavior
Standalone code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.
Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 122 (18 by maintainers)
NVIDIA probably also needs to be involved for this to get fixed.
I think I may have finally resolved this issue. It seems I had two CUDA versions installed, 10.1 and 10.2, and my installation was pointing to the 10.2 version. After uninstalling all the other versions and keeping only 10.1, I was finally able to train more than once without issues. I will run a couple more tests. Bottom line: after uninstalling all my previous CUDA/cuDNN versions and reinstalling the correct ones, the issue seems to be resolved.
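Not part of the original comment, but a quick way to check for this kind of mismatch is to compare the CUDA/cuDNN versions the installed TensorFlow wheel was built against with whatever toolkit is found first on PATH. A small sketch, assuming TF 2.3+ where tf.sysconfig.get_build_info() is available (the exact dictionary keys can differ slightly between releases):

```python
import shutil
import tensorflow as tf

# Versions the installed TF wheel was built against.
info = tf.sysconfig.get_build_info()
print("built against CUDA :", info.get("cuda_version"))
print("built against cuDNN:", info.get("cudnn_version"))

# Which CUDA toolkit is actually found first on PATH.
print("nvcc resolves to   :", shutil.which("nvcc"))
```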
@sushreebarsa I am testing right now with CUDNN 8.1.1 and CUDA 11.0. I will report back shortly.
Actually, I had the correct driver version installed (the one in the link you sent); I had confused the new driver update message with the installed version. I am still at 6 seconds even after uninstalling nvidia-440…
Same for me. 2.4 was core dumping, and I had to install 2.3.1, but it was the CPU version. If I get it to work with the GPU, I will let you know.
But man, after all these months, finally a solution to the above problem! Today is the first time I can train without issues for hours!
@ion-elgreco I’m not aware of a good reason why training would become slower after enabling mixed precision… The initial graph building, yes.
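Purely as an illustration of what “enabling mixed precision” refers to here (not something posted in this thread): in Keras it is a one-line policy switch. The sketch below uses the TF 2.4+ API; TF 2.3 has an equivalent under tf.keras.mixed_precision.experimental.

```python
import tensorflow as tf

# TF 2.4+ API; in TF 2.3 use
# tf.keras.mixed_precision.experimental.set_policy("mixed_float16").
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(32,)),
    # Keep the output layer in float32 so the loss stays numerically stable.
    tf.keras.layers.Dense(1, dtype="float32"),
])
model.compile(optimizer="adam", loss="mse")
```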
Do I understand correctly that you’re also running TF 2.3 right now on this new desktop with a Ryzen CPU, just switching the active CUDA version? So TF 2.3 is not crashing while TF 2.4 does, with the exact same hardware and software configuration?
A hardware issue can’t be ruled out either. This one is from PyTorch, but still relevant: https://discuss.pytorch.org/t/runtimeerror-cudnn-status-internal-error-when-l-run-the-program-for-a-second-time/2960/4
As there can be many causes of CUDNN_STATUS_INTERNAL_ERROR, the only pragmatic way to help here is to also provide a dataset (processed enough to avoid legal issues) that is as small and compact as possible while still reproducing the issue with TF 2.4, along with the affected version info, etc. It might be worth opening this as a separate issue, as the cause seems to be different from the one reported by @nectario.
I noticed a comment about using conda. To be 100% clean, I’d definitely try to reproduce this with a fresh venv install via pip as well — something like the sketch below.
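The exact commands are not in the comment, so as a stand-in, here is a small scripted equivalent of that suggestion: create a fresh venv and pip-install a TF 2.4 wheel into it. The environment name and the version pin are just examples.

```python
import subprocess
import sys
import venv

# Create an isolated environment with pip available.
env_dir = "tf24-clean"
venv.create(env_dir, with_pip=True)

# The venv's interpreter lives under Scripts\ on Windows, bin/ elsewhere.
py = (f"{env_dir}/Scripts/python.exe" if sys.platform == "win32"
      else f"{env_dir}/bin/python")

# Install TensorFlow into the clean environment and print its version.
subprocess.check_call([py, "-m", "pip", "install", "tensorflow==2.4.*"])
subprocess.check_call([py, "-c",
                       "import tensorflow as tf; print(tf.version.VERSION)"])
```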
I believe it’s been acknowledged, but this issue is unique in that it’s hard to reproduce. If they try to reproduce it on any cloud machine, it will never happen; it requires these consumer-grade graphics cards, like the Titan V I have or the RTX 3080 you have. It’s good that we are keeping this thread “warm”.
I hope someone from TensorFlow can update us here on whether they are investigating it and what is causing it.
My resolution for now: I reduced the number of epochs per run; when it fails, I load the last saved weights and continue on.
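A rough sketch of that checkpoint-and-resume pattern; the model, data, and epoch numbers are placeholders, not taken from the comment.

```python
import numpy as np
import tensorflow as tf

# Tiny placeholder model and data, purely for illustration.
x = np.random.rand(256, 20, 8).astype("float32")
y = np.random.rand(256, 1).astype("float32")
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(16, input_shape=(20, 8)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Save weights after every epoch so a crash loses at most one epoch of work.
ckpt = tf.keras.callbacks.ModelCheckpoint(
    "weights.{epoch:03d}.h5", save_weights_only=True)

model.fit(x, y, epochs=100, callbacks=[ckpt])

# After a crash, reload the last saved checkpoint and resume from that epoch.
model.load_weights("weights.100.h5")
model.fit(x, y, initial_epoch=100, epochs=200, callbacks=[ckpt])
```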