tensorflow: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure tensorflow/stream_executor/dnn.cc:613] CUDNN_STATUS_INTERNAL_ERROR

I am unable to train my model reliably. The error below appears at a random point, usually around the 100th epoch and sometimes around the 500th. Occasionally a run completes without the error.

  • Python version: 3.7
  • TensorFlow version: 2.3
  • TensorFlow installed from: binary
  • OS: Windows 10
  • GPU: Nvidia Titan V
  • CUDA: 10.1
  • cuDNN: 7.6.5.32

failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-07-07 17:15:37.551112: E tensorflow/stream_executor/dnn.cc:613] CUDNN_STATUS_INTERNAL_ERROR in tensorflow/stream_executor/cuda/cuda_dnn.cc(1867): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-07 17:15:37.551176: F tensorflow/stream_executor/cuda/cuda_dnn.cc:189] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)Failed to set cuDNN stream.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 122 (18 by maintainers)

Most upvoted comments

Probably NVIDIA needs to be involved also for this to be fixed.

I think I may have finally resolved this issue. It seems I had two CUDA versions installed, 10.1 and 10.2, and my installation was pointing to 10.2. After uninstalling all the other versions and keeping only 10.1, I was finally able to train more than once without issues. I will run a couple more tests. Bottom line: uninstalling all my previous CUDA/cuDNN versions and reinstalling the correct ones seems to have resolved the issue.
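A quick way to spot the situation described above (several CUDA toolkits visible at once) is to scan PATH for toolkit directories. The following is a minimal sketch, not an official diagnostic. It assumes the default Windows installer layout, where each toolkit lives under a directory such as ...\CUDA\v10.1\bin; the function name cuda_versions_on_path is made up for this example.

```python
import os
import re

# Assumption: toolkits follow the default installer layout ...\CUDA\vX.Y\...
CUDA_DIR = re.compile(r"[\\/]CUDA[\\/]v(\d+\.\d+)(?:[\\/]|$)", re.IGNORECASE)


def cuda_versions_on_path(path_value, sep=os.pathsep):
    """Return the CUDA versions found in PATH entries, in search order, deduplicated."""
    versions = []
    for entry in path_value.split(sep):
        match = CUDA_DIR.search(entry)
        if match and match.group(1) not in versions:
            versions.append(match.group(1))
    return versions


if __name__ == "__main__":
    found = cuda_versions_on_path(os.environ.get("PATH", ""))
    # Earlier PATH entries are searched first, so the first version listed
    # is the one most likely to be picked up.
    print("CUDA versions on PATH:", found or "none")
```

If more than one version is printed, or the first one is not the version your TensorFlow build was tested against, that matches the mismatch reported in this comment.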

@nectario Could you please check the tested build configurations and let us know if that helps? Thank you!

@sushreebarsa I am testing right now with CUDNN 8.1.1 and CUDA 11.0. I will report back shortly.

@ion-elgreco

  1. core dump.

Could you explain in detail which OS build, driver versions, etc. you needed to get it to work?

Here is the info:

  • Build: 10.0.21277 (Build 21277)
  • Nvidia Driver Version: 460.89 (Studio)

As for the commands, I ran a bunch, but I think these made it work:

wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/nvidia-machine-learning-repo-ubuntu2004_1.0.0-1_amd64.deb

sudo apt install ./nvidia-machine-learning-repo-ubuntu2004_1.0.0-1_amd64.deb

sudo apt install nvidia-cuda-toolkit

Download cuDNN 7.6.5 for Ubuntu 18.04 (there is no 20.04 version, and you need to log in and download it manually):

https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.1_20191031/Ubuntu18_04-x64/libcudnn7_7.6.5.32-1%2Bcuda10.1_amd64.deb

sudo apt install ./libcudnn7_7.6.5.32-1%2Bcuda10.1_amd64.deb


sudo apt install nvidia-driver-440

type nvidia-smi to verify

Make sure you install TensorFlow 2.3.1.
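The nvidia-smi check in the steps above can also be scripted, which makes it easier to paste the driver version into a report. This is a minimal sketch that relies on nvidia-smi's --query-gpu and --format=csv,noheader options; the helper names parse_gpu_csv and query_gpus are made up for this example, and the function returns an empty list when nvidia-smi is missing or fails.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, driver_version' CSV lines into (name, driver) tuples."""
    rows = []
    for line in text.strip().splitlines():
        # Split on the last comma only, in case the GPU name contains one.
        name, driver = [field.strip() for field in line.rsplit(",", 1)]
        rows.append((name, driver))
    return rows


def query_gpus():
    """Return (name, driver_version) per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for name, driver in query_gpus():
        print(f"{name}: driver {driver}")
```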

Hmm, interesting, because you are installing the Nvidia driver inside the WSL Ubuntu environment. As I understand it, the proper way to use your GPU is to install the driver only on your main OS and not inside WSL Ubuntu. Maybe that’s why it’s slower? I’ll give your commands a try later this week! Thanks.

Maybe you are right. I was having a ton of issues before that. I will try uninstalling it and see what happens.

Only the developer driver version is supported according to Nvidia. https://developer.nvidia.com/cuda/wsl/download

Actually, I had the correct driver version installed (the one in the link you sent). I confused the new driver update message with the version installed. I am still at 6 seconds even after uninstalling nvidia-440…

Same for me. 2.4 was core dumping, so I had to install 2.3.1, but it was the CPU version. If I get it to work with the GPU I will let you know.

But man, after all these months, finally a solution to the above problem! Today is the first time I can train with no issues for hours!

@ion-elgreco I’m not aware of a good reason why training would become slower after enabling mixed precision. Initial graph building, yes.

Do I understand correctly that you’re also running TF 2.3 right now on this new desktop with a Ryzen CPU, and just switching between active CUDA versions? So TF 2.3 is not crashing and TF 2.4 is, with the exact same hardware and software configuration?

I couldn’t rule out a hardware issue either. This thread is from PyTorch, but it is still relevant: https://discuss.pytorch.org/t/runtimeerror-cudnn-status-internal-error-when-l-run-the-program-for-a-second-time/2960/4

As there can be many causes for CUDNN_STATUS_INTERNAL_ERROR, the only pragmatic way to get help here is to also provide a dataset (processed enough to avoid legal issues) that is as small and compact as possible while still reproducing the issue with TF 2.4. Also include the affected version info, etc. It might be worth opening a separate issue, as the cause seems to be different from the one @nectario hit.

I noticed a comment about using conda. To be 100% clean, I’d definitely also try to reproduce this with a clean venv install with pip. Something like:

cd projdir
\python38\python.exe -m venv .venv
.\.venv\Scripts\activate (or .\.venv\Scripts\Activate.ps1 for PowerShell)
python -m pip install -U pip
pip install tensorflow
python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])));print(tf.config.list_physical_devices('GPU'))"
python yourtraining.py

I believe it’s been acknowledged, but this issue is unusual in that it’s hard to reproduce. If they try to reproduce it on a cloud machine, it will never happen. It seems to require consumer graphics cards, like the Titan V I have or the RTX 3080 you have. It’s good that we are keeping this thread “warm”.

I hope someone from TensorFlow can update us here on whether they are investigating what is causing it.

My resolution for now: I reduced the number of epochs. When it fails, I load the last saved weights and continue on.
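The reload-and-continue workaround above can be automated. Because the failure in the log is a fatal check (the line marked F), the Python process is aborted and a try/except inside the training script will not catch it, so the restart has to come from outside. The following is a minimal sketch of such a supervisor using only the standard library. The name run_until_done is made up for this example, and it assumes the training script saves its weights regularly and reloads the latest ones at startup.

```python
import subprocess
import sys


def run_until_done(cmd, max_restarts=5):
    """Run `cmd` repeatedly until it exits with code 0.

    Returns the number of attempts used. Raises RuntimeError if the command
    still fails after the first run plus `max_restarts` restarts.
    """
    for attempt in range(1, max_restarts + 2):
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            return attempt
        # A non-zero exit code covers both Python exceptions and native aborts.
        print(f"attempt {attempt} exited with code {returncode}", file=sys.stderr)
    raise RuntimeError(f"{cmd!r} still failing after {max_restarts} restarts")
```

A call such as run_until_done([sys.executable, "train.py"]) would then keep restarting a hypothetical train.py until it finishes. Without the save-and-reload step inside that script, every restart would begin from scratch.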