tensorflow: GPU-accelerated LSTMs crash randomly with: [ InternalError: [_Derived_] Failed to call ThenRnnBackward with model config ]
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Pro N, Build 17763
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
- TensorFlow installed from (source or binary): Pypi
- TensorFlow version (use command below): v2.1.0-rc2-17-ge5bf8de410 2.1.0
- Python version: 3.7.6
- Bazel version (if compiling from source): -
- GCC/Compiler version (if compiling from source): -
- CUDA/cuDNN version: CUDA 10.1, cudnn-10.1-windows10-x64-v7.6.5.32
- GPU model and memory: GTX 1060, 6 GB
Describe the current behavior
Dear Tensorflow-Developers,
my Jupyter notebook that is training some LSTMs on the GPU crashes after some time with the following traceback:
InternalError: [_Derived_] Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 100, 100, 1, 249, 32, 100]
[[{{node gradients/CudnnRNN_grad/CudnnRNNBackprop}}]]
[[StatefulPartitionedCall_1]] [Op:__inference_distributed_function_7604]
Function call stack:
distributed_function -> distributed_function -> distributed_function
This crash happens after a random number of epochs (sometimes 6, sometimes 130+, sometimes 300+). It also crashes on different Windows machines with different GPUs.
Please see this minimal notebook to reproduce the behaviour that also includes the whole stacktrace: https://gist.github.com/jliebers/995c3c4da4ad2a6f9376d31ee2470ec5
In the stacktrace I can find the following line:
130 # TODO(kaftan): File bug about tf function and errors.OutOfRangeError?
I wonder if this is connected to this issue? 🙂
On a CPU-training everything works well and stable.
Thank you kindly in advance for your consideration and great work. 🚀
Describe the expected behavior
The GPU-accelerated LSTM should not crash randomly.
Standalone code to reproduce the issue
https://gist.github.com/jliebers/995c3c4da4ad2a6f9376d31ee2470ec5
Other info / logs
For the full traceback, please check the gist from above.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 9
- Comments: 59 (9 by maintainers)
Hi,
so the following workaround has been found thanks to @DietmarKracht. Now I am able to train LSTMs on my GPU without the error from the first post in this issue. 🎉 🎉 🎉
To train the LSTM model on a GPU on my platform (Windows 10, tf 2.1.0), the parameter batch_input_shape must be specified in the very first layer during model creation, and the parameter input_shape must be omitted. The first layer of the LSTM model should look like the sketch below.
Notice: I assume that if batch_input_shape is not specified it defaults to some value, and this issue arises randomly as a consequence. As it could not be reproduced in Colab, I guess that it is a platform-specific problem (see first post for my specs).
Important: An int, batch_size, is specified in batch_input_shape=(batch_size, n_timesteps, n_features). This int must divide the length of X (X is passed to model.fit()) without a remainder, i.e. len(X) % batch_size == 0! Additionally, one must not use the validation_split parameter in model.fit() (see issue #37840). Then it works without any issues with tf 2.1.0 on Windows 10 (finally! phew!). 🙂
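A minimal sketch of such a first layer (my own reconstruction rather than the exact notebook code; the values are placeholders that mirror the model config from the traceback above):

```python
import tensorflow as tf

batch_size = 32      # must divide len(X) evenly
n_timesteps = 249    # placeholder sequence length
n_features = 100     # placeholder feature count

model = tf.keras.Sequential([
    # batch_input_shape fixes the batch dimension explicitly; input_shape is omitted
    tf.keras.layers.LSTM(100,
                         batch_input_shape=(batch_size, n_timesteps, n_features)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Fit without validation_split and with len(X) % batch_size == 0:
# model.fit(X, y, batch_size=batch_size, epochs=10)
```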
Please find a minimal example (works for me on GPU and CPU) here: https://gist.github.com/jliebers/7effb38e836ab3c6e95bd122589f5f92
Sadly, it is nowhere mentioned in the documentation and it took us a week to solve this issue. I hope this post is helpful for people in the future.
Update:
Should the kernel die at any point with the following trace, then lower your batch_size to a smaller divisor of len(X):
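A hypothetical helper for picking such a divisor (my own illustration, not code from this thread):

```python
def largest_divisor_batch_size(n_samples, max_batch_size):
    """Return the largest batch size <= max_batch_size that divides n_samples evenly."""
    for candidate in range(max_batch_size, 0, -1):
        if n_samples % candidate == 0:
            return candidate
    return 1

# Example: 10000 samples with a preferred batch size of 32
print(largest_divisor_batch_size(10000, 32))  # -> 25, since 10000 % 32 != 0
```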
same here: “InternalError” on Windows and Ubuntu 18.04 on LSTM layers
Since I switched to Linux (Ubuntu 18.04) I have never had this problem again. Therefore, I highly recommend switching OS if you want GPU-accelerated LSTMs. It is 100% connected to Windows only, and I did not find any working solution (workaround or otherwise) for Windows.
Only solution I know is to switch to Linux. The problem then disappears.
I mean, this issue was closed once I stated that I had switched my OS, but the underlying problem still exists. 🤷‍♂️
I am not sure if this is actually the cause. I use Windows 10, TF 2.2. I had this problem as well. I did the recommended fixes, like setting batch_input_shape and letting GPU memory grow and so on, but they didn't work for me.
But when I turned off my antivirus (McAfee real-time scan), it has been working really well. It has been 3 days since I turned it off and I have not encountered this error again. It sounds weird and stupid, but I think it's worth a shot.
After following the solutions suggested like:
- batch_input_shape instead of input_shape
- drop_remainder=True when creating batches (a sketch of this batching setup follows below)

I'm faced with the following after the first successfully trained model:
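A minimal sketch of a drop_remainder=True batching setup, assuming a tf.data pipeline with placeholder data (not the commenter's actual code):

```python
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 249, 100).astype("float32")  # placeholder data
y = np.random.rand(1000, 1).astype("float32")
batch_size = 32

# drop_remainder=True discards the final incomplete batch, so every batch
# matches the fixed batch dimension expected by batch_input_shape
dataset = (tf.data.Dataset.from_tensor_slices((X, y))
           .shuffle(len(X))
           .batch(batch_size, drop_remainder=True))

# model.fit(dataset, epochs=10)
```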
I confirm that after updating the NVidia driver to version 461.40, the problem with LSTM layers on Windows 10 disappeared. My config is:
I share my experience with the same problem:
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Enterprise, Build 2004
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
- TensorFlow installed from (source or binary): Pypi
- TensorFlow version (use command below): v2.3.0-rc2-23-gb36436b087 2.3.0
- Python version: 3.7.6
- Bazel version (if compiling from source): -
- GCC/Compiler version (if compiling from source): -
- CUDA/cuDNN version: CUDA 10.1 (10.1.243), cudnn-10.1-windows10-x64-v7.6.5.32
- GPU model and memory: Quadro RTX 4000, 8 GB (Laptop version)
I got a similar error while testing different NVIDIA driver versions (ranging from 441.66 to 456.38) that were available on NVIDIA's site. None of these driver versions fixed the problem where the training crashes after the first epoch, in the middle of the second one, or somewhere in between.
1st workaround
One workaround that seems to work (I could get the training to finish) was following tips from https://github.com/tensorflow/tensorflow/issues/37942 where I had to specify a fixed batch_size on the first layer of the model:
This alone was not enough (the training still crashed randomly at some point, although it sometimes got a bit further into the training epochs). I also had to specify
and ensure that the input x given to the model in the model.fit() method is divisible by the batch_size (there must be no incomplete batch with fewer than batch_size samples at the end), again following https://github.com/tensorflow/tensorflow/issues/37942. After that, the training did not crash on a couple of run attempts.

However, this is only a workaround, and it is quite annoying because it requires adding code that is needed only because of the Windows-related cuDNN bug.
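A hedged sketch of the divisibility part of this workaround, assuming a NumPy array X (shapes and names are placeholders, not the poster's actual code):

```python
import numpy as np

batch_size = 32
X = np.random.rand(1010, 249, 100).astype("float32")  # 1010 is not a multiple of 32
y = np.random.rand(1010, 1).astype("float32")

# Trim the trailing samples so that len(X) % batch_size == 0
usable = (len(X) // batch_size) * batch_size
X, y = X[:usable], y[:usable]

assert len(X) % batch_size == 0
# model.fit(X, y, batch_size=batch_size, epochs=10)
```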
2nd workaround - downgrading driver to 431.86
Multiple issues here (https://github.com/tensorflow/tensorflow/issues/41863, https://github.com/tensorflow/tensorflow/issues/41444) and on the internet (https://forums.developer.nvidia.com/t/cudnn-lstm-is-broken-above-driver-431-60-unexpected-event-status-1-cuda/108800) related to this problem mention that it disappears if one rolls the NVIDIA driver back to version 431.86. This version is not officially supported on my GPU (Quadro RTX 4000 notebook version), and NVIDIA does not even offer this specific version for this GPU directly (the earliest available is 441.66). However, I still managed to install the unsupported version (found via some internet searches directly in NVIDIA's download repository), and the model training seems to work with this old, unsupported 431.86 driver version.
My other failed attempt - Tensorflow 2.3 compiled against CUDA 11 / cuDNN 8
I also tried installing Tensorflow 2.3 compiled for CUDA 11.0 / cuDNN 8.0.2 from an unofficial wheel from https://github.com/fo40225/tensorflow-windows-wheel. I had the specific CUDA 11 and cuDNN versions installed and this Tensorflow 2.3 build compiled against them. In addition, I again tried all available NVIDIA driver versions (ranging from 441.66 to 456.38), but I got the same error, so it seems that the problem cannot be solved by moving to a newer CUDA / cuDNN version.
I took daviddiazsolis’ advice and downgraded the driver to version 431.86. This was a 100% solution for me.
I have been struggling with this issue for a while and have tried most or all of the other suggestions made in this thread without success. After downgrading the driver there has not been a single “Failed to call ThenRnnBackward with model config”-error.
I had the same issue on Windows; in my case the error occurred only when running a Bidirectional LSTM on the GPU with a small batch size. As you've experienced, the error doesn't show up when running on CPU. I managed to find a temporary solution by running the code on Colab with GPU enabled. On Colab the error doesn't occur, so, as you said, it's an error related to the Windows OS.
I have read that some people believe that this is a memory issue. I doubt it is. I tried a CNN model with the same data, not even restricting the batch size and with 4 times the parameters, and the model ran extremely smoothly.
(There are many things I don't know about the tensorflow LSTM implementation, but I don't understand how, with a quarter of the parameters, a smaller batch size and the same data, the kernel dies. To me this clearly shows a problem with LSTM under these versions and Windows.)
Hi, no, I did not use the solution mentioned by jliebers as I use Windows 10 and do not want to switch my OS. Like I mentioned earlier in this thread, the problem goes away if you restart Spyder. If you are working in a Jupyter notebook, just restart Jupyter itself. It is a bit irritating, but this is the shortest workaround. However, I still wonder what caused this issue to come up all of a sudden after using TF for 3+ months.
Really helps a lot! Thanks.
Yes, allowing GPU memory growth is also necessary, i.e.
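A minimal sketch of enabling GPU memory growth in TF 2.x (my own snippet, run before the model is built):

```python
import tensorflow as tf

# Let GPU memory allocation grow on demand instead of reserving it all up front;
# this must be configured before any GPUs are initialized.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```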
And I agree with the bullet points you posted. They are required to make it work on my setup.