tensorflow: GPU-accelerated LSTMs crash randomly with: [ InternalError: [_Derived_] Failed to call ThenRnnBackward with model config ]

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Pro N, Build 17763
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): Pypi
  • TensorFlow version (use command below): v2.1.0-rc2-17-ge5bf8de410 2.1.0
  • Python version: 3.7.6
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: CUDA 10.1, cudnn-10.1-windows10-x64-v7.6.5.32
  • GPU model and memory: GTX 1060, 6 GB

Describe the current behavior

Dear TensorFlow developers,

My Jupyter notebook, which trains some LSTMs on the GPU, crashes after some time with the following traceback:

InternalError:  [_Derived_]  Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 100, 100, 1, 249, 32, 100] 
	 [[{{node gradients/CudnnRNN_grad/CudnnRNNBackprop}}]]
	 [[StatefulPartitionedCall_1]] [Op:__inference_distributed_function_7604]

Function call stack:
distributed_function -> distributed_function -> distributed_function

This crash happens after a random number of epochs (sometimes 6, sometimes 130+, sometimes 300+). It also occurs on different Windows machines with different GPUs.

Please see this minimal notebook, which reproduces the behaviour and also includes the whole stack trace: https://gist.github.com/jliebers/995c3c4da4ad2a6f9376d31ee2470ec5

In the stack trace I can find the following line: 130 # TODO(kaftan): File bug about tf function and errors.OutOfRangeError?

I wonder whether this is connected to the issue? 🙂

When training on the CPU, everything works well and is stable.

Thank you kindly in advance for your consideration and great work. 🚀

Describe the expected behavior

The GPU-accelerated LSTM should not crash randomly.

Standalone code to reproduce the issue

https://gist.github.com/jliebers/995c3c4da4ad2a6f9376d31ee2470ec5

Other info / logs

For the full traceback, please check the gist from above.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 9
  • Comments: 59 (9 by maintainers)

Most upvoted comments

Hi,

so the following workaround has been found thanks to @DietmarKracht. Now I am able to train LSTMs on my GPU without the error from the first post in this issue. 🎉 🎉 🎉

To train the LSTM model on a GPU on my platform (Windows 10, tf 2.1.0), the parameter batch_input_shape must be specified in the very first layer during model creation, and the parameter input_shape must be omitted. The first layer of the LSTM model should look like this:

model = Sequential()
model.add(LSTM(100, 
    batch_input_shape=(batch_size, n_timesteps, n_features), 
    return_sequences=True))  # omit return_sequences if no other LSTM-layer follows
[...]

Notice: I assume that if batch_input_shape is not specified, it defaults to some value and this issue arises randomly as a consequence. As it could not be reproduced in Colab, I guess that it is a platform-specific problem (see the first post for my specs).

Important: An int, batch_size, is specified in batch_input_shape=(batch_size, n_timesteps, n_features). This int must divide the length of X (X is passed to model.fit()) without any remainder, i.e. len(X) % batch_size == 0! Additionally, one must not use the validation_split parameter in model.fit() (see issue #37840).
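
To make the divisibility requirement concrete, here is a minimal, self-contained sketch of the idea (the random data and the Dense output layer are only placeholders for illustration, they are not part of the original notebook):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Placeholder data: 1000 samples, 249 timesteps, 100 features.
X = np.random.rand(1000, 249, 100).astype('float32')
y = np.random.rand(1000, 1).astype('float32')

batch_size = 32
n_timesteps, n_features = X.shape[1], X.shape[2]

# Drop trailing samples so that len(X) % batch_size == 0.
usable = (len(X) // batch_size) * batch_size
X, y = X[:usable], y[:usable]
assert len(X) % batch_size == 0

model = Sequential()
model.add(LSTM(100, batch_input_shape=(batch_size, n_timesteps, n_features)))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
model.fit(X, y, batch_size=batch_size, epochs=2)  # no validation_split (see #37840)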

Then it works without any issues with tf 2.1.0 on Windows 10 (finally! phew!). 🙂

Please find a minimal example (works for me on GPU and CPU) here: https://gist.github.com/jliebers/7effb38e836ab3c6e95bd122589f5f92

Sadly, it is nowhere mentioned in the documentation and it took us a week to solve this issue. I hope this post is helpful for people in the future.

Update:

Should the kernel die at any point with the following trace, lower your batch_size to a smaller divisor of len(x) (a quick way to list the valid divisors is sketched after the log):

2020-03-27 15:01:57.982960: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-03-27 15:01:57.983072: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
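
If you are unsure which smaller values are valid, a throwaway helper along these lines (not from the original post) lists the batch sizes that divide your sample count evenly:

def candidate_batch_sizes(n_samples, max_size=256):
    # All batch sizes up to max_size that divide n_samples without remainder.
    return [b for b in range(1, max_size + 1) if n_samples % b == 0]

print(candidate_batch_sizes(992))  # e.g. [1, 2, 4, 8, 16, 31, 32, 62, 124, 248]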

Same here: “InternalError” on Windows and Ubuntu 18.04 on LSTM layers.

Since I switched to Linux (Ubuntu 18.04) I have never had this problem again. Therefore, I highly recommend that anyone who wants GPU-accelerated LSTMs switch OS. It is 100% connected to Windows only, and I did not find any working solution (workaround or otherwise) for Windows.

Only solution I know is to switch to Linux. The problem then disappears.

I mean, this issue was closed once I stated that I had switched my OS, but the underlying problem still exists. 🤷‍♂️

I am not sure if this is actually the cause. I use Windows 10, TF 2.2. I had this problem as well. I applied the recommended fixes, like setting batch_input_shape and allowing GPU memory growth and so on, but it didn't work for me.

But since I turned off my antivirus (McAfee real-time scan), it has been working really well. It has been 3 days since I turned it off and I have not encountered this error again. It sounds weird and stupid, but I think it's worth a shot.

After following the solutions suggested like:

  • Allowing GPU Memory Growth
  • Using batch_input_shape instead of input_shape
  • Using drop_remainder=True when creating batches

I'm faced with the following after the first successfully trained model (a sketch of those three mitigations follows after the log):

2020-03-27 17:21:28.275596: F .\tensorflow/core/util/gpu_launch_config.h:129] Check failed: work_element_count > 0 (0 vs. 0)
[I 17:21:29.843 LabApp] KernelRestarter: restarting kernel (1/5), keep random ports
kernel 2bac517a-c195-47ca-952b-c25881cf0757 restarted
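
(For reference, a rough sketch of what those three suggestions look like in code; the data here is a placeholder, and as this comment shows, the combination is not guaranteed to fix the crash.)

import numpy as np
import tensorflow as tf

# 1. Allow GPU memory growth.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# 2./3. Batch with drop_remainder=True so every batch matches the fixed
#       batch_size used in batch_input_shape (see the model snippets earlier
#       in this thread).
batch_size = 32
X = np.random.rand(1000, 200, 64).astype('float32')  # placeholder data
y = np.random.rand(1000, 1).astype('float32')
dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(batch_size, drop_remainder=True)
# model.fit(dataset, epochs=...) then only ever sees complete batches.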

I confirm that after updating the NVIDIA driver to version 461.40, the problem with LSTM layers on Windows 10 disappeared. My config is:

  • Windows 10 Pro build 2004
  • GTX 970 + NVIDIA Game Ready Driver 461.40
  • CUDA 10.1
  • cuDNN 7.6.5
  • Python 3.7.7
  • tensorflow-gpu 2.3.0

I share my experience with the same problem:

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Enterprise, Build 2004
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): Pypi
  • TensorFlow version (use command below): v2.3.0-rc2-23-gb36436b087 2.3.0
  • Python version: 3.7.6
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: CUDA 10.1 (10.1.243), cudnn-10.1-windows10-x64-v7.6.5.32
  • GPU model and memory: Quadro RTX 4000, 8 GB (Laptop version)

I got a similar error while testing different NVIDIA driver versions (ranging from 441.66 to 456.38) that were available on NVIDIA's site. None of these driver versions fixed the problem, where the training crashes after the first epoch, in the middle of the second one, or somewhere in between.

1st workaround

One workaround that seems to work (I could get the training to finish) was following tips from https://github.com/tensorflow/tensorflow/issues/37942 where I had to specify a fixed batch_size on the first layer of the model:

x = Input(shape=(timesteps,input_dim), batch_size=64) # need to have fixed batch_size for cudnn Rnns to work on Windows
...

This alone was not enough (the training still crashed randomly at some point; however, it sometimes got a bit further into the training epochs). I also had to specify

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

and ensure that the length of the input x given to the model in model.fit() is divisible by batch_size (there must be no incomplete batch with fewer than batch_size samples at the end), again following https://github.com/tensorflow/tensorflow/issues/37942. After that, the training did not crash over a couple of run attempts.

However, this is only a workaround, and it is quite annoying since it requires adding code that is needed only because of the Windows-related cuDNN bug.

2nd workaround - downgrading driver to 431.86

Multiple issues here (https://github.com/tensorflow/tensorflow/issues/41863, https://github.com/tensorflow/tensorflow/issues/41444) and elsewhere on the internet (https://forums.developer.nvidia.com/t/cudnn-lstm-is-broken-above-driver-431-60-unexpected-event-status-1-cuda/108800) related to this problem mention that it disappears if one rolls back the NVIDIA driver to version 431.86. This version is not officially supported on my GPU (Quadro RTX 4000 notebook version), and NVIDIA does not even directly offer this specific version for this GPU (the earliest available is 441.66). However, I still managed to install the unsupported version (found via some internet searches directly in NVIDIA's download repository), and the model training seems to work with this old, unsupported 431.86 driver version.

My other failed attempt - Tensorflow 2.3 compiled against CUDA 11 / cuDNN 8

I also tested installing TensorFlow 2.3 compiled for CUDA 11.0 / cuDNN 8.0.2 from an unofficial wheel from https://github.com/fo40225/tensorflow-windows-wheel. I had the specific CUDA 11 and cuDNN versions installed and this TensorFlow 2.3 build compiled against them. In addition, I again tried all available NVIDIA driver versions (ranging from 441.66 to 456.38), but I got the same error, so it seems that the problem cannot be solved by moving to a newer CUDA / cuDNN version.

This is not a permanent solution, but I managed to make it work again by downgrading the NVIDIA driver to the last stable studio driver (431.86) as suggested here: https://forums.developer.nvidia.com/t/cudnn-lstm-is-broken-above-driver-431-60-unexpected-event-status-1-cuda/108800/2

You need to first download the corresponding Studio driver from NVIDIA, then uninstall whatever driver version you have now (in my case 442), then install 431.86. This is trickier than it sounds, as NVIDIA's utilities only allow you to downgrade to the previous version, and in my case I was several versions ahead.

I ended up using the DDU utility, as suggested in other forums, to wipe the previous driver from my machine; it did the job nicely (no safe mode was necessary).

Also, bear in mind that Windows will try to automatically update the driver as soon as it gets a chance (recreating the problem). To avoid this you can disable automatic updates for your drivers following these instructions.

By the way, it wasn't necessary to apply the previous fixes suggested in the post (setting batch_size or memory growth); just downgrading the driver did the trick.

I hope this helps, I wasted several hours trying to make it work!

I took daviddiazsolis’ advice and downgraded the driver to version 431.86. This was a 100% solution for me.

I have been struggling with this issue for a while and had tried most or all of the other suggestions made in this thread without success. After downgrading the driver there has not been a single "Failed to call ThenRnnBackward with model config" error.

I had the same issue on Windows; in my case the error occurred only when running a Bidirectional LSTM on GPU with a small batch size. As you've experienced, the error doesn't show up when running on CPU. I managed to find a temporary solution by running the code on Colab with GPU enabled. On Colab the error doesn't appear, so, as you said, it's an error related to the Windows OS.

I have read that some people believe this is a memory issue. I doubt it is. I tried a CNN model with the same data, without even restricting the batch size and with 4 times the parameters, and the model ran extremely smoothly.


(There are many things I don't know about the TensorFlow LSTM implementation, but I don't understand how the kernel dies with a quarter of the parameters, a smaller batch size, and the same data. To me this clearly shows a problem with the LSTM implementation under these versions and Windows.)

Hi, I have had this issue since yesterday. Before that, everything was working fine. I have not updated the TensorFlow version, the GPU driver, or anything else for that matter in the last couple of days; the issue suddenly appeared on its own. Below is my model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Bidirectional, LSTM,
                                     GlobalAveragePooling1D, Dropout, Dense)

def neural_network(vocab_size, embedding_dim, max_length, train_padded, train_labels, validation_frac, num_epochs):
    model = Sequential()
    model.add(Embedding(vocab_size, embedding_dim, input_length = max_length))
    model.add(Bidirectional(LSTM(64, return_sequences = True)))
    model.add(GlobalAveragePooling1D())
    model.add(Dropout(0.2))
    model.add(Dense(50, activation = 'relu'))
    model.add(Dropout(0.1))
    model.add(Dense(1, activation = 'sigmoid'))
    model.summary()
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    history = model.fit(train_padded, train_labels, epochs = num_epochs, verbose = 2, validation_split = validation_frac)
    return model, history

I am pretty sure I am not running out of memory, as I have previously trained even bigger models (~20M params), while the above model has just 2M parameters. Even the GPU usage barely exceeds 5%. I have a GTX 1050 Ti with 4 GB. Also, I have successfully run the above model plenty of times before, but this issue has only been coming up since yesterday.

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 200, 128)          1920000   
_________________________________________________________________
bidirectional (Bidirectional (None, 200, 128)          98816     
_________________________________________________________________
global_average_pooling1d (Gl (None, 128)               0         
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 50)                6450      
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 51        
=================================================================
Total params: 2,025,317
Trainable params: 2,025,317
Non-trainable params: 0

Below is the exact error.

Train on 143613 samples, validate on 15958 samples
Epoch 1/5
Traceback (most recent call last):

  File "C:\Users\admin\Documents\Machine Learning\Projects\Classification\jigsaw-toxic-comment-classification-challenge\toxic_classifier.py", line 122, in <module>
    model, history = neural_network(vocab_size, embedding_dim, max_length, train_padded, toxicity[col], validation_frac, num_epochs)

  File "C:\Users\admin\Documents\Machine Learning\Projects\Classification\jigsaw-toxic-comment-classification-challenge\toxic_classifier.py", line 87, in neural_network
    history = model.fit(train_padded, train_labels, epochs = num_epochs, verbose = 2, validation_split = validation_frac)

  File "C:\Users\admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)

  File "C:\Users\admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 342, in fit
    total_epochs=epochs)

  File "C:\Users\admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)

  File "C:\Users\admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))

  File "C:\Users\admin\Anaconda3\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)

  File "C:\Users\admin\Anaconda3\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 599, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable

  File "C:\Users\admin\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access

  File "C:\Users\admin\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py", line 1611, in _filtered_call
    self.captured_inputs)

  File "C:\Users\admin\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))

  File "C:\Users\admin\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
    ctx=ctx)

  File "C:\Users\admin\Anaconda3\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)

  File "<string>", line 3, in raise_from

InternalError:  [_Derived_]  Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 128, 64, 1, 200, 32, 64] 
	 [[{{node gradients/CudnnRNN_grad/CudnnRNNBackprop}}]]
	 [[StatefulPartitionedCall_1]]
	 [[Reshape_14/_46]] [Op:__inference_distributed_function_5894]

Function call stack:
distributed_function -> distributed_function -> distributed_function

Also, once this issue occurs, the kernel keeps crashing on its own even if I am not running anything. I am using the Spyder IDE and even restarting the kernel does not help; it simply crashes after a few seconds. Below is the log for that (it is shown on a red background):

An error ocurred while starting the kernel
2020 09:10:48.429373: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020 10:25:24.630422: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020 10:25:24.654232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:26:00.0 name: GeForce GTX 1050 Ti computeCapability: 6.1
coreClock: 1.392GHz coreCount: 6 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 104.43GiB/s
2020 10:25:24.655330: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020 10:25:24.660386: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020 10:25:24.665212: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020 10:25:24.667487: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020 10:25:24.672356: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020 10:25:24.675255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020 10:25:24.685248: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020 10:25:24.686442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020 10:25:24.687128: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2020 10:25:24.689769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:26:00.0 name: GeForce GTX 1050 Ti computeCapability: 6.1
coreClock: 1.392GHz coreCount: 6 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 104.43GiB/s
2020 10:25:24.690853: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020 10:25:24.691406: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020 10:25:24.691954: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020 10:25:24.692497: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020 10:25:24.693046: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020 10:25:24.693600: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020 10:25:24.694155: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020 10:25:24.695292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020 10:25:25.261369: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020 10:25:25.261990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0 
2020 10:25:25.262349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N 
2020 10:25:25.263472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2990 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:26:00.0, compute capability: 6.1)
2020 10:25:27.808120: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020 10:25:28.068884: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020 10:26:06.218534: E tensorflow/stream_executor/dnn.cc:596] CUDNN_STATUS_INTERNAL_ERROR
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1921): 'cudnnRNNBackwardData( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, output_desc.handles(), output_data.opaque(), output_desc.handles(), output_backprop_data.opaque(), output_h_desc.handle(), output_h_backprop_data.opaque(), output_c_desc.handle(), output_c_backprop_data.opaque(), rnn_desc.params_handle(), params.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), input_desc.handles(), input_backprop_data->opaque(), input_h_desc.handle(), input_h_backprop_data->opaque(), input_c_desc.handle(), input_c_backprop_data->opaque(), workspace.opaque(), workspace.size(), reserve_space_data->opaque(), reserve_space_data->size())'
2020 10:26:06.221888: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at cudnn_rnn_ops.cc:1922 : Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 128, 64, 1, 200, 32, 64] 
2020 10:26:06.223371: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 128, 64, 1, 200, 32, 64] 
[[{{node gradients/CudnnRNN_grad/CudnnRNNBackprop}}]]
2020 10:26:06.225063: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: {{function_node __inference___backward_cudnn_lstm_with_fallback_4410_4588_specialized_for_StatefulPartitionedCall_1_at___inference_distributed_function_5894}} {{function_node __inference___backward_cudnn_lstm_with_fallback_4410_4588_specialized_for_StatefulPartitionedCall_1_at___inference_distributed_function_5894}} Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 128, 64, 1, 200, 32, 64] 
[[{{node gradients/CudnnRNN_grad/CudnnRNNBackprop}}]]
[[StatefulPartitionedCall_1]]
[[Reshape_14/_46]]
2020 10:26:06.228126: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: {{function_node __inference___backward_cudnn_lstm_with_fallback_4410_4588_specialized_for_StatefulPartitionedCall_1_at___inference_distributed_function_5894}} {{function_node __inference___backward_cudnn_lstm_with_fallback_4410_4588_specialized_for_StatefulPartitionedCall_1_at___inference_distributed_function_5894}} Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 128, 64, 1, 200, 32, 64] 
[[{{node gradients/CudnnRNN_grad/CudnnRNNBackprop}}]]
[[StatefulPartitionedCall_1]]
2020 10:35:24.183417: F .\tensorflow/core/kernels/random_op_gpu.h:232] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: unspecified launch failure

To get rid of the recurrent kernel crashes I have to restart Spyder every time. None of this had ever occurred before yesterday, and I can say for sure that I have not updated anything in at least the last month. My TF version is 2.1 and the GPU driver version is 441.22.

Hey, have you solved this problem with the solution mentioned above by jliebers? I have run into the same problem as you.

Hi, no, I did not use the solution mentioned by jliebers, as I use Windows 10 and do not want to switch my OS. Like I mentioned earlier in this thread, the problem goes away if you restart Spyder. If you are working in a Jupyter notebook, just restart Jupyter itself. It is a bit irritating, but this is the shortest workaround. However, I still wonder what caused this issue to come up all of a sudden after using TF for 3+ months.

Hi,

so the following workaround has been found thanks to @DietmarKracht. Now I am able to train LSTMs on my GPU without the error from the first post in this issue. 🎉 🎉 🎉

To train the LSTM model on a GPU on my platform (Windows 10, tf 2.1.0), the parameter batch_input_shape must be specified in the very first layer during model creation, and the parameter input_shape must be omitted. The first layer of the LSTM model should look like this:

model = Sequential()
model.add(LSTM(100, 
    batch_input_shape=(batch_size, n_timesteps, n_features), 
    return_sequences=True))  # omit return_sequences if no other LSTM-layer follows
[...]

Notice: I assume that if batch_input_shape is not specified, it defaults to some value and this issue arises randomly as a consequence. As it could not be reproduced in Colab, I guess that it is a platform-specific problem (see the first post for my specs).

Important: An int, batch_size, is specified in batch_input_shape=(batch_size, n_timesteps, n_features). This int must divide the length of X (X is passed to model.fit()) without any remainder, i.e. len(X) % batch_size == 0! Additionally, one must not use the validation_split parameter in model.fit() (see issue #37840).

Then it works without any issues with tf 2.1.0 on Windows 10 (finally! phew!). 🙂

Please find a minimal example (works for me on GPU and CPU) here: https://gist.github.com/jliebers/7effb38e836ab3c6e95bd122589f5f92

Sadly, it is nowhere mentioned in the documentation and it took us a week to solve this issue. I hope this post is helpful for people in the future.

Update:

Should the kernel die at any point with the following trace, then lower your batch_size to a smaller divisor of len(x):

2020-03-27 15:01:57.982960: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-03-27 15:01:57.983072: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1

Really helps a lot! Thanks.

Yes, allowing GPU memory growth is also necessary, i.e.

import tensorflow as tf

gpu_devices = tf.config.experimental.list_physical_devices('GPU')
for device in gpu_devices:
    tf.config.experimental.set_memory_growth(device, True)

And I agree with the bullet points you posted. They are required to make it work on my setup.