tensorflow: Masking LSTM: OP_REQUIRES failed at cudnn_rnn_ops.cc:1498 : Unknown: CUDNN_STATUS_BAD_PARAM

System information

  • Have I written custom code: Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04.2 LTS
  • TensorFlow installed from (source or binary): Binary, pip
  • TensorFlow version (use command below): v2.0.0-rc2-26-g64c3d38 2.0.0
  • Python version: Python 3.7.3
  • CUDA/cuDNN version: CUDA=10.0, CUDNN=7.6.2.24-1
  • GPU model and memory: Quadro RTX 6000 major: 7 minor: 5 memoryClockRate(GHz): 1.77

Describe the problem

There seems to be an issue with the cuDNN LSTM implementation when using a tf.keras.layers.Masking layer:

import tensorflow as tf

batch_size = 256
num_tsteps = 144
num_features = 130
num_units = 88

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(num_tsteps, num_features), batch_size=batch_size),
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(num_tsteps, num_features)),
    tf.keras.layers.LSTM(num_units, batch_input_shape=(batch_size, num_tsteps, num_features), return_sequences=True, stateful=False),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
    tf.keras.layers.Activation('sigmoid'),
])
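
For context, the model is compiled and trained on tf.data pipelines along these lines (a rough sketch; the optimizer and loss shown here are placeholders, the actual fit() call appears in the traceback below):

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# ds_train / ds_val: tf.data datasets yielding (features, labels) batches that match
# the input shape declared above
model.fit(ds_train, epochs=1000, validation_data=ds_val, shuffle=False)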

Similar to #33069, I receive this error during training even though my data is strictly right-padded (I do the trimming and right-padding manually). In contrast to that issue, however, I confirmed that none of my inputs consists only of zeros, using the following snippet:

for i, e in enumerate(ds_train):
    f, l = [x.numpy() for x in e]   # features, labels
    res = []
    for j in range(f.shape[0]):
        # 1 if row j of the features contains at least one nonzero value, 0 if it is all zeros
        res.append(0 if (f[j] == 0.0).all() else 1)
    # collapse consecutive duplicates: strictly right-padded data should yield [1] or [1, 0]
    fin = [res[0]]
    for r in res[1:]:
        if r != fin[-1]:
            fin.append(r)
    print("i {}: {}".format(i, fin))

# Result:
i 0: [1, 0]
i 1: [1, 0]
i 2: [1, 0]
i 3: [1, 0]
i 4: [1]
i 5: [1, 0]
...
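
As an equivalent vectorized check (a sketch, assuming each element of ds_train here is a single (features, labels) example with features of shape (num_tsteps, num_features), which is what the printed [1, 0] patterns suggest):

import numpy as np

for i, (f, l) in enumerate(ds_train):
    f = f.numpy()
    nonzero_steps = (f != 0.0).any(axis=-1)      # True where a timestep has any nonzero feature
    length = int(nonzero_steps.sum())            # number of unpadded timesteps
    assert length > 0, "example {} is entirely zero".format(i)
    expected = np.arange(f.shape[0]) < length    # ones followed by zeros
    assert (nonzero_steps == expected).all(), "example {} is not right-padded".format(i)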

If I remove the Masking layer, the error does not occur; I confirmed this by running a complete epoch (2324 batches). However, training is probably fairly pointless when the padded data is included (the Masking-free variant is sketched below).
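
For reference, this is the Masking-free variant that completes the epoch (a sketch of the model above with only the Masking layer removed):

model_no_mask = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(num_tsteps, num_features), batch_size=batch_size),
    # no Masking layer: the cuDNN kernel runs without CUDNN_STATUS_BAD_PARAM,
    # but the padded timesteps now contribute to the loss
    tf.keras.layers.LSTM(num_units, return_sequences=True, stateful=False),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
    tf.keras.layers.Activation('sigmoid'),
])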

Is there any other pitfall that I am missing that could cause this issue?

Source code / logs

Python output:

Epoch 1/1000
WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: 


CancelledErrorTraceback (most recent call last)
<ipython-input-7-1c503c2dd55c> in <module>
----> 1 m.fit(train=True)

/ws/tf/vol_local/_model_lstm.py in fit(self, train, verbose)
    315             ]
    316             self.model.fit(ds_train, epochs=num_epochs, verbose=verbose, shuffle=False,
--> 317                                 validation_data=ds_val, validation_steps=None, callbacks=cbs)
    318             #self.model.save(sess_hdf5_path)
    319             self.model.save_weights(self.sess_h5_path.as_posix())

/ws/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
    726         max_queue_size=max_queue_size,
    727         workers=workers,
--> 728         use_multiprocessing=use_multiprocessing)
    729 
    730   def evaluate(self,

/ws/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, **kwargs)
    322                 mode=ModeKeys.TRAIN,
    323                 training_context=training_context,
--> 324                 total_epochs=epochs)
    325             cbks.make_logs(model, epoch_logs, training_result, ModeKeys.TRAIN)
    326 

/ws/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py in run_one_epoch(model, iterator, execution_function, dataset_size, batch_size, strategy, steps_per_epoch, num_samples, mode, training_context, total_epochs)
    121         step=step, mode=mode, size=current_batch_size) as batch_logs:
    122       try:
--> 123         batch_outs = execution_function(iterator)
    124       except (StopIteration, errors.OutOfRangeError):
    125         # TODO(kaftan): File bug about tf function and errors.OutOfRangeError?

/ws/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py in execution_function(input_fn)
     84     # `numpy` translates Tensors to values in Eager mode.
     85     return nest.map_structure(_non_none_constant_value,
---> 86                               distributed_function(input_fn))
     87 
     88   return execution_function

/ws/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py in __call__(self, *args, **kwds)
    455 
    456     tracing_count = self._get_tracing_count()
--> 457     result = self._call(*args, **kwds)
    458     if tracing_count == self._get_tracing_count():
    459       self._call_counter.called_without_tracing()

/ws/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py in _call(self, *args, **kwds)
    518         # Lifting succeeded, so variables are initialized and we can run the
    519         # stateless function.
--> 520         return self._stateless_fn(*args, **kwds)
    521     else:
    522       canon_args, canon_kwds = \

/ws/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py in __call__(self, *args, **kwargs)
   1821     """Calls a graph function specialized to the inputs."""
   1822     graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 1823     return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
   1824 
   1825   @property

/ws/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py in _filtered_call(self, args, kwargs)
   1139          if isinstance(t, (ops.Tensor,
   1140                            resource_variable_ops.BaseResourceVariable))),
-> 1141         self.captured_inputs)
   1142 
   1143   def _call_flat(self, args, captured_inputs, cancellation_manager=None):

/ws/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1222     if executing_eagerly:
   1223       flat_outputs = forward_function.call(
-> 1224           ctx, args, cancellation_manager=cancellation_manager)
   1225     else:
   1226       gradient_name = self._delayed_rewrite_functions.register()

/ws/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py in call(self, ctx, args, cancellation_manager)
    509               inputs=args,
    510               attrs=("executor_type", executor_type, "config_proto", config),
--> 511               ctx=ctx)
    512         else:
    513           outputs = execute.execute_with_cancellation(

/ws/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     65     else:
     66       message = e.message
---> 67     six.raise_from(core._status_to_exception(e.code, message), None)
     68   except TypeError as e:
     69     keras_symbolic_tensors = [

/ws/miniconda3/lib/python3.7/site-packages/six.py in raise_from(value, from_value)

CancelledError:  [_Derived_]RecvAsync is cancelled.
	 [[{{node metrics/accuracy/broadcast_weights/assert_broadcastable/AssertGuard/else/_36/Assert/data_2/_62}}]]
	 [[loss/activation_loss/weighted_loss/broadcast_weights/assert_broadcastable/is_valid_shape/else/_1/has_valid_nonscalar_shape/then/_106/has_invalid_dims/concat/_28]] [Op:__inference_distributed_function_172102]

Function call stack:
distributed_function

Command line log:

2019-10-08 14:38:27.367875: W tensorflow/core/grappler/optimizers/implementation_selector.cc:310] Skipping optimization due to error while loading function libraries: Invalid argument: Functions '__inference___backward_cudnn_lstm_with_fallback_169668_171093' and '__inference___backward_cudnn_lstm_with_fallback_169668_171093_specialized_for_StatefulPartitionedCall_at___inference_distributed_function_172102' both implement 'lstm_dce676f4-acdd-4bb5-88d9-e8dd57573aba' but their signatures do not match.
2019-10-08 14:38:27.536666: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-10-08 14:38:39.982582: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-10-08 14:38:41.215567: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at cudnn_rnn_ops.cc:1498 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1424): 'cudnnSetRNNDataDescriptor( data_desc.get(), data_type, layout, max_seq_length, batch_size, data_size, seq_lengths_array, (void*)&padding_fill)'
2019-10-08 14:38:41.215616: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1424): 'cudnnSetRNNDataDescriptor( data_desc.get(), data_type, layout, max_seq_length, batch_size, data_size, seq_lengths_array, (void*)&padding_fill)'
	 [[{{node cond_64/then/_0/CudnnRNNV3}}]]
2019-10-08 14:38:41.215638: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Cancelled: [_Derived_]RecvAsync is cancelled.
	 [[{{node metrics/accuracy/broadcast_weights/assert_broadcastable/AssertGuard/else/_36/Assert/data_2/_62}}]]
	 [[loss/activation_loss/weighted_loss/broadcast_weights/assert_broadcastable/is_valid_shape/else/_1/has_valid_nonscalar_shape/then/_106/has_invalid_dims/concat/_28]]
2019-10-08 14:38:41.215693: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Cancelled: [_Derived_]RecvAsync is cancelled.
	 [[{{node metrics/accuracy/broadcast_weights/assert_broadcastable/AssertGuard/else/_36/Assert/data_2/_62}}]]

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 71 (24 by maintainers)


Most upvoted comments

import tensorflow as tf
tf.compat.v1.disable_eager_execution()

Disable eager execution and everything runs fine without the fused RNN kernel. Thanks for the help, guys 😃

I changed my model from:

        self.model = Sequential([
            Embedding(len(self.item_map), self.embed_dim, input_length=X.shape[1], mask_zero=True),
            LSTM(self.lstm_out),
            Dense(len(self.item_map)-1),
        ])

to:

        self.model = Sequential([
            Embedding(len(self.item_map), self.embed_dim, input_length=X.shape[1]),
            Masking(mask_value=0),
            LSTM(self.lstm_out),
            Dense(len(self.item_map)-1),
        ])

And that solved my issue.

I know @mimxrt’s code has the same model and I don’t know why it works for me, but I’m adding this for anyone else who comes here with this issue; maybe it can help with debugging.


I think Embedding(mask_zero=True) creates this problem. I found two ways to solve it: 1) set mask_zero=False, but that changes the code path; 2) your way.
Thanks a lot.

I have a workaround that seems to work: force TF to use the non-cuDNN implementation by selecting a sigmoid activation instead of tanh:

layers.LSTM(..., activation='sigmoid')

Outputs

WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn’t meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU

This forces TF to use a generic GPU kernel in place of cuDNN. It’s slower, but a slow implementation is a lot faster than one that doesn’t work at all ;p
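
Applied to the model from the original report, the workaround would look roughly like this (a sketch; note that activation='sigmoid' also changes the cell's output activation, it is not just a kernel switch):

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(num_tsteps, num_features), batch_size=batch_size),
    tf.keras.layers.Masking(mask_value=0.0),
    # activation != 'tanh' fails the cuDNN eligibility check, so the generic
    # (slower) GPU kernel is used instead of the fused cuDNN implementation
    tf.keras.layers.LSTM(num_units, activation='sigmoid', return_sequences=True, stateful=False),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
    tf.keras.layers.Activation('sigmoid'),
])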

@houtoms, Thanks for providing the context.

From the high-level API’s perspective, I would expect the kernel to just return zeros for any sequence that is fully masked, rather than asking the user to remove those values from the batch. It would be quite complicated to ask users to handle this on the Python side.

@houtoms, would it be complicated to add this support (fully masked sequences) to the cuDNN kernel?

When will this release happen? It looks like the LSTM implementation runs very slowly without cuDNN.

cuDNN 8.0.5 is already out: https://docs.nvidia.com/deeplearning/cudnn/archives/index.html

I tried cudnn=8.0.4, cudatoolkit=11.0.221, and tensorflow-gpu=2.4.0, and it fixed my problem. cuDNN can be installed with: conda install -c nvidia cudnn

I am still facing this issue with TF 2.2.0. I also found the workaround of forcing the LSTM not to use the cuDNN implementation to work; however, it is nearly prohibitively slow. The generic GPU implementation took ~30 times longer to train per epoch than the cuDNN version. I hope this can be fixed soon.

Yes, your understanding is correct. And cuDNN doesn’t support zero-length (fully masked) sequences in a batch for now.

I am not sure whether or how it is possible to change the batch size during training with tf.data (@qlzh727, any tf.data experts?). A batch size of 1 could work, but it might significantly affect performance.

Or you could try to change the way you split your sequence into batches, e.g. using seq_len/batch_size as ‘num_tsteps’ rather than the fixed 144. But you still need to make sure that the minimum seq_len >= batch_size.
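
If the data can be filtered before batching, one way to handle the fully-masked-sample restriction on the Python side is to drop such samples from the tf.data pipeline (a sketch; ds_train_unbatched is a hypothetical unbatched dataset of (features, labels) pairs):

def has_unmasked_steps(features, labels):
    # keep a sample only if at least one value differs from the mask value 0.0
    return tf.reduce_any(tf.not_equal(features, 0.0))

ds_train = (ds_train_unbatched
            .filter(has_unmasked_steps)
            .batch(batch_size, drop_remainder=True))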

@houtoms As it seems the fix is taking more time than expected, I wanted to go ahead and try your suggestion of using a dynamic number of timesteps (instead of the fixed 144). Unfortunately, I get an error when doing this in TF 2.1.0, calling model.fit() with an input of <PrefetchDataset shapes: ((128, None, 128), (128, None, 1)), types: (tf.float32, tf.float32)>:

InvalidArgumentError: ValueError: Attempt to convert a value (<BatchDataset shapes: ((128, None, 128), (128, None, 1)), types: (tf.float32, tf.float32)>) with an unsupported type (<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>) to a Tensor.

Could it be that I misunderstood your suggestion (note the None in the input dimensions)? All ideas are much appreciated.
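
In case it helps, the usual way to feed a variable number of timesteps to model.fit() is to leave the time dimension as None in the model and let tf.data pad each batch, roughly like this (a sketch; ds_unbatched and its element shapes are assumptions):

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(None, num_features)),   # dynamic number of timesteps
    tf.keras.layers.Masking(mask_value=0.0),
    tf.keras.layers.LSTM(num_units, return_sequences=True),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
    tf.keras.layers.Activation('sigmoid'),
])

# ds_unbatched (hypothetical) yields (features, labels) pairs of shape
# ([seq_len, num_features], [seq_len, 1]); padded_batch right-pads with zeros
ds_train = ds_unbatched.padded_batch(
    batch_size,
    padded_shapes=([None, num_features], [None, 1]),
    padding_values=(0.0, 0.0),
    drop_remainder=True)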

I can reproduce that mask_zero=True is causing the crash. It doesn’t matter whether eager execution is on or off, or whether callbacks are used.

LSTM on a masked sequence is extremely common in NLP models, so this is a major bug in terms of impact.
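
For anyone trying to reproduce this, the kind of setup being discussed looks roughly like the sketch below (vocabulary size, dimensions, and the random data are made up; it is not a guaranteed reproducer, since the crash appears to depend on the batch contents):

import numpy as np
import tensorflow as tf

vocab_size, embed_dim, lstm_units = 1000, 32, 64    # arbitrary sizes for illustration

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True),  # mask where the input id is 0
    tf.keras.layers.LSTM(lstm_units),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# right-padded integer sequences; token id 0 is reserved for padding
x = np.random.randint(1, vocab_size, size=(256, 50))
x[:, 30:] = 0                                       # pad the tail of every sequence
y = np.random.randint(0, 2, size=(256, 1)).astype('float32')
model.fit(x, y, batch_size=64, epochs=1)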