tensorflow: CancelledError: [_Derived_]RecvAsync is cancelled. [[{{node Reshape_17/_52}}]] [[GroupCrossDeviceControlEdges_0/RMSprop/RMSprop/Const/_57]] [Op:__inference_distributed_function_24912]
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Home
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NaN
- TensorFlow installed from (source or binary): pip
- TensorFlow version (use command below): 2.0.0
- Python version: 3.7.4
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: v10.1
- GPU model and memory: GTX 1060 6GB
Describe the current behavior
While fitting the data, training fails with a CancelledError on the very first batch.
Describe the expected behavior
The model should fit without error.
Code to reproduce the issue
# Imports assumed to come from tf.keras (the traceback shows tensorflow_core.python.keras).
# x_voc_size, y_voc_size, X_train, Y_train, X_test and Y_test are defined earlier in the notebook.
# AttentionLayer is a custom attention layer (e.g. a Bahdanau-style layer) defined elsewhere in the project.
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Embedding, LSTM, Concatenate, TimeDistributed, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping

max_len_text = 275
max_len_summary = 28
K.clear_session()
latent_dim = 500

# Encoder: embedding followed by three stacked LSTMs
encoder_inputs = Input(shape=(max_len_text,))
enc_emb = Embedding(x_voc_size, latent_dim, trainable=True)(encoder_inputs)
encoder_lstm1 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)
encoder_lstm2 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)
encoder_lstm3 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm3(encoder_output2)

# Decoder: embedding + LSTM initialised with the final encoder states
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(y_voc_size, latent_dim, trainable=True)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

# Attention over the encoder outputs, concatenated with the decoder outputs
attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])
decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
decoder_outputs = decoder_dense(decoder_concat_input)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

# Teacher forcing: the decoder input is Y[:, :-1], the target is Y[:, 1:]
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)
Model_training = model.fit(
    [X_train, Y_train[:, :-1]],
    Y_train.reshape(Y_train.shape[0], Y_train.shape[1], 1)[:, 1:],
    epochs=50, callbacks=[es], batch_size=256,
    validation_data=([X_test, Y_test[:, :-1]],
                     Y_test.reshape(Y_test.shape[0], Y_test.shape[1], 1)[:, 1:]))
Other info / logs
Train on 314860 samples, validate on 78716 samples
Epoch 1/50
  256/314860 [...] - ETA: 5:22
WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are:
CancelledError                            Traceback (most recent call last)
<ipython-input-30-8fb3a6c938b7> in <module>
      1 Model_training=model.fit([X_train,Y_train[:,:-1]], Y_train.reshape(Y_train.shape[0],Y_train.shape[1], 1)[:,1:]
      2                          ,epochs=50,callbacks=[es],batch_size=256, validation_data=([X_test,Y_test[:,:-1]],
----> 3                          Y_test.reshape(Y_test.shape[0],Y_test.shape[1], 1)[:,1:]))

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
    726         max_queue_size=max_queue_size,
    727         workers=workers,
--> 728         use_multiprocessing=use_multiprocessing)
    729
    730   def evaluate(self,

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, **kwargs)
    322               mode=ModeKeys.TRAIN,
    323               training_context=training_context,
--> 324               total_epochs=epochs)
    325           cbks.make_logs(model, epoch_logs, training_result, ModeKeys.TRAIN)
    326

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py in run_one_epoch(model, iterator, execution_function, dataset_size, batch_size, strategy, steps_per_epoch, num_samples, mode, training_context, total_epochs)
    121           step=step, mode=mode, size=current_batch_size) as batch_logs:
    122         try:
--> 123           batch_outs = execution_function(iterator)
    124         except (StopIteration, errors.OutOfRangeError):
    125           # TODO(kaftan): File bug about tf function and errors.OutOfRangeError?

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py in execution_function(input_fn)
     84       # `numpy` translates Tensors to values in Eager mode.
     85       return nest.map_structure(_non_none_constant_value,
---> 86                                 distributed_function(input_fn))
     87
     88     return execution_function

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\def_function.py in __call__(self, *args, **kwds)
    455
    456     tracing_count = self._get_tracing_count()
--> 457     result = self._call(*args, **kwds)
    458     if tracing_count == self._get_tracing_count():
    459       self._call_counter.called_without_tracing()

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\def_function.py in _call(self, *args, **kwds)
    485       # In this case we have created variables on the first call, so we run the
    486       # defunned version which is guaranteed to never create variables.
--> 487       return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
    488     elif self._stateful_fn is not None:
    489       # Release the lock early so that multiple threads can perform the call

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py in __call__(self, *args, **kwargs)
   1821     """Calls a graph function specialized to the inputs."""
   1822     graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 1823     return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
   1824
   1825   @property

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py in _filtered_call(self, args, kwargs)
   1139         if isinstance(t, (ops.Tensor,
   1140                           resource_variable_ops.BaseResourceVariable))),
-> 1141         self.captured_inputs)
   1142
   1143   def _call_flat(self, args, captured_inputs, cancellation_manager=None):

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1222     if executing_eagerly:
   1223       flat_outputs = forward_function.call(
-> 1224           ctx, args, cancellation_manager=cancellation_manager)
   1225     else:
   1226       gradient_name = self._delayed_rewrite_functions.register()

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py in call(self, ctx, args, cancellation_manager)
    509               inputs=args,
    510               attrs=("executor_type", executor_type, "config_proto", config),
--> 511               ctx=ctx)
    512     else:
    513       outputs = execute.execute_with_cancellation(

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     65     else:
     66       message = e.message
---> 67     six.raise_from(core._status_to_exception(e.code, message), None)
     68   except TypeError as e:
     69     keras_symbolic_tensors = [

C:\ProgramData\Anaconda3\lib\site-packages\six.py in raise_from(value, from_value)

CancelledError: [_Derived_]RecvAsync is cancelled.
	 [[{{node Reshape_17/_52}}]]
	 [[GroupCrossDeviceControlEdges_0/RMSprop/RMSprop/Const/_57]] [Op:__inference_distributed_function_24912]
Function call stack: distributed_function
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 42
Closing as per @AniTho's comment. Thank you.
I don't quite understand. How is changing the TF version a solution? An explanation of why this error is reproducible on TF 2.0 would be much appreciated, as well as clarification of whether this error is a TF bug or a configuration issue.
Bringing this back to life: I am getting the same error with both CUDA 11.1 and 10.1 on TF 2.3.1 when using GRU. I am running Windows 10. The suggestions above do not work for me.
The text of my error is slightly different from the ones above:
I have experienced the exact same error message in TF 2.0.0.
The problem can be reproduced with one of the tutorials on the TensorFlow website: https://www.tensorflow.org/tutorials/text/text_classification_rnn
The problem happens right after training starts (in the 1st epoch):
CancelledError: [_Derived_]RecvAsync is cancelled. [[{{node Adam/Adam/update/AssignSubVariableOp/_41}}]] [[Reshape_11/_38]] [Op:__inference_distributed_function_6315]
Function call stack: distributed_function
The problem seems to be related to the GPU: if I execute TensorFlow with CPU only, it does not crash. I use tensorflow-rocm (with a Vega 56 card), but it's probably not a coincidence that I get the exact same error message as the one mentioned above.
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Elementary OS
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NaN
- TensorFlow installed from (source or binary): pip
- TensorFlow version (use command below): 2.0.0
- Python version: 3.7.5
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: ROCM 2.9
- GPU model and memory: Vega 56 8Gb
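For reference, one way to reproduce the CPU-only run described above in TF 2.x is to hide the GPUs before building the model. This is a sketch of my own, not part of the original report:

# Hide all GPU/ROCm devices so TensorFlow falls back to the CPU.
# Must run before any model is built or any op touches the GPU.
import tensorflow as tf

tf.config.experimental.set_visible_devices([], 'GPU')
print(tf.config.experimental.get_visible_devices('GPU'))  # should print an empty list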
Same issue here with an LSTM on GPU; it appears to be solved by:
EDIT: Spoke too soon; somehow this works on one machine but not another?
@oanush why was this issue closed?
I am receiving a similar error.
CancelledError: [_Derived_]RecvAsync is cancelled. [[{{node Adam/Adam/update/AssignSubVariableOp/_33}}]] [[gradient_tape/sequential_1/embedding_1/embedding_lookup/Reshape/_30]] [Op:__inference_train_function_11493]
I am using TF 2.3.1 and CUDA Toolkit 11.1.
I think I have fixed the issue. The root of this was `bucket_by_sequence_length` combined with me setting `drop_remainder=False`. What seems to happen here is that some batches do not have enough samples, i.e. there weren't enough examples for all cards. Since setting `drop_remainder=True` I haven't gotten this error anymore, so make sure that you are not running into the same issue.
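For anyone hitting the same thing, here is a minimal sketch of bucketing with drop_remainder=True. The data and bucket sizes are made up for illustration, and it assumes a TF 2.x release where tf.data.experimental.bucket_by_sequence_length accepts a drop_remainder argument:

# Dummy variable-length integer sequences standing in for real tokenised text.
import tensorflow as tf

sequences = [list(range(n)) for n in (5, 12, 30, 7, 45, 3)]
ds = tf.data.Dataset.from_generator(lambda: iter(sequences),
                                    output_types=tf.int32,
                                    output_shapes=tf.TensorShape([None]))

ds = ds.apply(tf.data.experimental.bucket_by_sequence_length(
    element_length_func=lambda seq: tf.shape(seq)[0],
    bucket_boundaries=[10, 20, 40],
    bucket_batch_sizes=[2, 2, 2, 2],   # one batch size per bucket (len(boundaries) + 1)
    drop_remainder=True))              # drop partial batches so every replica gets a full one

for batch in ds:                       # each batch is padded to its longest sequence
    print(batch.shape)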
Also encountering the same error. I'm using relatively simple Keras code as follows:
And the stack is:
I’m running in Colab, so everything should be up to date. Attempting to use
results in
Just returned to TensorFlow after a month-long hiatus, but I've never seen this before. Not sure why this is a closed issue, since it's clearly been around for a year or so. Interestingly, when I use word-level instead of character-level encodings and use a smaller model (1M parameters instead of the current 20M), I have no issues.
EDIT: I tried using a TPU instead of a GPU to circumvent this problem, and the session crashed after using all available memory. Seems to be more related to the size of the model than to the GPU specifically?
EDIT 2: Slashed the network size from 20M to 3M parameters, reduced the embedding dimensionality, and cut the batch size. Runs on the GPU perfectly fine, but it's very slow: an hour for a single epoch. Further hyperparameter tweaking reduces it to 15 minutes per epoch. Definitely seems to be tied to network size and memory issues.
Also experienced this, likewise with LSTM models. Training runs for a while, e.g. ~100 batches, then this error comes up.
The issue is closed because, well, not sure.
While I can't really help without specifics, immediate recommendations are:
- Reduce model size
- Reduce dataset size
- Use smaller batches
- Batch items into a tf.data.Dataset object (see the sketch below)
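On the last point, here is a minimal sketch of feeding fit() from a tf.data pipeline instead of raw arrays; the array names and shapes are placeholders:

# Placeholder arrays standing in for the real training data.
import numpy as np
import tensorflow as tf

features = np.random.rand(1000, 275).astype("float32")
labels = np.random.randint(0, 10, size=(1000,)).astype("int32")

train_ds = (tf.data.Dataset.from_tensor_slices((features, labels))
            .shuffle(buffer_size=1000)
            .batch(64, drop_remainder=True)        # smaller, fixed-size batches
            .prefetch(tf.data.experimental.AUTOTUNE))

# model.fit(train_ds, epochs=10)  # the batch size is carried by the dataset itself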
Interestingly, I finished training a large 23M-parameter GRU on the same dataset as before without encountering issues. Try unrolling the RNN layers and batching items in buckets of 64 (a sketch of the unroll flag follows).
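For what it's worth, unrolling is just a constructor flag on the Keras RNN layers; a one-liner sketch with an illustrative layer size:

import tensorflow as tf

# unroll=True statically unrolls the recurrence: it needs a fixed (non-None) sequence
# length, uses more memory, and bypasses the fused cuDNN kernel.
gru = tf.keras.layers.GRU(256, return_sequences=True, unroll=True)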
Same issue! TF version 2.3.0. Training seems to stop midway!
I had the same issue with the latest TF 2.0.0 / cuDNN. I see lots of hits on this issue when searching.
I think this issue should be reopened, @oanush, as TF 2.0 has problems with a toy / example solution. Migrating to a previous version isn't a fix.
The relevant code seems to be around here: https://github.com/tensorflow/tensorflow/blob/81f844c1ff2bee0c3a98a7fff7b308ad77d85309/tensorflow/core/framework/rendezvous.h
I am testing on a smallish 6GB GeForce GTX 1660 Ti, so perhaps it's just running out of memory and giving a bad error? It might be an Nvidia driver issue rather than TensorFlow's interface.
Adding validation_split seems to cause kernel shutdowns and the error above in Jupyter.
Setting
in the system environment (Windows) and restarting the shell / restarting Jupyter worked.
Perhaps this needs to be set by default, or the parameter documented / exposed as part of the TensorFlow API?
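The variable name above did not survive, but assuming it was the usual GPU memory-growth setting (e.g. TF_FORCE_GPU_ALLOW_GROWTH=true, which is an assumption on my part), the in-API equivalent would look like this:

# Assumption: the environment variable in question enables GPU memory growth.
# Equivalent tf.config call; must run before any op has touched the GPU.
import tensorflow as tf

for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)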