tensorflow: CancelledError: [_Derived_]RecvAsync is cancelled. [[{{node Reshape_17/_52}}]] [[GroupCrossDeviceControlEdges_0/RMSprop/RMSprop/Const/_57]] [Op:__inference_distributed_function_24912]


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Home
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NaN
  • TensorFlow installed from (source or binary): pip
  • TensorFlow version (use command below): 2.0.0
  • Python version: 3.7.4
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: v10.1
  • GPU model and memory: GTX 1060 6GB


Describe the current behavior
During fitting of the data, a CancelledError is raised in the very first batch.

Describe the expected behavior
To fit the model without error.

Code to reproduce the issue

# The snippet was posted without imports; the layer/model imports added below are assumed
# to come from tf.keras (as the traceback suggests). AttentionLayer is a custom attention
# layer, not part of Keras.
from keras import backend as K
from tensorflow.keras.layers import Input, Embedding, LSTM, Concatenate, TimeDistributed, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping

max_len_text = 275
max_len_summary = 28

K.clear_session()
latent_dim = 500

# Encoder: three stacked LSTMs over the embedded source sequence
encoder_inputs = Input(shape=(max_len_text,))
enc_emb = Embedding(x_voc_size, latent_dim, trainable=True)(encoder_inputs)
encoder_lstm1 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)
encoder_lstm2 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)
encoder_lstm3 = LSTM(latent_dim, return_state=True, return_sequences=True)
encoder_outputs, state_h, state_c = encoder_lstm3(encoder_output2)

# Decoder with attention over the encoder outputs
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(y_voc_size, latent_dim, trainable=True)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb, initial_state=[state_h, state_c])
attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])
decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
decoder_outputs = decoder_dense(decoder_concat_input)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)
Model_training = model.fit(
    [X_train, Y_train[:, :-1]],
    Y_train.reshape(Y_train.shape[0], Y_train.shape[1], 1)[:, 1:],
    epochs=50, callbacks=[es], batch_size=256,
    validation_data=([X_test, Y_test[:, :-1]],
                     Y_test.reshape(Y_test.shape[0], Y_test.shape[1], 1)[:, 1:]))


Other info / logs

Train on 314860 samples, validate on 78716 samples
Epoch 1/50
  256/314860 […] - ETA: 5:22 WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are:


CancelledError                            Traceback (most recent call last)
<ipython-input-30-8fb3a6c938b7> in <module>
      1 Model_training=model.fit([X_train,Y_train[:,:-1]], Y_train.reshape(Y_train.shape[0],Y_train.shape[1], 1)[:,1:]
      2                          ,epochs=50,callbacks=[es],batch_size=256, validation_data=([X_test,Y_test[:,:-1]],
----> 3                          Y_test.reshape(Y_test.shape[0],Y_test.shape[1], 1)[:,1:]))

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
    726           max_queue_size=max_queue_size,
    727           workers=workers,
--> 728           use_multiprocessing=use_multiprocessing)
    729 
    730   def evaluate(self,

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, **kwargs)
    322                 mode=ModeKeys.TRAIN,
    323                 training_context=training_context,
--> 324                 total_epochs=epochs)
    325       cbks.make_logs(model, epoch_logs, training_result, ModeKeys.TRAIN)
    326 

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py in run_one_epoch(model, iterator, execution_function, dataset_size, batch_size, strategy, steps_per_epoch, num_samples, mode, training_context, total_epochs)
    121           step=step, mode=mode, size=current_batch_size) as batch_logs:
    122         try:
--> 123           batch_outs = execution_function(iterator)
    124         except (StopIteration, errors.OutOfRangeError):
    125           # TODO(kaftan): File bug about tf function and errors.OutOfRangeError?

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py in execution_function(input_fn)
     84     # numpy translates Tensors to values in Eager mode.
     85     return nest.map_structure(_non_none_constant_value,
---> 86                               distributed_function(input_fn))
     87 
     88   return execution_function

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\def_function.py in __call__(self, *args, **kwds)
    455 
    456     tracing_count = self._get_tracing_count()
--> 457     result = self._call(*args, **kwds)
    458     if tracing_count == self._get_tracing_count():
    459       self._call_counter.called_without_tracing()

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\def_function.py in _call(self, *args, **kwds)
    485       # In this case we have created variables on the first call, so we run the
    486       # defunned version which is guaranteed to never create variables.
--> 487       return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
    488     elif self._stateful_fn is not None:
    489       # Release the lock early so that multiple threads can perform the call

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py in __call__(self, *args, **kwargs)
   1821     """Calls a graph function specialized to the inputs."""
   1822     graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 1823     return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
   1824 
   1825   @property

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py in _filtered_call(self, args, kwargs)
   1139         if isinstance(t, (ops.Tensor,
   1140                           resource_variable_ops.BaseResourceVariable))),
-> 1141         self.captured_inputs)
   1142 
   1143   def _call_flat(self, args, captured_inputs, cancellation_manager=None):

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1222     if executing_eagerly:
   1223       flat_outputs = forward_function.call(
-> 1224           ctx, args, cancellation_manager=cancellation_manager)
   1225     else:
   1226       gradient_name = self._delayed_rewrite_functions.register()

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py in call(self, ctx, args, cancellation_manager)
    509               inputs=args,
    510               attrs=("executor_type", executor_type, "config_proto", config),
--> 511               ctx=ctx)
    512         else:
    513           outputs = execute.execute_with_cancellation(

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     65   else:
     66     message = e.message
---> 67     six.raise_from(core._status_to_exception(e.code, message), None)
     68   except TypeError as e:
     69     keras_symbolic_tensors = [

C:\ProgramData\Anaconda3\lib\site-packages\six.py in raise_from(value, from_value)

CancelledError: [_Derived_]RecvAsync is cancelled.
	 [[{{node Reshape_17/_52}}]]
	 [[GroupCrossDeviceControlEdges_0/RMSprop/RMSprop/Const/_57]] [Op:__inference_distributed_function_24912]

Function call stack: distributed_function

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 42

Most upvoted comments

Closing as per @AniTho's comment. Thank you.

I don’t quite understand. How is changing the TF version a solution? An explanation of why this error is reproducible in TF 2.0 would be much appreciated, as well as clarification of whether this error is a TF bug or a configuration issue.

Bringing this back to life - I'm getting the same error with both CUDA 11.1 and 10.1 in TF 2.3.1 when using GRU. I am running Win10. None of the suggestions above work for me:

1) import os
   os.environ["TF_FORCE_GPU_ALLOW_GROWTH"]="true"

2) TF_FORCE_GPU_ALLOW_GROWTH=1

3) physical_devices = tf.config.list_physical_devices('GPU')
   tf.config.experimental.set_memory_growth(physical_devices[0], True)

My error is slightly different from the ones above in terms of the text I get back:

	 [[{{node Adam/Adam/update/AssignSubVariableOp/_27}}]]
	 [[gradient_tape/sequential/embedding/embedding_lookup/Reshape/_24]] [Op:__inference_train_function_2598]

Function call stack:
train_function

I have experienced the exact same error message in TF 2.0.0.

The problem can be reproduced with one of the tutorials on the TensorFlow website: https://www.tensorflow.org/tutorials/text/text_classification_rnn

The problem happens right after training is started (in the 1st epoch):

CancelledError: [_Derived_]RecvAsync is cancelled. [[{{node Adam/Adam/update/AssignSubVariableOp/_41}}]] [[Reshape_11/_38]] [Op:__inference_distributed_function_6315]

Function call stack: distributed_function

The problem seems to be related to the GPU: if I execute TensorFlow with CPU only, it does not crash. I use tensorflow-rocm (with a Vega 56 card), however, so it's probably not a coincidence that I get the exact same error message as mentioned above.
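For reference, a minimal sketch of forcing a CPU-only run for comparison (this assumes a CUDA build of TensorFlow; the equivalent environment variable for a tensorflow-rocm build may differ):

import os

# Hide all GPUs before TensorFlow initializes its devices so training falls back
# to CPU only (assumption: CUDA build; adjust for a ROCm build if needed).
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

print(tf.config.experimental.list_physical_devices('GPU'))  # expected: []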

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Elementary OS
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NaN
  • TensorFlow installed from (source or binary): pip
  • TensorFlow version (use command below): 2.0.0
  • Python version: 3.7.5
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: ROCM 2.9
  • GPU model and memory: Vega 56 8Gb

Same issue here with LSTM on GPU; it appears to be solved by:

import os
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"]="true"

EDIT: spoke too soon, somehow this works on one machine but not another?

@oanush why was this issue closed?

I am receiving a similar error.

CancelledError: [_Derived_]RecvAsync is cancelled. [[{{node Adam/Adam/update/AssignSubVariableOp/_33}}]] [[gradient_tape/sequential_1/embedding_1/embedding_lookup/Reshape/_30]] [Op:__inference_train_function_11493]

I am using tf 2.3.1 and Cuda Toolkit 11.1

I think I have fixed the issue. The root cause was bucket_by_sequence_length combined with my setting drop_remainder=False.

What seems to happen is that some batches do not contain enough samples, so there aren't enough examples for all cards. Once I set drop_remainder=True I didn't get this error anymore. So, make sure that you are not running into the same issue.
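For anyone hitting the same thing, here is a minimal sketch of that fix using tf.data.experimental.bucket_by_sequence_length (assuming a TF 2.x version where the transformation accepts drop_remainder; the dataset is a toy stand-in):

import tensorflow as tf

# Toy stand-in for a real dataset of variable-length sequences.
sequences = [[1, 2], [3, 4, 5], [6], [7, 8, 9, 10], [11, 12, 13]]
ds = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_types=tf.int32,
    output_shapes=tf.TensorShape([None]))

# Bucket by sequence length; drop_remainder=True discards partial batches so
# every replica always receives a full batch.
ds = ds.apply(tf.data.experimental.bucket_by_sequence_length(
    element_length_func=lambda seq: tf.shape(seq)[0],
    bucket_boundaries=[3, 5],        # buckets: len < 3, 3 <= len < 5, len >= 5
    bucket_batch_sizes=[2, 2, 2],    # one batch size per bucket
    drop_remainder=True))

for batch in ds:
    print(batch.shape)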

Also encountering the same error. I’m using relatively simple Keras code as follows:

import numpy as np
import tensorflow as tf
from tensorflow import keras

def build_model(in_dim, embedding_dim, maxlength):
    inputs = tf.keras.layers.Input(shape=(None,))
    embeddings = tf.keras.layers.Embedding(in_dim, embedding_dim, input_length=maxlength)(inputs)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(LSTM_units, return_sequences=True))(embeddings)
    x = tf.keras.layers.Dropout(0.1)(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(LSTM_units, return_sequences=True))(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(LSTM_units, return_sequences=True))(x)
    x = tf.keras.layers.Dropout(0.1)(x)
    x = tf.keras.layers.LSTM(LSTM_units, return_sequences=True)(x)

    x_h = tf.keras.layers.concatenate([tf.keras.layers.GlobalAveragePooling1D()(x),
                                     tf.keras.layers.GlobalMaxPool1D()(x)])
    x_h = tf.keras.layers.Dropout(0.2)(x_h)
    x_h = tf.keras.layers.Dense(dense_units, activation='relu')(x_h)
    x_h = tf.keras.layers.Dropout(0.3)(x_h)
    x_h = tf.keras.layers.add([x_h, tf.keras.layers.Dense(dense_units, activation='relu')(x_h)])

    outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x_h)

    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model_1 = build_model(num_chars, embedding_dimension, max_length)

callbacks = [tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.2, verbose=1, mode='max', patience=2, min_lr=0.000001),
             tf.keras.callbacks.ModelCheckpoint('/content/saved_at_{epoch}.h5', monitor='val_loss', save_best_only=True)
]


history = model_1.fit(x_train,
                     np.array(y_train), 
                     epochs=epochs, 
                     batch_size=batch_size, 
                     callbacks=callbacks, 
                     validation_split=val_split,
                     class_weight=class_weights
                     )

And the stack is:

Epoch 1/20
---------------------------------------------------------------------------
CancelledError                            Traceback (most recent call last)
<ipython-input-11-3c53d64825d4> in <module>()
     10                      callbacks=callbacks,
     11                      validation_split=val_split,
---> 12                      class_weight=class_weights
     13                      )

8 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

CancelledError:  [_Derived_]RecvAsync is cancelled.
	 [[{{node Adam/Adam/update/AssignSubVariableOp/_57}}]]
	 [[gradient_tape/functional_1/embedding/embedding_lookup/Reshape/_54]] [Op:__inference_train_function_18787]

Function call stack:
train_function

I’m running in Colab, so everything should be up to date. Attempting to use

import os
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"]="true"

results in

---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
<ipython-input-12-23cb5f878315> in <module>()
     13                      callbacks=callbacks,
     14                      validation_split=val_split,
---> 15                      class_weight=class_weights
     16                      )

6 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

InternalError: GPU sync failed

Just returned to TensorFlow after a month-long hiatus, but I've never seen this before. Not sure why it's a closed issue, since it's clearly been around for a year or so. Interestingly, when I use word-level instead of character-level encodings and use a smaller model (1mil parameters instead of the current 20mil), I have no issues.

EDIT: I tried using a TPU instead of a GPU to circumvent this problem, and the session crashed after using all available memory. Seems to be more related to the size of the model, not the GPU specifically?

EDIT 2: Slashed the network size from 20mil to 3mil params, reduced the embedding dimensionality, and cut the batch size. It runs on the GPU perfectly fine, but it's very slow: an hour for a single epoch. Further hyperparameter tweaking reduces it to 15 minutes per epoch. Definitely seems to be tied to network size and memory issues.

Also experienced this, also with LSTM models. Training runs for a while, e.g. ~100 batches, then this error comes up.

The issue is closed because, well, not sure.

While I can't really help without specifics, immediate recommendations are:
- Reduce model size
- Reduce dataset size
- Use smaller batches
- Batch items into a tf.data.Dataset object (see the sketch below)

Interestingly, I finished training a large 23M-param GRU on the same dataset as before without encountering issues. Try unrolling the RNN layers and batching items in buckets of 64.
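A minimal sketch of the "batch items into a tf.data.Dataset" recommendation (the arrays, vocabulary size and layer sizes here are made up for illustration):

import numpy as np
import tensorflow as tf

# Made-up arrays standing in for the real training data.
x_train = np.random.randint(0, 1000, size=(4096, 64))
y_train = np.random.randint(0, 2, size=(4096, 1))

# Wrap the arrays in a tf.data.Dataset; a fixed batch size with
# drop_remainder=True avoids a short final batch, and prefetch overlaps
# input preparation with training.
train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(4096)
            .batch(64, drop_remainder=True)
            .prefetch(tf.data.experimental.AUTOTUNE))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(1000, 32),
    tf.keras.layers.LSTM(64),   # unroll=True is another knob worth trying for short sequences
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(train_ds, epochs=1)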

Same issue!

nvcc: NVIDIA (R) Cuda compiler 
Copyright (c) 2005-2019 NVIDIA 
Built on Sun_Jul_28_19:07:
Cuda compilation tools, release 10.1, V10.1.243 

And tf version 2.3.0

Training seems to stop midway!

#Epoch 1/10
WARNING:tensorflow:From /home/aman_arora/.local/lib/python3.7/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py:601: get_next_as_optional (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.
INFO:tensorflow:batch_all_reduce: 213 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 213 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 213 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 213 all-reduces with algorithm = nccl, num_packs = 1
 7845/16191 [=============>................] - ETA: 40:32 - loss: 0.2051
---------------------------------------------------------------------------
CancelledError                            Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
    106   def _method_wrapper(self, *args, **kwargs):
    107     if not self._in_multi_worker_mode():  # pylint: disable=protected-access
--> 108       return method(self, *args, **kwargs)
    109 
    110     # Running inside `run_distribute_coordinator` already.
~/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1096                 batch_size=batch_size):
   1097               callbacks.on_train_batch_begin(step)
-> 1098               tmp_logs = train_function(iterator)
   1099               if data_handler.should_sync:
   1100                 context.async_wait()
~/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
    778       else:
    779         compiler = "nonXla"
--> 780         result = self._call(*args, **kwds)
    781 
    782       new_tracing_count = self._get_tracing_count()
~/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
    805       # In this case we have created variables on the first call, so we run the
    806       # defunned version which is guaranteed to never create variables.
--> 807       return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
    808     elif self._stateful_fn is not None:
    809       # Release the lock early so that multiple threads can perform the call
~/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py in __call__(self, *args, **kwargs)
   2827     with self._lock:
   2828       graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 2829     return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
   2830 
   2831   @property
~/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py in _filtered_call(self, args, kwargs, cancellation_manager)
   1846                            resource_variable_ops.BaseResourceVariable))],
   1847         captured_inputs=self.captured_inputs,
-> 1848         cancellation_manager=cancellation_manager)
   1849 
   1850   def _call_flat(self, args, captured_inputs, cancellation_manager=None):
~/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1922       # No tape is watching; skip to running the function.
   1923       return self._build_call_outputs(self._inference_function.call(
-> 1924           ctx, args, cancellation_manager=cancellation_manager))
   1925     forward_backward = self._select_forward_and_backward_functions(
   1926         args,
~/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py in call(self, ctx, args, cancellation_manager)
    548               inputs=args,
    549               attrs=attrs,
--> 550               ctx=ctx)
    551         else:
    552           outputs = execute.execute_with_cancellation(
~/.local/lib/python3.7/site-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:
CancelledError:  [_Derived_]RecvAsync is cancelled.
 [[{{node div_no_nan/ReadVariableOp_3/_596}}]] [Op:__inference_train_function_70284]
Function call stack:
train_function

I had the same issue with the latest TF 2.0.0 / cuDNN. I see lots of hits on this issue when searching.

I think this issue should be reopened, @oanush, as TF 2.0 runs into the problem on a toy / example solution. Migrating to a previous version isn't a fix.

The relevant code seems to be around here: https://github.com/tensorflow/tensorflow/blob/81f844c1ff2bee0c3a98a7fff7b308ad77d85309/tensorflow/core/framework/rendezvous.h

I am testing on a smallish 6GB GeForce GTX 1660 Ti, so perhaps it's just running out of memory and giving a misleading error? It might be an NVIDIA driver issue rather than TensorFlow's interface.

Adding validation_split seems to cause kernel shutdowns and the error above in Jupyter.

history = model.fit(x, y, batch_size=256, epochs=2, validation_split=0.33)

Setting

TF_FORCE_GPU_ALLOW_GROWTH=1

in system environment (Windows) and restarting the shell / restarting Jupyter worked.

Perhaps this needs to be set by default, or the parameter documented / set as part of the TensorFlow API?

for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 441.66       Driver Version: 441.66       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 166... WDDM  | 00000000:01:00.0  On |                  N/A |
| 27%   38C    P8    13W / 120W |   6028MiB /  6144MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+