tensorflow: Error occurred when finalizing GeneratorDataset iterator

System information

  • OS Platform and Distribution: Arch Linux, 5.4.2-arch1-1-ARCH
  • TensorFlow installed from: binary
  • TensorFlow version: 2.1.0rc0-1
  • Keras version: 2.2.4-tf
  • Python version: 3.8
  • GPU model and memory: 2x GTX 1080 Ti 11GB

Describe the current behavior

Executing TensorFlow's MNIST handwriting example produces the following error; the error disappears if the code does not use OneDeviceStrategy or MirroredStrategy:

W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

Code to reproduce the issue

import tensorflow as tf
import tensorflow_datasets as tfds
import time

from tensorflow.keras.optimizers import Adam

def build_model():
    filters = 48
    units = 24
    kernel_size = 7
    learning_rate = 1e-4
    model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(filters=filters, kernel_size=(kernel_size, kernel_size), activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(units, activation='relu'),
      tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(learning_rate), metrics=['accuracy'])
    return model

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
mnist_train, mnist_test = datasets['train'], datasets['test']

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

strategy = tf.distribute.OneDeviceStrategy(device='/gpu:0')

BUFFER_SIZE = 10000
BATCH_SIZE = 32

def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

train_dataset = mnist_train.map(scale).shuffle(BUFFER_SIZE).repeat().batch(BATCH_SIZE).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
eval_dataset = mnist_test.map(scale).repeat().batch(BATCH_SIZE).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

with strategy.scope():
    model = build_model()

epochs = 5
start = time.perf_counter()
model.fit(
    train_dataset,
    validation_data=eval_dataset,
    steps_per_epoch=num_train_examples / epochs,
    validation_steps=num_test_examples / epochs,
    epochs=epochs)
elapsed = time.perf_counter() - start
print('elapsed: {:0.3f}'.format(elapsed))
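For reference, the report states that the same warning also appears with MirroredStrategy. A minimal variant of the script above (a sketch, assuming both GPUs should be used) only swaps the strategy line:

# Drop-in replacement for the OneDeviceStrategy line above; the rest of the
# script is unchanged.
strategy = tf.distribute.MirroredStrategy(devices=['/gpu:0', '/gpu:1'])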

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 15
  • Comments: 51 (9 by maintainers)

Most upvoted comments

@jsimsa Any update on this? I’m getting this exact message, and it looks like my model.fit is not doing its thing on the validation dataset during training.

@guptapriya I realized that the generator dataset is used by the multi-device iterator. This seems related to the newly added support for cancellation in tf.data.

The good news is that, as you pointed out, the warning is superfluous. The bad news is that, as far as I can tell, this warning will be present for all tf.distribute jobs in TF 2.1 (given how tf.data cancellation is implemented). I will look into having a fix for this cherry-picked into TF 2.1.
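Until such a fix lands, one way to silence the superfluous message (a sketch, not an official fix; note it hides all native TensorFlow warnings, not just this one) is to raise the C++ log threshold before importing TensorFlow:

import os

# 0 = all messages, 1 = filter INFO, 2 = filter INFO and WARNING, 3 = also filter ERROR.
# Must be set before the first `import tensorflow`.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf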

Adding this code snippet fixes this issue for me when using RTX GPUs:

devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True)

This is something I have to do in my training scripts as well. Might help someone 👍
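Since the original report uses two GPUs, a slightly more general version of that snippet (a sketch of the same idea, not a change to the commenter's code) enables memory growth on every visible GPU:

import tensorflow as tf

# Enable memory growth on all visible GPUs, not just the first one.
# Must run before any GPU has been initialized (i.e. before building a model).
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)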

I have the same problem (using model.fit() with a NumPy generator rather than a keras.Sequence). I’m using Red Hat Linux, Python 3.6 and TensorFlow 2.4.1. Then I got this error:

  File "/root/python_env/anaconda3/lib/python3.6/copy.py", line 215, in _deepcopy_list
    append(deepcopy(a, memo))
  File "/root/python_env/anaconda3/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/root/python_env/anaconda3/lib/python3.6/copy.py", line 274, in _reconstruct
    y = func(*args)
  File "/root/python_env/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/tensor_shape.py", line 190, in __init__
    if value < 0:
RecursionError: maximum recursion depth exceeded in comparison
2021-03-26 14:50:45.148650: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
	 [[{{node PyFunc}}]]

@Tuxius I seem to have the same issue. Should this be reopened? Why is it closed anyway?

On my side it seems to happen when early stopping triggers, so it does not cancel the training, and I get no OOM message.

156/156 [==============================] - 86s 550ms/step - loss: 0.0676 - acc: 0.9790 - val_loss: 0.7805 - val_acc: 0.8569
Epoch 17/1000
156/156 [==============================] - 86s 550ms/step - loss: 0.0711 - acc: 0.9748 - val_loss: 0.4852 - val_acc: 0.8875
Epoch 18/1000
156/156 [==============================] - 86s 550ms/step - loss: 0.0638 - acc: 0.9772 - val_loss: 1.1247 - val_acc: 0.8371
2020-03-09 20:41:21.425818: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
[0.8876654] [5]
WARNING:tensorflow:sample_weight modes were coerced from
  {'output': '...'}
    to
  ['...']
Train for 156 steps, validate on 11715 samples
Epoch 1/1000
156/156 [==============================] - 88s 566ms/step - loss: 0.4006 - acc: 0.8377 - val_loss: 1.3430 - val_acc: 0.5214
Epoch 2/1000
156/156 [==============================] - 86s 550ms/step - loss: 0.1554 - acc: 0.9437 - val_loss: 0.7877 - val_acc: 0.8219
Epoch 3/1000

devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True)

did not work for me.

I’m also experiencing this on the official Google Cloud Platform tf2-gpu.2-1.m42 image with Python 3.5.3.

@jsimsa thanks for the heads up! I updated to tf-nightly-gpu and the warning "W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled" went away.

I think the error is not spurious. It indicates a memory leak that grows with each epoch. Try training for an extremely large number of epochs: you will use up all the memory.

I solved this problem (in TensorFlow 2.5).

Suppose there is a file named 'train.py' to run (this is an example).

Either select the GPU when running the .py file in the terminal:

$ CUDA_VISIBLE_DEVICES=0 python train.py  # Use GPU 0.
$ CUDA_VISIBLE_DEVICES=1 python train.py  # Use GPU 1.
$ CUDA_VISIBLE_DEVICES=2,3 python train.py  # Use GPUs 2 and 3.

or add these 3 lines to the train.py file:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"   
os.environ["CUDA_VISIBLE_DEVICES"]="0"

I found a reason for the problem on my computer - YMMV. I was using the ModelCheckpoint callback to save the best model, and if a model with that name already existed in the folder, I got the error. Removing or renaming the existing model file fixed the issue. Windows 10 system, Python 3.7.4.
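A simple way to avoid colliding with a file left over from an earlier run (a sketch; the timestamped filename is my own choice, not something from the comment) is to give each run its own checkpoint path:

import time

import tensorflow as tf

# Hypothetical: one checkpoint file per run, so ModelCheckpoint never has to
# replace a best model saved by a previous run.
checkpoint_path = 'best_model_{}.h5'.format(int(time.time()))
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    checkpoint_path, monitor='val_loss', save_best_only=True)

# Then pass it to training: model.fit(..., callbacks=[checkpoint_cb])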

@spate141, @jsimsa, I have the same error and the memory leak as well. My configuration is TF 2.1.0, Ubuntu 18.04, Python 3.6.10. I’m using fit_generator with generators for train and validation. When I try to train a model, it immediately starts filling up the cache memory until it crashes. I observed that the issue might be related to the validation part, because it generates far more batches than it is supposed to. This error is not spurious.

I’ve downgraded my system:

  • Python 3.7.4
  • Tensorflow-2.1.0-rc1

Still facing the error:

Train for 30000.0 steps, validate for 5000.0 steps
Epoch 1/2
2019-12-17 19:21:54.361240: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2019-12-17 19:21:55.824790: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-12-17 19:21:56.980785: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
30000/30000 [==============================] - 115s 4ms/step - loss: 0.0856 - accuracy: 0.9761 - val_loss: 0.0376 - val_accuracy: 0.9879
Epoch 2/2
29990/30000 [============================>.] - ETA: 0s - loss: 0.0152 - accuracy: 0.9958
2019-12-17 19:25:28.372294: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
30000/30000 [==============================] - 111s 4ms/step - loss: 0.0152 - accuracy: 0.9958 - val_loss: 0.0375 - val_accuracy: 0.9889
2019-12-17 19:25:40.010887: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2019-12-17 19:25:40.031138: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
elapsed: 226.391

Seems to be related to tensorflow-2.1.0-rc1.

The following solved the issue for me:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession


def fix_gpu():
    config = ConfigProto()
    config.gpu_options.allow_growth = True
    session = InteractiveSession(config=config)


fix_gpu()

Call this function at the start of your script.

Had the same problem. Memory leak and crash after some number of epochs. Looks like the ModelCheckpoint callback is a culprit. Removing it solved the issue.

Ideas from Stack Overflow: I just directly copied the code from deeplearning.ai in Colab. A part of it goes like this:

train_generator = train_datagen.flow_from_directory(
    'horse-or-human/',       # This is the source directory for training images
    target_size=(300, 300),  # All images will be resized to 300x300
    batch_size=128,
    # Since we use binary_crossentropy loss, we need binary labels
    class_mode='binary')

history = model.fit(
    train_generator,
    steps_per_epoch=8,
    epochs=15,
    verbose=1)

There are 1027 images, and 128 * 8 = 1024 is less than 1027. I set steps_per_epoch to 9 and the error disappeared. So, for me, the problem arises when the batch size and the steps (iterations) per epoch do not correspond to the dataset size. At least this is one of the cases for the error. Here is the original answer: https://stackoverflow.com/questions/60000573/error-occurred-when-finalizing-generatordataset-iterator-cancelled-operation-w
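A small sketch of that correspondence, reusing the train_generator and model names from the snippet above (the numbers are the ones quoted in the comment):

import math

num_images = 1027   # images found by flow_from_directory in the comment above
batch_size = 128

# Round up so the final, partially filled batch is still consumed each epoch,
# instead of the generator being asked for more batches than it can provide.
steps_per_epoch = math.ceil(num_images / batch_size)  # -> 9

history = model.fit(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=15,
    verbose=1)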

For me the results aren’t reproducible from run to run either, even with tf.random.set_seed(), but I suspect it has to do with multiple workers for my image augmentation generator.


But the training result seems not stable; train/val loss/accuracy go up and down too much.

There was a bug in the Keras Sequence multi-processing implementation that was fixed in https://github.com/tensorflow/tensorflow/commit/e918c6e6fab5d0005fcde83d57e92b70343d3553. The fix will be available in TF 2.2 and should already be available in the TF nightly builds.
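Until TF 2.2 (or a nightly containing that commit) is in place, one workaround consistent with that diagnosis (a sketch, not the maintainers' recommendation; model, training_generator and epochs are assumed from the surrounding reports) is to keep Sequence multi-processing off when calling fit:

# Workaround sketch: stay off the Sequence multi-processing path entirely,
# accepting the slower but unaffected single-worker threaded path.
model.fit(
    training_generator,
    use_multiprocessing=False,
    workers=1,
    epochs=epochs)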

Problem description

I am using TensorFlow 2.1.0 for image classification under CentOS Linux. As my image training data set is growing, I have to start using a generator, because I do not have enough RAM to hold all the pictures. I have coded the generator based on this tutorial.

It seems to work fine until my program all of a sudden gets killed, without an error message:

Epoch 6/30
2020-03-08 13:28:11.361785: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
43/43 [==============================] - 54s 1s/step - loss: 5.6839 - accuracy: 0.4669
Epoch 7/30
2020-03-08 13:29:05.511813: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
 7/43 [===>..........................] - ETA: 1:04 - loss: 4.3953 - accuracy: 0.5268Killed

Looking at the growing memory consumption with Linux's top, I suspect a memory leak?

What I have tried

  • The above suggestion to switch to the TF nightly build. For me it did not help; downgrading to TF 2.0.1 did not help either.

  • There is a discussion suggesting that it is important that 'steps_per_epoch' and 'batch_size' correspond (whatever that exactly means); I played with it without finding any improvement.

  • Trying to narrow it down by watching how the sizes of all variables in my Generator develop (a sketch of one way to do this is shown below).
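A concrete way to do that last check (a sketch, assuming a Linux system; the MemoryLoggingCallback name and the use of the standard-library resource module are mine, not from the original report) is to log the process's peak resident memory after every epoch:

import resource

import tensorflow as tf

class MemoryLoggingCallback(tf.keras.callbacks.Callback):
    """Print the peak resident set size after each epoch (ru_maxrss is in KiB on Linux)."""

    def on_epoch_end(self, epoch, logs=None):
        peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print('epoch {}: peak RSS {:.1f} MiB'.format(epoch + 1, peak_kib / 1024))

# Usage: add it next to the existing LoggingCallback in the callbacks list passed to model.fit().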

Relevant code snippets

import configparser
import math

import tensorflow as tf

class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, labels, dir, n_classes):
        'Initialization'
        config = configparser.ConfigParser()
        config.sections()
        config.read('config.ini')

        self.dim = (int(config['Basics']['PicHeight']),int(config['Basics']['PicWidth']))
        self.batch_size = int(config['HyperParameter']['batchsize'])
        self.labels = labels
        self.list_IDs = list_IDs
        self.dir = dir
        self.n_channels = 3
        self.n_classes = n_classes
        self.on_epoch_end()        


    def __len__(self):
        'Denotes the number of batches per epoch'
        return math.floor(len(self.list_IDs) / self.batch_size)

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(list_IDs_temp)

        return X, y, [None]

being called by

        training_generator = datagenerator.DataGenerator(train_files, labels, dir, len(self.class_names))
        self.model.fit(x=training_generator,
                    use_multiprocessing=False,
                    workers=6, 
                    epochs=self._Epochs, 
                    steps_per_epoch = len(training_generator),
                    callbacks=[LoggingCallback(self.logger.debug)])
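The class above calls self.on_epoch_end() in its constructor and reads self.indexes in __getitem__, but that method is not shown in the snippet. In the tutorial this generator pattern comes from, it typically just rebuilds and shuffles the index array; a hypothetical reconstruction for completeness (an assumption, not the reporter's actual code):

import numpy as np

class DataGenerator(tf.keras.utils.Sequence):
    # ... the methods shown above ...

    def on_epoch_end(self):
        'Update (and shuffle) the indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        np.random.shuffle(self.indexes)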

I have tried running the exact same code under Windows 10, which gives me the following error:

Epoch 9/30
2020-03-08 20:49:37.555692: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
41/41 [==============================] - 75s 2s/step - loss: 2.0167 - accuracy: 0.3133
Epoch 10/30
2020-03-08 20:50:52.986306: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
 1/41 [..............................] - ETA: 2:36 - loss: 1.6237 - accuracy: 0.39062020-03-08 20:50:57.689373: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at matmul_op.cc:480 : Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-03-08 20:50:57.766163: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[{{node MatMul_6}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 2/41 [>.............................] - ETA: 2:02 - loss: 1.6237 - accuracy: 0.3906Traceback (most recent call last):
  File "run.py", line 83, in <module>
    main()
  File "run.py", line 70, in main
    accuracy, num_of_classes = train_Posture(unique_name)
  File "run.py", line 31, in train_Posture
    acc = neuro.train(picdb, train_ids, test_ids, "Posture")
  File "A:\200307 3rd Try\neuro.py", line 161, in train
    callbacks=[LoggingCallback(self.logger.debug)])
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 599, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
    ctx=ctx)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[node MatMul_6 (defined at A:\200307 3rd Try\neuro.py:161) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_764]

Function call stack:
distributed_function

2020-03-08 20:51:00.785175: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

@Mauhing Thanks for the script demo; I hope the team looked into it. And yes, I tried a small batch_size and it seems to be working fine as of now. Luckily, my model reaches a good validation accuracy in a relatively small number of epochs… so there is that! 🤷🏻‍♂️

I can verify this error with Python 3.8 and python-tensorflow-opt-cuda 2.1.0rc1-2 on Arch Linux. Weirdly, this error is not present if you import only the generator from TensorFlow and everything else from Keras.

@olk, I tried reproducing the reported issue but it worked as expected. Please take a look at the gist. Thanks!