tensorflow: Tensorflow 2.1 Error “when finalizing GeneratorDataset iterator” - a memory leak?

Reopening of issue #35100, as more and more people report that they still have the same problem:

Problem description

I am using TensorFlow 2.1.0 for image classification under CentOS Linux. As my image training data set is growing, I have had to start using a Generator, since I do not have enough RAM to hold all the pictures. I coded the Generator based on this tutorial.

It seems to work fine until my program suddenly gets killed, without any error message:

Epoch 6/30
2020-03-08 13:28:11.361785: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
43/43 [==============================] - 54s 1s/step - loss: 5.6839 - accuracy: 0.4669
Epoch 7/30
2020-03-08 13:29:05.511813: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
 7/43 [===>..........................] - ETA: 1:04 - loss: 4.3953 - accuracy: 0.5268Killed

Watching the memory consumption grow in Linux's top, I suspect a memory leak.

What I have tried

  • The suggestion to switch to the TF nightly build. This did not help for me, and downgrading to TF 2.0.1 did not help either.

  • There is a discussion suggesting that it is important that 'steps_per_epoch' and the batch size correspond (whatever that means exactly). I played with both without finding any improvement.

  • Trying to narrow it down by tracking how the sizes of all variables in my Generator develop over time (see the sketch below).
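
One way to make such growth visible is to log the training process's resident memory after every epoch. A minimal sketch, assuming psutil is installed (the callback below is only an illustration, not part of my original code):

import os

import psutil
import tensorflow as tf

class MemoryLoggingCallback(tf.keras.callbacks.Callback):
    'Prints the resident set size (RSS) of the training process after each epoch.'
    def on_epoch_end(self, epoch, logs=None):
        # RSS in bytes, converted to megabytes for readability
        rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
        print('Epoch %d: process RSS = %.1f MB' % (epoch, rss_mb))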

Relevant code snippets

import configparser
import math

import tensorflow as tf


class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, labels, dir, n_classes):
        'Initialization'
        config = configparser.ConfigParser()
        config.sections()
        config.read('config.ini')

        self.dim = (int(config['Basics']['PicHeight']),int(config['Basics']['PicWidth']))
        self.batch_size = int(config['HyperParameter']['batchsize'])
        self.labels = labels
        self.list_IDs = list_IDs
        self.dir = dir
        self.n_channels = 3
        self.n_classes = n_classes
        self.on_epoch_end()        


    def __len__(self):
        'Denotes the number of batches per epoch'
        return math.floor(len(self.list_IDs) / self.batch_size)

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(list_IDs_temp)

        return X, y, [None]

being called by

        training_generator = datagenerator.DataGenerator(train_files, labels, dir, len(self.class_names))
        self.model.fit(x=training_generator,
                    use_multiprocessing=False,
                    workers=6, 
                    epochs=self._Epochs, 
                    steps_per_epoch = len(training_generator),
                    callbacks=[LoggingCallback(self.logger.debug)])
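
Side note: since DataGenerator is a tf.keras.utils.Sequence, Keras can derive the number of batches per epoch from its __len__, so steps_per_epoch is redundant in this call. A minimal variant without it, shown only as a sketch (same names as in the call above):

        self.model.fit(x=training_generator,
                    use_multiprocessing=False,
                    workers=6,
                    epochs=self._Epochs,
                    callbacks=[LoggingCallback(self.logger.debug)])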

I tried running the exact same code under Windows 10, which gave me the following error:

Epoch 9/30
2020-03-08 20:49:37.555692: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
41/41 [==============================] - 75s 2s/step - loss: 2.0167 - accuracy: 0.3133
Epoch 10/30
2020-03-08 20:50:52.986306: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
 1/41 [..............................] - ETA: 2:36 - loss: 1.6237 - accuracy: 0.39062020-03-08 20:50:57.689373: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at matmul_op.cc:480 : Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-03-08 20:50:57.766163: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[{{node MatMul_6}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 2/41 [>.............................] - ETA: 2:02 - loss: 1.6237 - accuracy: 0.3906Traceback (most recent call last):
  File "run.py", line 83, in <module>
    main()
  File "run.py", line 70, in main
    accuracy, num_of_classes = train_Posture(unique_name)
  File "run.py", line 31, in train_Posture
    acc = neuro.train(picdb, train_ids, test_ids, "Posture")
  File "A:\200307 3rd Try\neuro.py", line 161, in train
    callbacks=[LoggingCallback(self.logger.debug)])
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 599, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
    ctx=ctx)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[node MatMul_6 (defined at A:\200307 3rd Try\neuro.py:161) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_764]

Function call stack:
distributed_function

2020-03-08 20:51:00.785175: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 11
  • Comments: 19 (3 by maintainers)

Most upvoted comments

I also have the same issue: Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

@wendell-hom

I used "conda install tensorflow-gpu" to install my tensorflow environment. How do I get this fix into my conda env? =)

This issue happened to me today, and since I also happen to use conda, I thought I would share my setup with you as well:

conda create -n tf22 python=3.7 cudnn cupti cudatoolkit=10.1.243
conda activate tf22
pip install tensorflow==2.2.0rc3

For TF2, TensorFlow already supports the GPU out of the box, as long as it can load all the required libraries.
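
If you want to double-check that the environment actually sees the GPU, a quick sanity check using the standard TF 2.x API (illustration only):

import tensorflow as tf

# An empty list means the CUDA/cuDNN libraries could not be loaded
# and training will silently fall back to the CPU.
print(tf.config.list_physical_devices('GPU'))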

Got the same problem here. It only happens when I specify the number of workers, but removing that argument slows down the whole process.
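
For context, the workers and use_multiprocessing arguments of model.fit control how many threads or processes prefetch batches from a Sequence. A hedged sketch of the trade-off (model and training_generator stand in for the objects defined earlier in this thread):

model.fit(x=training_generator,
          epochs=30,
          workers=1,                  # the default: serial batch loading, slower but reportedly avoids the crash
          use_multiprocessing=False)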