tensorflow: TensorFlow 2.1 Error “when finalizing GeneratorDataset iterator” - a memory leak?
Reopening of issue #35100, as more and more people report still having the same problem:
Problem description
I am using TensorFlow 2.1.0 for image classification under CentOS Linux. As my training image dataset is growing, I have had to start using a Generator, because I do not have enough RAM to hold all the pictures. I coded the Generator based on this tutorial.
It seems to work fine until my program all of a sudden gets killed without an error message:
Epoch 6/30
2020-03-08 13:28:11.361785: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
43/43 [==============================] - 54s 1s/step - loss: 5.6839 - accuracy: 0.4669
Epoch 7/30
2020-03-08 13:29:05.511813: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
7/43 [===>..........................] - ETA: 1:04 - loss: 4.3953 - accuracy: 0.5268Killed
Watching the memory consumption grow with Linux's top, I suspect a memory leak.
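To quantify what top shows, the resident set size can be logged after every epoch with a small callback. A minimal sketch, assuming psutil is installed (MemoryLogger is an illustrative name, not part of my code):

import os
import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    'Prints the resident set size (RSS) of this process after every epoch'
    def on_epoch_end(self, epoch, logs=None):
        rss = psutil.Process(os.getpid()).memory_info().rss
        print('epoch %d: RSS = %.1f MiB' % (epoch, rss / 1024 ** 2))

If the RSS climbs by a roughly constant amount per epoch, that points to a leak rather than a one-off allocation.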
What I have tried
- Switching to the TF nightly build, as suggested. For me it did not help; downgrading to TF 2.0.1 did not help either.
- There is a discussion suggesting that it is important that 'steps_per_epoch' and 'batch size' correspond (whatever that exactly means). I played with both without finding any improvement.
- Trying to narrow the problem down by watching how the sizes of all variables in my Generator develop; see the sketch after this list.
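For the last point, this is roughly the kind of check I mean; tracemalloc is in the standard library, and the snapshot diff below is an illustration rather than my exact code:

import tracemalloc

tracemalloc.start()
# ... train for one epoch ...
before = tracemalloc.take_snapshot()
# ... train for one more epoch ...
after = tracemalloc.take_snapshot()
# Top 10 source lines that allocated the most new memory between the snapshots
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)

Note that tracemalloc only tracks Python-level allocations; memory held inside the TensorFlow C++ runtime will not show up here.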
Relevant code snippets
import configparser
import math

import tensorflow as tf


class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'

    def __init__(self, list_IDs, labels, dir, n_classes):
        'Initialization'
        config = configparser.ConfigParser()
        config.sections()
        config.read('config.ini')
        self.dim = (int(config['Basics']['PicHeight']), int(config['Basics']['PicWidth']))
        self.batch_size = int(config['HyperParameter']['batchsize'])
        self.labels = labels
        self.list_IDs = list_IDs
        self.dir = dir
        self.n_channels = 3
        self.n_classes = n_classes
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return math.floor(len(self.list_IDs) / self.batch_size)

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        # Generate data
        X, y = self.__data_generation(list_IDs_temp)
        return X, y, [None]
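The on_epoch_end and __data_generation methods are not shown above; they follow the tutorial's pattern. A rough sketch of their shape, assuming import numpy as np and import os (the tutorial loads .npy arrays, while my actual code reads image files, so take the loading line as a placeholder):

# inside the DataGenerator class:
def on_epoch_end(self):
    'Updates indexes after each epoch'
    self.indexes = np.arange(len(self.list_IDs))
    np.random.shuffle(self.indexes)

def __data_generation(self, list_IDs_temp):
    'Generates one batch of batch_size samples'
    X = np.empty((self.batch_size, *self.dim, self.n_channels))
    y = np.empty((self.batch_size,), dtype=int)
    for i, ID in enumerate(list_IDs_temp):
        # placeholder: the tutorial does np.load; my code reads a picture from self.dir
        X[i,] = np.load(os.path.join(self.dir, ID + '.npy'))
        y[i] = self.labels[ID]
    return X, tf.keras.utils.to_categorical(y, num_classes=self.n_classes)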
The generator is being called by:
training_generator = datagenerator.DataGenerator(train_files, labels, dir, len(self.class_names))
self.model.fit(x=training_generator,
               use_multiprocessing=False,
               workers=6,
               epochs=self._Epochs,
               steps_per_epoch=len(training_generator),
               callbacks=[LoggingCallback(self.logger.debug)])
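As far as I understand, when x is a tf.keras.utils.Sequence, steps_per_epoch defaults to len(x), so the explicit argument above should be redundant; the call would be equivalent to:

self.model.fit(x=training_generator,
               use_multiprocessing=False,
               workers=6,
               epochs=self._Epochs,
               callbacks=[LoggingCallback(self.logger.debug)])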
I have tried running the exact same code under Windows 10, which gives me the following error:
Epoch 9/30
2020-03-08 20:49:37.555692: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
41/41 [==============================] - 75s 2s/step - loss: 2.0167 - accuracy: 0.3133
Epoch 10/30
2020-03-08 20:50:52.986306: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
1/41 [..............................] - ETA: 2:36 - loss: 1.6237 - accuracy: 0.39062020-03-08 20:50:57.689373: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at matmul_op.cc:480 : Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-03-08 20:50:57.766163: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[{{node MatMul_6}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
2/41 [>.............................] - ETA: 2:02 - loss: 1.6237 - accuracy: 0.3906Traceback (most recent call last):
  File "run.py", line 83, in <module>
    main()
  File "run.py", line 70, in main
    accuracy, num_of_classes = train_Posture(unique_name)
  File "run.py", line 31, in train_Posture
    acc = neuro.train(picdb, train_ids, test_ids, "Posture")
  File "A:\200307 3rd Try\neuro.py", line 161, in train
    callbacks=[LoggingCallback(self.logger.debug)])
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 599, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
    ctx=ctx)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
  [[node MatMul_6 (defined at A:\200307 3rd Try\neuro.py:161) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_764]
Function call stack:
distributed_function
2020-03-08 20:51:00.785175: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
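For scale, the tensor that the failed MatMul_6 tries to allocate is large on its own (a quick back-of-the-envelope check):

elements = 1279200 * 322   # 411,902,400 values in a [1279200, 322] tensor
size = elements * 4        # float32 = 4 bytes -> 1,647,609,600 bytes
print(size / 2 ** 30)      # ~1.53 GiB for this single tensor

So once the growth over the epochs has eaten most of the RAM, a single allocation of this size is enough to push the process over the edge.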
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 11
- Comments: 19 (3 by maintainers)
I also have the same issue: Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
@wendell-hom This issue happened to me today, and since I also happen to use conda, I thought I would share this with you as well: for TF2, TensorFlow already supports the GPU if it can open all the libraries.
Got the same problem here, and it only happens when I specify the number of workers. But removing that argument slows down the process.
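For reference, workers defaults to 1 in Model.fit, so removing the argument amounts to something like this (a sketch; the model and generator names follow the original post):

model.fit(x=training_generator,
          use_multiprocessing=False,
          workers=1,   # the default: data loading runs in a single background thread
          epochs=epochs)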