tensorflow: Error occurred when finalizing GeneratorDataset iterator
System information
- OS Platform and Distribution: Arch Linux, 5.4.2-arch1-1-ARCH
- TensorFlow installed from: binary
- TensorFlow version: 2.1.0rc0-1
- Keras version: 2.2.4-tf
- Python version: 3.8
- GPU model and memory: 2x GTX 1080 Ti 11GB
Describe the current behavior
Executing TensorFlow's MNIST handwriting example produces the error below; the error disappears if the code does not use OneDeviceStrategy or MirroredStrategy:
W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
Code to reproduce the issue
import tensorflow as tf
import tensorflow_datasets as tfds
import time
from tensorflow.keras.optimizers import Adam
def build_model():
    filters = 48
    units = 24
    kernel_size = 7
    learning_rate = 1e-4
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(filters=filters, kernel_size=(kernel_size, kernel_size), activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(units, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(learning_rate), metrics=['accuracy'])
    return model
datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
mnist_train, mnist_test = datasets['train'], datasets['test']
num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples
strategy = tf.distribute.OneDeviceStrategy(device='/gpu:0')
BUFFER_SIZE = 10000
BATCH_SIZE = 32
def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label
train_dataset = mnist_train.map(scale).shuffle(BUFFER_SIZE).repeat().batch(BATCH_SIZE).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
eval_dataset = mnist_test.map(scale).repeat().batch(BATCH_SIZE).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
with strategy.scope():
    model = build_model()
epochs=5
start = time.perf_counter()
model.fit(
    train_dataset,
    validation_data=eval_dataset,
    steps_per_epoch=num_train_examples/epochs,
    validation_steps=num_test_examples/epochs,
    epochs=epochs)
elapsed = time.perf_counter() - start
print('elapsed: {:0.3f}'.format(elapsed))
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 15
- Comments: 51 (9 by maintainers)
@jsimsa Any update on this? I'm getting this exact message, and it looks like my model.fit is not doing its thing on the validation dataset during training.
@guptapriya I realized that the generator dataset is used in the multi-device iterator. This seems related to the newly added support for cancellation in tf.data.
The good news is that, as you pointed out, the warning is superfluous. The bad news is that, as far as I can tell, this warning will be present for all tf.distribute jobs in TF 2.1 (given how tf.data cancellation is implemented). I will look into having a fix for this cherry-picked into TF 2.1.
Adding this code snippet fixes this issue for me when using RTX GPUs:
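The snippet itself is not reproduced at this point in the thread; judging from the version quoted further down, it is the GPU memory-growth setting, roughly:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all up front.
devices = tf.config.experimental.list_physical_devices('GPU')
for device in devices:
    tf.config.experimental.set_memory_growth(device, True)

Note that this has to run before any GPU is initialized, i.e. before building the model or creating the distribution strategy.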
This is something I have to do in my training scripts as well. Might help someone 👍
This warning is spurious and should be removed by https://github.com/tensorflow/tensorflow/commit/b6edd34c5858ab0ab4380da774e7e2fd81a92da0
I have the same problem (using model.fit() and a numpy generator, not a keras.Sequence). I'm using Red Hat Linux, Python 3.6 and TensorFlow 2.4.1. Then I got this error:
File "/root/python_env/anaconda3/lib/python3.6/copy.py", line 215, in _deepcopy_list
    append(deepcopy(a, memo))
File "/root/python_env/anaconda3/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
File "/root/python_env/anaconda3/lib/python3.6/copy.py", line 274, in _reconstruct
    y = func(*args)
File "/root/python_env/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/tensor_shape.py", line 190, in __init__
    if value < 0:
RecursionError: maximum recursion depth exceeded in comparison
2021-03-26 14:50:45.148650: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]
@Tuxius I seem to have the same issue. Should this be reopened? Why was it closed anyway?
On my side it seems to happen when early stopping triggers, so it does not cancel the training. And I get no OOM message.
devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True)
This does not work for me.
I’m also experiencing this on the official Google Cloud Platform tf2-gpu.2-1.m42 image with Python 3.5.3.
@jsimsa thanks for the heads up! I updated to tf-nightly-gpu, and the error
W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
went away.
I think the error is not spurious. It indicates a memory leak on each epoch. Try training for an extremely large number of epochs; you will use up all memory.
I solved this problem (in TensorFlow 2.5).
Suppose there is a file named 'train.py' to run (this is an example).
When running .py files in the terminal
I found a reason for the problem on my computer - YMMV. I was using the ModelCheckpoint callback to save the best model, and if there was a model with that name already in the folder, I got the error. Removing or renaming the model with that name fixed the issue. Windows 10 system, Python 3.7.4.
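For context, a minimal sketch of that kind of setup (the filename, the monitored metric, the step counts, and the model/dataset names are illustrative assumptions, not taken from the comment above):

from tensorflow.keras.callbacks import ModelCheckpoint

# Hypothetical checkpoint configuration: save the "best" model to a fixed path.
# Per the comment above, having a file with this name already on disk coincided
# with the warning; deleting or renaming the existing file made it go away.
checkpoint = ModelCheckpoint(
    'best_model.h5',        # illustrative filename
    monitor='val_loss',
    save_best_only=True)

# `model`, `train_dataset`, and `eval_dataset` are assumed to be defined as in
# the reproduction script at the top of this issue (both datasets repeat(), so
# explicit step counts are required).
model.fit(train_dataset,
          validation_data=eval_dataset,
          steps_per_epoch=100,   # illustrative
          validation_steps=20,   # illustrative
          epochs=5,
          callbacks=[checkpoint])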
@spate141, @jsimsa, I have the same error and the memory leak as well. My configuration is TF 2.1.0, Ubuntu 18.04, Python 3.6.10. I'm using fit_generator with generators for train and validation. When I try to train a model, it immediately starts filling up the cache memory until it crashes. I observed that the issue might be related to the validation part, because it generates far more batches than it is supposed to. This error is not spurious.
I’ve downgraded my system:
Still facing the error:
Seems to be related to tensorflow-2.1.0-rc1.
The following solved the issue for me:
Call this function at the start of your script
Had the same problem: memory leak and crash after some number of epochs. Looks like the ModelCheckpoint callback is the culprit. Removing it solved the issue.
Ideas from Stack Overflow. I just directly copied the code from deeplearning.ai in Colab. A part of it goes like this:
train_generator = train_datagen.flow_from_directory(
    'horse-or-human/',       # This is the source directory for training images
    target_size=(300, 300),  # All images will be resized to 300x300
    batch_size=128,
    # Since we use binary_crossentropy loss, we need binary labels
    class_mode='binary')

history = model.fit(
    train_generator,
    steps_per_epoch=8,
    epochs=15,
    verbose=1)
There are 1027 images, and 128*8 = 1024, which is less than 1027. I set steps_per_epoch to 9 and the error disappeared. So, for me, the problem arises when the batch size and the steps (iterations) per epoch do not correspond. At least this is one of the cases for the error. Here is the original answer: https://stackoverflow.com/questions/60000573/error-occurred-when-finalizing-generatordataset-iterator-cancelled-operation-w
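As a quick sanity check of that relationship (the numbers come from the comment above; the arithmetic is just ceiling division):

import math

num_images = 1027   # images in horse-or-human/
batch_size = 128

# With steps_per_epoch=8 the generator only yields 128 * 8 = 1024 images,
# 3 short of the dataset; the warning disappeared once steps_per_epoch
# covered the whole dataset.
steps_per_epoch = math.ceil(num_images / batch_size)
print(steps_per_epoch)  # 9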
For me the results aren’t reproducible from run to run either, even with tf.random.set_seed(), but I suspect it has to do with multiple workers for my image augmentation generator. On Sat, Mar 28, 2020, 12:04 PM flydragon2018 notifications@github.com wrote:
but the training result seems not stable. train/val loss/accuracy up and downs too much.
There was a bug for Keras sequence multi-processing implementation that was fixed in https://github.com/tensorflow/tensorflow/commit/e918c6e6fab5d0005fcde83d57e92b70343d3553. This will be available in TF 2.2 and should be already available in TF nightly.
Problem description
I am using TensorFlow 2.1.0 for image classification under CentOS Linux. As my image training data set is growing, I have had to start using a Generator, because I do not have enough RAM to hold all the pictures. I coded the Generator based on this tutorial.
It seems to work fine until my program suddenly gets killed without an error message:
Looking at the growing memory consumption with Linux's top, I suspect a memory leak.
What I have tried
- The above suggestion to switch to the TF nightly build. It did not help for me; downgrading to TF 2.0.1 did not help either.
- There is a discussion suggesting that it is important that 'steps_per_epoch' and 'batch size' correspond (whatever this exactly means) - I played with it without finding any improvement.
- Trying to narrow the problem down by watching the size of all variables in my Generator.
Relevant code snippets: the Generator, being called by the training code.
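Neither snippet is included above, so purely as an illustration (the class, its names, and the fit() call are assumptions, not the original code), a tf.keras.utils.Sequence generator along the lines of the linked tutorial might look like this:

import math
import numpy as np
import tensorflow as tf

# Hypothetical generator: names and loading logic are illustrative only.
class ImageSequence(tf.keras.utils.Sequence):
    """Loads images batch by batch so the full dataset never sits in RAM."""

    def __init__(self, image_paths, labels, batch_size=32):
        self.image_paths = image_paths
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # Batches per epoch; ceiling so the last partial batch is included.
        return math.ceil(len(self.image_paths) / self.batch_size)

    def __getitem__(self, idx):
        start = idx * self.batch_size
        paths = self.image_paths[start:start + self.batch_size]
        labels = self.labels[start:start + self.batch_size]
        images = np.stack([self._load(p) for p in paths])
        return images, np.asarray(labels)

    def _load(self, path):
        # Decode a single grayscale PNG and scale it to [0, 1].
        image = tf.io.decode_png(tf.io.read_file(path), channels=1)
        return tf.cast(image, tf.float32).numpy() / 255.0

# Hypothetical call site: `model`, `train_paths`, `train_labels`, `val_paths`,
# and `val_labels` are assumed to exist.
train_seq = ImageSequence(train_paths, train_labels, batch_size=32)
val_seq = ImageSequence(val_paths, val_labels, batch_size=32)
model.fit(train_seq, validation_data=val_seq, epochs=10)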
I have tried running the exact same code under Windows 10, which gives me the following error:
@Mauhing Thanks for the script demo; I hope the team looked into it. And yes, I tried the small batch_size and it seems to be working fine as of now. Luckily, my model reaches a good validation accuracy in a relatively small number of epochs… so there is that! 🤷🏻‍♂️
I can verify this error with Python 3.8 and python-tensorflow-opt-cuda 2.1.0rc1-2 on Arch Linux. Oddly, the error is not present if you import only the generator from TensorFlow and everything else from Keras.
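For what it's worth, one reading of that import mix (this is an interpretation of the comment above, not code taken from it) is pulling the data generator from tf.keras while building the model with the standalone keras package:

# Only the generator comes from TensorFlow...
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# ...everything else comes from standalone Keras.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

datagen = ImageDataGenerator(rescale=1.0 / 255)

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(),
    Flatten(),
    Dense(10, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])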
@olk, I tried reproducing the reported issue but it worked as expected. Please take a look at the gist. Thanks!