tensorflow: Iterator on cached tf.data Dataset cannot be reinitialized

I found a likely bug when using a reinitializable iterator to read from two cached datasets, one for training and one for validation. The iterator can only be initialized once per cached dataset. It seems to me that the iterator should remove the lockfile when it is reinitialized; in my case it does not, which is why I hit this error. Here is a minimal example with only one cached dataset.

(basic system information below)

Example

import os

import numpy as np
import tensorflow as tf


# 10 samples of 3 features each, shuffled, repeated indefinitely, batched by 5.
# Note that repeat() is applied before cache() below.
data = np.random.rand(10, 3).astype(np.float32)
dataset = tf.data.Dataset.from_tensor_slices(data)
batches = dataset.shuffle(10).repeat().batch(5)

# Run on CPU only
config = tf.ConfigProto(device_count={'GPU': 0})
sess = tf.Session(config=config)

cache_dir = os.path.join(os.getcwd(), 'cache_dir')
try:
    os.makedirs(cache_dir)
except OSError:
    print('Cache directory already exists')

# Cache the batched pipeline to disk and build a reinitializable iterator
# that can switch between the cached and uncached pipelines.
cached = batches.cache(os.path.join(cache_dir, 'cache'))
iterator = tf.data.Iterator.from_structure(output_types=tf.float32, output_shapes=(5, 3))
batch = iterator.get_next()

init1 = iterator.make_initializer(cached)   # cached pipeline
init2 = iterator.make_initializer(batches)  # uncached pipeline

# First initialization of the cached pipeline works:
sess.run(init1)
sess.run(batch)

array([[ 0.11960778,  0.3081578 ,  0.96522039],
       [ 0.90339011,  0.12458269,  0.30650312],
       [ 0.58160347,  0.55877644,  0.50363588],
       [ 0.2350398 ,  0.33509603,  0.4165386 ],
       [ 0.76757395,  0.50134581,  0.93601096]], dtype=float32)

# Re-initializing with the uncached pipeline also works:
sess.run(init2)
sess.run(batch)

array([[ 0.76757395,  0.50134581,  0.93601096],
       [ 0.2350398 ,  0.33509603,  0.4165386 ],
       [ 0.90339011,  0.12458269,  0.30650312],
       [ 0.13266359,  0.82675195,  0.26691398],
       [ 0.58160347,  0.55877644,  0.50363588]], dtype=float32)

# Re-initializing with the cached pipeline again fails:
sess.run(init1)
sess.run(batch)

AlreadyExistsError (see above for traceback): There appears to be a concurrent caching iterator running - cache lockfile already exists ('/home/ubuntu/ai_notebooks/notebooks/projects/deep-purple/cache_dir/cache.lockfile'). If you are sure no other running TF computations are using this cache prefix, delete the lockfile and re-initialize the iterator. Lockfile contents: Created at: 1513187725
	 [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[5,3]], output_types=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]
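The workaround suggested by the error message itself should work as a stopgap. A rough sketch (the lockfile path is simply the cache prefix from above plus the `.lockfile` suffix reported in the error):

# Manually clear the stale lockfile left behind by the previous
# initialization, then re-initialize the iterator on the cached dataset.
lockfile = os.path.join(cache_dir, 'cache.lockfile')
if os.path.exists(lockfile):
    os.remove(lockfile)

sess.run(init1)  # succeeds again once the lockfile is gone
sess.run(batch)

Since the cache never completes in this setup, the lockfile has to be cleared by hand before every re-initialization, which is why this looks like a bug rather than intended behaviour.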

System information

  • TensorFlow version: v1.4.0-rc1-11-g130a514 1.4.0 (installed from pip)
  • Python version: 3.5.2
  • OS: Linux Ubuntu 16.04.3
  • CUDA: 8.0.61
  • cuDNN: 6

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 15 (12 by maintainers)

Most upvoted comments

@saeta Thanks for helping out; that seemed to solve that particular problem.

I was applying the repeat() before caching (both in the example and in my real data loader), so I assume the cache only releases the lock and produces the index file when it reaches tf.errors.OutOfRangeError, which an infinitely repeated input never raises. When I move the repeat() to after the caching and make sure to run through all the data once, this example works.
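Concretely, a rough sketch of the reordered pipeline that works (same data, cache_dir, iterator and session as in the example above, assuming a fresh cache prefix so no stale lockfile is left over from the failed run):

# Cache the finite source dataset first, then shuffle/repeat/batch on top.
cached = dataset.cache(os.path.join(cache_dir, 'cache'))
cached_batches = cached.shuffle(10).repeat().batch(5)

init1 = iterator.make_initializer(cached_batches)

# Run through one full pass of the source (10 samples = 2 batches here) so
# the cache writer reaches the end of its input, writes the index file and
# releases the lockfile.
sess.run(init1)
for _ in range(2):
    sess.run(batch)

# Re-initialization now succeeds; subsequent passes read from the finished cache.
sess.run(init1)
sess.run(batch)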

Perhaps this behaviour could be documented. Some of the interactions (e.g. repeat with cache, shuffle with cache) are pretty non-obvious, and it would save a huge amount of time to have at least some of those details listed.
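For example, the shuffle/cache ordering alone changes behaviour in a way that is easy to miss. My current understanding (happy to be corrected; the cache prefixes here are just placeholders):

# shuffle before cache: the cache materializes one particular shuffled
# order, so every subsequent epoch replays that same order from disk.
frozen = dataset.shuffle(10).cache(os.path.join(cache_dir, 'frozen'))

# cache before shuffle: the cache stores the source order, and shuffle
# draws a fresh order from its buffer on every epoch.
reshuffled = dataset.cache(os.path.join(cache_dir, 'reshuffled')).shuffle(10)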

Thanks again, Agrin

@agrinh Funny you should ask. @jsimsa and I have been working on a datasets performance guide we hope to publish soon, and we cover the recommended order of operations. I’ll make sure repeat vs cache and shuffle vs cache are included. 😃

Nice sleuthing and getting it all working!!