tensorflow: Dataset.cache() Followed by Dataset.zip() Throws AlreadyExistsError


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): OSX Mojave 10.14.5
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): pipenv install --pre tensorflow==2.0.0-beta1
  • TensorFlow version (use command below): 2.0.0-beta1
  • Python version: 3.6.8
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A

tf.version.GIT_VERSION output: v2.0.0-beta0-16-g1d91213fe7


Describe the current behavior

  1. Cache a dataset
  2. Derive other datasets from it
  3. Zip the derived datasets into a single dataset
  4. See cache error
AlreadyExistsError: There appears to be a concurrent caching iterator running - cache lockfile already exists

Describe the expected behavior

The dataset should be cached only once, and no error should be thrown.

Code to reproduce the issue

import tensorflow as tf


def map_file_to_xy_dataset(filename, params):
    # Generate a dataset from the filename
    csv_dataset = tf.data.experimental.CsvDataset(...)

    # Cache at a unique file location derived from the filename
    cached_dataset = csv_dataset.cache(...)

    # Generate x dataset (all but the last column)
    x_dataset = cached_dataset.map(lambda *x: x[0:-1])

    # Generate y dataset (last column of each file)
    y_dataset = cached_dataset.map(lambda *x: x[-1])

    # Zip the x, y datasets.
    # This throws AlreadyExistsError.
    xy_dataset = tf.data.Dataset.zip((x_dataset, y_dataset))

    return xy_dataset


def generate_dataset():
    # data_glob and params are assumed to be defined elsewhere
    filenames_dataset = tf.data.Dataset.list_files(data_glob)

    # flat_map passes only the filename, so bind params explicitly
    dataset = filenames_dataset.flat_map(
        lambda filename: map_file_to_xy_dataset(filename, params))

    return dataset


Other info / logs

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

This behavior is expected: the input pipeline definition tries to create the same file cache twice (once for each component of the zip), so the cache is not reused.

Instead, you should do the following (which will also be more efficient):

def map_file_to_dataset(filename, params):
    # Generate a dataset from the filename
    dataset = tf.data.experimental.CsvDataset(...)

    # Cache once, then emit (x, y) tuples from a single map, so the cache is created only once
    return dataset.cache(...).map(lambda *x: (x[0:-1], x[-1]))
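
For comparison, here is a minimal runnable sketch of that pattern, using Dataset.range as a stand-in for the CsvDataset (the synthetic three-column rows below are illustrative, not from the issue):

import tensorflow as tf

# Stand-in for the CsvDataset: synthetic rows of three columns
dataset = tf.data.Dataset.range(10).map(lambda i: (i, i + 1, i * 2))

# One cache, one map producing (x, y) tuples directly; no zip, so the
# cache is instantiated only once
xy_dataset = dataset.cache().map(lambda *row: (row[0:-1], row[-1]))

for x, y in xy_dataset:
    print(x, y)  # x is a tuple of the leading columns, y the last column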

@gadagashwini @jsimsa I am working on the fix. The issue can be reproduced by the simplified code below:

csv_dataset = tf.data.Dataset.range(0, 1000, 1)
cached_dataset = csv_dataset.cache("cache_test")
x_dataset = cached_dataset.map(lambda x: x)
y_dataset = cached_dataset.map(lambda x: x * 2)
xy_dataset = tf.data.Dataset.zip((x_dataset, y_dataset))

for x, y in xy_dataset:
    print(x, y)

@devstein The workaround is to change cached_dataset = csv_dataset.cache("cache_test") to cached_dataset = csv_dataset.cache(), which uses the in-memory cache instead of the file cache.
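
A minimal sketch of that workaround, applied to the simplified repro above:

import tensorflow as tf

csv_dataset = tf.data.Dataset.range(0, 1000, 1)
cached_dataset = csv_dataset.cache()  # in-memory cache: no lockfile, so the zip succeeds
x_dataset = cached_dataset.map(lambda x: x)
y_dataset = cached_dataset.map(lambda x: x * 2)
xy_dataset = tf.data.Dataset.zip((x_dataset, y_dataset))

for x, y in xy_dataset:
    print(x, y)  # pairs of tf.Tensor(i) and tf.Tensor(2 * i)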