tensorflow: Dataset.cache() Followed by Dataset.zip() Throws AlreadyExistsError


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): OSX Mojave 10.14.5
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): pipenv install --pre tensorflow==2.0.0-beta1
  • TensorFlow version (use command below): 2.0.0-beta1
  • Python version: 3.6.8
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A

tf.version.GIT_VERSION output: v2.0.0-beta0-16-g1d91213fe7


Describe the current behavior

  1. Cache a dataset
  2. Derive other datasets from it
  3. Zip the derived datasets into a single dataset
  4. See cache error
AlreadyExistsError: There appears to be a concurrent caching iterator running - cache lockfile already exists

Describe the expected behavior

The dataset should be cached only once, and no error should be thrown.

Code to reproduce the issue

import tensorflow as tf


def map_file_to_xy_dataset(filename, params):
    # Generate a dataset from the filename
    csv_dataset = tf.data.experimental.CsvDataset(...)

    # Cache at a unique file location derived from the filename
    cached_dataset = csv_dataset.cache(...)

    # Generate x dataset (all but the last column)
    x_dataset = cached_dataset.map(lambda *x: x[0:-1])

    # Generate y dataset (last column of each file)
    y_dataset = cached_dataset.map(lambda *x: x[-1])

    # Zip the x, y datasets.
    # This throws AlreadyExistsError.
    xy_dataset = tf.data.Dataset.zip((x_dataset, y_dataset))

    return xy_dataset


def generate_dataset():
    # data_glob and params are assumed to be defined elsewhere
    filenames_dataset = tf.data.Dataset.list_files(data_glob)

    # flat_map passes only the filename, so bind params explicitly
    dataset = filenames_dataset.flat_map(
        lambda filename: map_file_to_xy_dataset(filename, params))

    return dataset


Other info / logs

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

This behavior is expected: the input pipeline definition tries to create the same file cache twice (once for each component of the zip), so the cache is not reused.

Instead, you should do the following (which will also be more efficient):

def map_file_to_dataset(filename, params):
    # Generate a dataset from the filename
    dataset = tf.data.experimental.CsvDataset(...)

    # Cache once, then emit (x, y) tuples from a single map, so the cache is created only once
    return dataset.cache(...).map(lambda *x: (x[0:-1], x[-1]))
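
For comparison, here is a minimal runnable sketch of that pattern, using Dataset.range as a stand-in for the CsvDataset (the synthetic three-column rows below are illustrative, not from the issue):

import tensorflow as tf

# Stand-in for the CsvDataset: synthetic rows of three columns
dataset = tf.data.Dataset.range(10).map(lambda i: (i, i + 1, i * 2))

# One cache, one map producing (x, y) tuples directly; no zip, so the
# cache is instantiated only once
xy_dataset = dataset.cache().map(lambda *row: (row[0:-1], row[-1]))

for x, y in xy_dataset:
    print(x, y)  # x is a tuple of the leading columns, y the last column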

@gadagashwini @jsimsa I am working on the fix. The issue can be reproduced by the simplified code below:

csv_dataset = tf.data.Dataset.range(0, 1000, 1)
cached_dataset = csv_dataset.cache("cache_test")
x_dataset = cached_dataset.map(lambda x: x)
y_dataset = cached_dataset.map(lambda x: x * 2)
xy_dataset = tf.data.Dataset.zip((x_dataset, y_dataset))

for x, y in xy_dataset:
    print(x, y)

@devstein The workaround is to change cached_dataset = csv_dataset.cache("cache_test") to cached_dataset = csv_dataset.cache(), which uses the in-memory cache instead of the file cache.
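
A minimal sketch of that workaround, applied to the simplified repro above:

import tensorflow as tf

csv_dataset = tf.data.Dataset.range(0, 1000, 1)
cached_dataset = csv_dataset.cache()  # in-memory cache: no lockfile, so the zip succeeds
x_dataset = cached_dataset.map(lambda x: x)
y_dataset = cached_dataset.map(lambda x: x * 2)
xy_dataset = tf.data.Dataset.zip((x_dataset, y_dataset))

for x, y in xy_dataset:
    print(x, y)  # pairs of tf.Tensor(i) and tf.Tensor(2 * i)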