tfx: {SPAN} doesn't work as expected with GCS

System information

  • Have I specified the code to reproduce the issue (Yes, No): Yes
  • Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Vertex AI Pipeline, Vertex AI Notebook, GCS storage
  • TensorFlow version: 2.6
  • TFX Version: 1.2.0
  • Python version: 3.7
  • Python dependencies (from pip freeze output): None

Describe the current behavior

First of all, I have the CIFAR10 dataset in the following locations:

  • gs://cifar10-csp-public/cifar10/span-1/train/train.tfrecord
  • gs://cifar10-csp-public/cifar10/span-1/test/test.tfrecord

With ImportExampleGen as defined below, it failed to get the dataset from the specified pattern paths.


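# Imports assumed here; the original snippet did not show them.
from tfx import v1 as tfx
from tfx.proto import example_gen_pb2

data_path = "gs://cifar10-csp-public"

# {SPAN} is a placeholder that ExampleGen is expected to resolve to a span number.
input_config = example_gen_pb2.Input(splits=[
    example_gen_pb2.Input.Split(name='train',
                                pattern='cifar10/span-{SPAN}/train/*'),
    example_gen_pb2.Input.Split(name='val',
                                pattern='cifar10/span-{SPAN}/test/*')
])

example_gen = tfx.components.ImportExampleGen(input_base=data_path, input_config=input_config)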

Inspecting the logs shows that it complains the files don’t exist.

OSError: No files found based on the file pattern gs://cifar10-csp-public/cifar10/span-{SPAN}/train/*

Describe the expected behavior

The expected behaviour is that ImportExampleGen correctly retrieves the data when {SPAN} is specified. As it didn’t work as expected, I tried the code below:

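# Imports assumed here; the original snippet did not show them.
from tfx import v1 as tfx
from tfx.components.example_gen import utils
from tfx.proto import example_gen_pb2

data_path = "gs://cifar10-csp-public"

splits = [
    example_gen_pb2.Input.Split(name='train', pattern='span-{SPAN}/train/*'),
    example_gen_pb2.Input.Split(name='val', pattern='span-{SPAN}/test/*')
]
# Resolve the span (and version) up front instead of leaving {SPAN} to the runner.
_, span, version = utils.calculate_splits_fingerprint_span_and_version(data_path, splits)

# Rebuild the input config with the resolved span number substituted in.
input_config = example_gen_pb2.Input(splits=[
    example_gen_pb2.Input.Split(name='train', pattern=f'span-{span}/train/*'),
    example_gen_pb2.Input.Split(name='val', pattern=f'span-{span}/test/*')
])

example_gen = tfx.components.ImportExampleGen(input_base=data_path, input_config=input_config)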

With the utility function calculate_splits_fingerprint_span_and_version, it works fine now. However, I just wonder why it didn’t work in the first place. Doesn’t ImportExampleGen use calculate_splits_fingerprint_span_and_version function internally?
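For readers unfamiliar with what resolving {SPAN} means, here is a minimal pure-Python sketch of the idea: find the largest span number among the paths that match a pattern, then substitute it back in. This is only an illustration and not TFX's actual implementation; the helper name latest_span and the example paths are hypothetical.

```python
import re


def latest_span(pattern, paths):
    """Return the largest span number in `paths` matching `pattern`, or None."""
    # Escape the pattern, then turn the escaped {SPAN} into a digit group
    # and the escaped '*' into a wildcard.
    regex = re.escape(pattern)
    regex = regex.replace(re.escape("{SPAN}"), r"(?P<span>\d+)")
    regex = regex.replace(re.escape("*"), ".*")

    spans = []
    for path in paths:
        match = re.fullmatch(regex, path)
        if match:
            spans.append(int(match.group("span")))
    return max(spans, default=None)


# Hypothetical paths relative to the input base.
paths = [
    "cifar10/span-1/train/train.tfrecord",
    "cifar10/span-2/train/train.tfrecord",
    "cifar10/span-1/test/test.tfrecord",
]
pattern = "cifar10/span-{SPAN}/train/*"
span = latest_span(pattern, paths)
resolved = pattern.replace("{SPAN}", str(span))
```

If this substitution step is skipped, the literal string {SPAN} is used as a glob, which would be consistent with the OSError shown above.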

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 76 (1 by maintainers)

Most upvoted comments

FYI, this PR should fix {SPAN} for Vertex (KubeflowV2DagRunner): https://github.com/tensorflow/tfx/pull/4347

General Resolver support on KFPV2DagRunner is not currently planned, but we are looking into different options. Hopefully next year.

Yep, for 4, if the input data can be reorganized with span info in it, you can still use {SPAN},

e.g., root/datablock-1-1/* and root/datablock-1-2/* with pattern root/datablock-{SPAN}-/
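As a rough illustration of that layout, the sketch below shows how several data blocks could share one span number and be collected together once the span is substituted into the pattern. The full pattern root/datablock-{SPAN}-*/* is an assumption (the pattern quoted above appears incomplete), and files_for_span and the file names are hypothetical. Note that fnmatch lets '*' match across '/', which is looser than a real filesystem glob.

```python
import fnmatch


def files_for_span(pattern, span, paths):
    """Substitute `span` into `pattern` and return the matching paths, sorted."""
    concrete = pattern.replace("{SPAN}", str(span))
    return sorted(fnmatch.filter(paths, concrete))


# Hypothetical layout: two blocks belong to span 1, one block to span 2.
paths = [
    "root/datablock-1-1/a.tfrecord",
    "root/datablock-1-2/b.tfrecord",
    "root/datablock-2-1/c.tfrecord",
]
# Assumed pattern: span number first, then any block suffix.
pattern = "root/datablock-{SPAN}-*/*"
span_one_files = files_for_span(pattern, 1, paths)
```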

Yep, those features are available in LocalDagRunner and KFPDagRunner, but KFPV2DagRunner needs that PR.

Before the release, you can use our nightly packages instead of the released package once that PR is merged:

pip install -i https://pypi-nightly.tensorflow.org/simple tfx

https://github.com/tensorflow/tfx/blob/master/tfx/tools/docker/README.md

Yep, you need to wait for the PR to be merged and released (hopefully within a month), or you need to build your own tfx packages/containers.

Yeah, I see. Resolver doesn’t support RuntimeParameter anyway. Let me try your suggestion, and if I can’t get it working, I’ll ask for your help again.

We don’t have an example for the RangeConfig proto + runtime param, but we have an example for another proto.

You just need to configure ExampleGen’s range_config. Resolver is normally a fixed rolling range for the rolling-window use case.
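A minimal sketch of configuring range_config on ExampleGen, assuming the dataset layout from the original report and that span 1 is the span to ingest. It uses a static range rather than a runtime parameter, and the imports are assumed, since the thread does not show them.

```python
from tfx import v1 as tfx
from tfx.proto import example_gen_pb2, range_config_pb2

# Same span-based patterns as in the original report.
input_config = example_gen_pb2.Input(splits=[
    example_gen_pb2.Input.Split(name='train',
                                pattern='cifar10/span-{SPAN}/train/*'),
    example_gen_pb2.Input.Split(name='val',
                                pattern='cifar10/span-{SPAN}/test/*'),
])

# Pin ExampleGen to a single span (span 1 is an assumption for illustration).
range_config = range_config_pb2.RangeConfig(
    static_range=range_config_pb2.StaticRange(
        start_span_number=1, end_span_number=1))

example_gen = tfx.components.ImportExampleGen(
    input_base='gs://cifar10-csp-public',
    input_config=input_config,
    range_config=range_config)
```

Without a range_config, ExampleGen is expected to pick the latest span it can find for the pattern.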

Thanks, it seems to be a bug in KFPV2DagRunner (LocalDagRunner should work). Let me check.