pipelines: S3 errors in Pipeline examples for reading training data and artifact storage
I am working with the Taxi Cab pipeline example and need to replace GCS storage with Minio (S3-compatible) for storing training data and eval data, and for passing data from step to step in Argo workflows: "pipelines/samples/notebooks/KubeFlow Pipeline Using TFX OSS Components.ipynb"
The issue with s3:// protocol support appears to be specific to the TFDV/Apache Beam step: the Beam Python SDK does not provide an S3 filesystem implementation. For now we are looking for a way to change the TFDV step to use local/attached storage.
Minio access parameters seem to be properly configured: the validation step successfully creates several folders in the Minio bucket, for example: demo04kubeflow/output/tfx-taxi-cab-classification-pipeline-example-ht94b/validation
The error occurs on reading or writing any files in the Minio buckets, and it comes from TensorFlow/Beam inside tfdv.generate_statistics_from_csv():
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 92, in get_filesystem
raise ValueError('Unable to get the Filesystem for path %s' % path)
ValueError: Unable to get the Filesystem for path s3://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv
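For reference, the failing lookup can be reproduced outside the pipeline in a couple of lines (a minimal sketch; the path is taken from the traceback above):

# Beam picks a FileSystem implementation by URI scheme. The Python SDK at
# this point registers no handler for s3://, hence the ValueError.
from apache_beam.io.filesystems import FileSystems

FileSystems.get_filesystem(
    's3://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv')
# ValueError: Unable to get the Filesystem for path s3://...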
Minio files are accessed via the s3:// protocol; for example, in the attached notebook (PipelineTFX4.ipynb.txt): OUTPUT_DIR = 's3://demo04kubeflow/output'
This same step worked fine when train.csv was stored in a GCS bucket: gs://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv
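In other words, the same TFDV call succeeds or fails purely based on the path scheme (a sketch; the s3:// call is where the ValueError above is raised):

import tensorflow_data_validation as tfdv

# Works: Beam's Python SDK ships a gs:// filesystem implementation.
stats = tfdv.generate_statistics_from_csv(
    data_location='gs://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv')

# Fails: no s3:// filesystem is registered, so Beam raises ValueError.
stats = tfdv.generate_statistics_from_csv(
    data_location='s3://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv')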
Minio credentials were provided as environment variables on the ContainerOp:
# Imports assumed by this snippet (as in the notebook):
import kfp.dsl as dsl
from kubernetes import client as k8sc

# Inside the step-factory function for the validation step:
return dsl.ContainerOp(
    name=step_name,
    image=DATAFLOW_TFDV_IMAGE,
    arguments=[
        '--csv-data-for-inference', inference_data,
        '--csv-data-to-validate', validation_data,
        '--column-names', column_names,
        '--key-columns', key_columns,
        '--project', project,
        '--mode', mode,
        '--output', validation_output,
    ],
    file_outputs={
        'schema': '/schema.txt',
    },
).add_env_variable(
    k8sc.V1EnvVar(name='S3_ENDPOINT', value=S3_ENDPOINT)
).add_env_variable(
    k8sc.V1EnvVar(name='AWS_ENDPOINT_URL', value='https://{}'.format(S3_ENDPOINT))
).add_env_variable(
    k8sc.V1EnvVar(name='AWS_ACCESS_KEY_ID', value=S3_ACCESS_KEY)
).add_env_variable(
    k8sc.V1EnvVar(name='AWS_SECRET_ACCESS_KEY', value=S3_SECRET_KEY)
).add_env_variable(
    k8sc.V1EnvVar(name='AWS_REGION', value='us-east-1')
).add_env_variable(
    k8sc.V1EnvVar(name='BUCKET_NAME', value='demo04kubeflow')
).add_env_variable(
    k8sc.V1EnvVar(name='S3_USE_HTTPS', value='1')
).add_env_variable(
    k8sc.V1EnvVar(name='S3_VERIFY_SSL', value='1')
)
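When several steps need the same configuration, the chain of add_env_variable calls can be factored out (a sketch; add_minio_env is a hypothetical helper, and the S3_* variables are the ones defined in the notebook above):

def add_minio_env(op):
    """Attach the Minio endpoint and credential env vars to a ContainerOp."""
    for name, value in [
        ('S3_ENDPOINT', S3_ENDPOINT),
        ('AWS_ENDPOINT_URL', 'https://{}'.format(S3_ENDPOINT)),
        ('AWS_ACCESS_KEY_ID', S3_ACCESS_KEY),
        ('AWS_SECRET_ACCESS_KEY', S3_SECRET_KEY),
        ('AWS_REGION', 'us-east-1'),
        ('BUCKET_NAME', 'demo04kubeflow'),
        ('S3_USE_HTTPS', '1'),
        ('S3_VERIFY_SSL', '1'),
    ]:
        op = op.add_env_variable(k8sc.V1EnvVar(name=name, value=value))
    return op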
This pipeline example was created from a Jupyter notebook running on the same Kubernetes cluster as Kubeflow Pipelines, Argo, and Minio. Please see the attached Jupyter notebook and the two log files from the pipeline execution (validate step). All required files (such as train.csv) were uploaded to Minio from the notebook. tfx-taxi-cab-classification-pipeline-example-wait.log
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 35 (17 by maintainers)
Commits related to this issue
- Create a script to make updating application resources easier. (#596) * This is a replacement for https://github.com/kubeflow/manifests/blob/master/hack/update-instance-labels.sh * We don't want t... — committed to Linchin/pipelines by jlewi 4 years ago
- Permit pipeline-runner to operate on runs (#596) — committed to red-hat-data-services/data-science-pipelines by pugangxa 3 years ago
@rummens At the same time, work is in progress on a library that makes it possible to mount any S3 bucket into pipeline steps or Jupyter notebooks. The library is based on the Goofys file system and should be ready as soon as next week. It allows various frameworks such as Beam and Keras to access data in S3 buckets via a POSIX interface, and it supports S3, Minio, GCS, and Ceph: https://github.com/kahing/goofys. Currently working on documentation and examples.
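The practical effect is that pipeline code stays on plain local paths (a sketch, assuming the bucket is mounted at a hypothetical /mnt/s3):

import tensorflow_data_validation as tfdv

# With the bucket mounted via Goofys at /mnt/s3, TFDV (and anything else
# that only understands local paths) works without s3:// support in Beam.
stats = tfdv.generate_statistics_from_csv(
    data_location='/mnt/s3/tfx/taxi-cab-classification/train.csv')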
Thank you for the update @Jeffwan. NFS storage can be acceptable for demos, or even for real use cases where data sets are relatively small, but NFS has limitations in cost, performance, and scalability. When data sets exceed 100 GB, NFS becomes too expensive, slow, and hard to manage. Even when using managed NFS from cloud providers, we experienced very long delays simply reading directories with a large number of files. The preferred approach is to store data in object storage.
We have explored several ways to mount object storage as a POSIX-like file system, with support for S3, GCS, Minio, and Ceph. Based on our testing, Goofys had the best reliability and performance, and we decided to use it for creating PVs and PVCs for pipeline steps. We are preparing to publish a working example soon.
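In pipeline code, consuming such a volume reduces to attaching the PVC to each step (a minimal sketch; the claim name s3-data-pvc and the mount path are hypothetical, and DATAFLOW_TFDV_IMAGE is the image from the snippet above):

import kfp.dsl as dsl
from kubernetes import client as k8sc

op = dsl.ContainerOp(
    name='validation',
    image=DATAFLOW_TFDV_IMAGE,
    arguments=['--output', '/mnt/s3/output'],
)
# Attach a PVC whose PV is backed by a Goofys mount of the bucket.
op.add_volume(k8sc.V1Volume(
    name='s3-data',
    persistent_volume_claim=k8sc.V1PersistentVolumeClaimVolumeSource(
        claim_name='s3-data-pvc')))
op.add_volume_mount(k8sc.V1VolumeMount(name='s3-data', mount_path='/mnt/s3'))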
@gautamkmr Check #3185. The blocker is now on the TFX side.
@mameshini Hi, I am trying to store data from Kubeflow in MinIO, but I have run into an issue with the mounting. It seems I need to modify the YAML file, but I do not know how. Do you have an example, or any hints on how to modify it, please?
@aronchick We are currently implementing a storage management approach that mounts an S3 bucket as a Kubernetes volume, using an s3fs dynamic provisioner. It's not just Apache Beam; Keras and other libraries also can't handle S3 or Minio. We can mount S3/GCS/Minio buckets as volumes and access them as a POSIX file system, with an optional caching layer. I can share working examples soon; fingers crossed, performance seems acceptable so far. I am holding off on filing a bug because we may be able to solve this problem better with Kubernetes storage provisioners. A lot of Python libraries require a file system to work with.
Ack, I'll loop in some folks, but it sounds like the issue is actually outside Kubeflow and in Apache Beam.
You might want to repost the issue in the Apache Beam tracker: https://issues.apache.org/jira/browse/BEAM-2500
Or in TensorFlow Data Validation: https://github.com/tensorflow/data-validation
If your goal is to try out Pipelines, did you consider using some other example, or creating a new one that doesn't use TFX?