tfx: TFX 1.14.0 causing Google Cloud Dataflow jobs to fail

System information

  • Have I specified the code to reproduce the issue: Yes
  • Environment in which the code is executed: Google Cloud Dataflow
  • TensorFlow version: 2.13.0
  • TFX Version: 1.14.0
  • Python version: 3.8.10
  • Python dependencies: Docker Image

Describe the current behavior When running the BigQueryExampleGen component on Google Cloud Dataflow using TFX 1.14.0, the Dataflow job gets stuck and eventually fails with the error: Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.

Describe the expected behavior The Dataflow job should run to completion without failing or getting stuck.

Standalone code to reproduce the issue

# Imports added to make the snippet self-contained; module paths are inferred and were not part of the original report.
from absl import logging

from tfx import v1 as tfx
from tfx.extensions.google_cloud_big_query.example_gen.component import BigQueryExampleGen
from tfx.orchestration import pipeline
from tfx.orchestration.kubeflow.kubeflow_dag_runner import get_default_pipeline_operator_funcs
from tfx.proto import example_gen_pb2

# GOOGLE_CLOUD_PROJECT, TEMP_LOCATION, GOOGLE_CLOUD_REGION, SUBNETWORK, GROUP, TEAM,
# PROJECT, PIPELINE_NAME and PIPELINE_ROOT are assumed to be defined elsewhere.
PIPELINE_IMAGE = "tensorflow/tfx:1.14.0"
DATAFLOW_BEAM_PIPELINE_ARGS = [
    f"--project={GOOGLE_CLOUD_PROJECT}",
    "--runner=DataflowRunner",
    f"--temp_location={TEMP_LOCATION}",
    f"--region={GOOGLE_CLOUD_REGION}",
    "--disk_size_gb=50",
    "--machine_type=e2-standard-8",
    "--experiments=use_runner_v2",
    f"--subnetwork={SUBNETWORK}",
    f"‑‑experiments=use_sibling_sdk_workers",
    f"--sdk_container_image={PIPELINE_IMAGE}",
    f"--labels=group={GROUP}",
    f"--labels=team={TEAM}",
    f"--labels=project={PROJECT}",
    "--job_name=test-tfx-1-14",
]

def run():
    query = {
        "train": "SELECT 1 as one",
        "eval": "SELECT 1 as one",
        "test": "SELECT 1 as one",
    }

    input_config = example_gen_pb2.Input(
        splits=[
            example_gen_pb2.Input.Split(name="train", pattern=query["train"]),
            example_gen_pb2.Input.Split(name="eval", pattern=query["eval"]),
            example_gen_pb2.Input.Split(name="test", pattern=query["test"]),
        ]
    )

    BQ_BEAM_ARGS = [
        f"--project={GOOGLE_CLOUD_PROJECT}",
        f"--temp_location={TEMP_LOCATION}",
    ]

    example_gen = BigQueryExampleGen(
        input_config=input_config
    ).with_beam_pipeline_args(DATAFLOW_BEAM_PIPELINE_ARGS)

    metadata_config = (
        tfx.orchestration.experimental.get_default_kubeflow_metadata_config()
    )
    pipeline_operator_funcs = get_default_pipeline_operator_funcs()

    runner_config = tfx.orchestration.experimental.KubeflowDagRunnerConfig(
        kubeflow_metadata_config=metadata_config,
        tfx_image=PIPELINE_IMAGE,
        pipeline_operator_funcs=pipeline_operator_funcs
    )

    pod_labels = {
        "add-pod-env": "true",
        tfx.orchestration.experimental.LABEL_KFP_SDK_ENV: "tfx-template",
    }

    tfx.orchestration.experimental.KubeflowDagRunner(
        config=runner_config,
        pod_labels_to_attach=pod_labels
    ).run(
        pipeline=pipeline.Pipeline(
            pipeline_name=PIPELINE_NAME,
            pipeline_root=PIPELINE_ROOT,
            components=[example_gen],
            beam_pipeline_args=BQ_BEAM_ARGS
        )
    )


if __name__ == "__main__":
    logging.set_verbosity(logging.INFO)
    run()

Other info / logs The job fails after 1 hour, regardless of the machine type or query used. Setting PIPELINE_IMAGE to tensorflow/tfx:1.13.0 also fails; it currently only works with tensorflow/tfx:1.12.0.


About this issue

  • Original URL
  • State: open
  • Created 8 months ago
  • Comments: 25 (9 by maintainers)

Most upvoted comments

This also didn’t work for me the first time I tried it. Then I realised you also need to make sure your custom image is used by Dataflow by adding f"--sdk_container_image={PIPELINE_IMAGE}", to BEAM_DATAFLOW_PIPELINE_ARGS.
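For anyone following along, here is a minimal sketch of what that looks like; the image tag and the placeholder variables are assumptions, not values taken from this thread, and only the flag names come from the discussion above:

# Sketch only: CUSTOM_IMAGE and the GOOGLE_CLOUD_* / TEMP_LOCATION placeholders are hypothetical.
CUSTOM_IMAGE = "gcr.io/my-project/my-tfx-image:1.14.0"  # hypothetical image built on top of tensorflow/tfx:1.14.0

DATAFLOW_BEAM_PIPELINE_ARGS = [
    f"--project={GOOGLE_CLOUD_PROJECT}",
    "--runner=DataflowRunner",
    f"--temp_location={TEMP_LOCATION}",
    f"--region={GOOGLE_CLOUD_REGION}",
    "--experiments=use_runner_v2",
    # Without this flag the Dataflow workers run the stock Beam SDK container,
    # so anything baked into CUSTOM_IMAGE never reaches the workers.
    f"--sdk_container_image={CUSTOM_IMAGE}",
]

In the repro above, the same image is also passed as tfx_image to KubeflowDagRunnerConfig so that the KFP pods and the Dataflow workers use the same container.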

@IzakMaraisTAL,

I tried to update the ENV variable in the TFX Dockerfile and build the image, but it takes forever because of #6468: the TFX dependencies take a long time to install and the installation ultimately fails. Once that issue is fixed, I will be able to integrate the environment variable into the Dockerfile and test it. Thanks.
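In case it helps others try the workaround, here is a minimal sketch of building such a custom image. The ENV name (RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT) is my assumption of the variable being discussed, and the image tag is hypothetical; the thread itself does not spell either out:

# Sketch: extend the TFX base image with the environment variable and build/push it.
# RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT and the image tag are assumptions, not confirmed here.
import pathlib
import subprocess

CUSTOM_IMAGE = "gcr.io/my-project/tfx-with-env:1.14.0"  # hypothetical tag

dockerfile = (
    "FROM tensorflow/tfx:1.14.0\n"
    "ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1\n"
)

pathlib.Path("Dockerfile").write_text(dockerfile)
subprocess.run(["docker", "build", "-t", CUSTOM_IMAGE, "."], check=True)
subprocess.run(["docker", "push", CUSTOM_IMAGE], check=True)

The resulting image would then be passed to Dataflow via --sdk_container_image as described above.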

@jonathan-lemos, Thank you for bringing this up. This should be fixed once issue #6468 is resolved.

@IzakMaraisTAL, Yes, it makes more sense to add the environment variable to the TFX base image to avoid these issues in the future. I have to make sure that it doesn’t break any other scenarios where the Dockerfile is used apart from Dataflow. Reopening this issue. We will update this thread. Thank you for bringing this up!

@singhniraj08, should this environment variable not be added to the TFX base image before the issue is closed? Isn’t the TFX base image intended to be used to run TFX jobs (on Vertex AI or Kubeflow)? Those TFX jobs might reasonably include Dataflow components.


@singhniraj08 I have added that flag, but I am still getting the same error.

Dataflow job id: 2023-11-17_04_23_23-6315535713304255245