tfx: Evaluator fails with ValueError in Airflow when caching is turned off and retries are allowed

When running a pipeline in Airflow, if enable_cache is set to False, but retries are allowed, a component can run more than once and generate more than one artifact. A downstream component that expects only one artifact will fail with a ValueError because it expects only one artifact.

{{base_task_runner.py:115}} INFO - Job 750: Subtask Evaluator     (len(input_dict[constants.MODEL_KEY])))
{{base_task_runner.py:115}} INFO - Job 750: Subtask Evaluator ValueError: There can be only one candidate model, there are 2.

As a workaround, I’ve been setting enable_cache to True to avoid unexpected behavior due to retries.

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Comments: 21 (10 by maintainers)

Most upvoted comments

I tried to create a reproduction of Airflow retry and similar code pasted as @casassg but with a bit more determinism:

https://gist.github.com/zhitaoli/b2d92f8ad04d98d99974513563149d33

I was able to reproduce the error for once, but after upgrading to tfx 1.0.0 the issue was shadowed by the following error stack:

https://gist.github.com/zhitaoli/7dbaaa42abd8aa78cb54d52a266cd0ee

I’ll dig a bit more with @hughmiao to see whether this is fixable.

Original error might be fixable with https://github.com/tensorflow/tfx/pull/4093 but without fixing above I cannot promise yet.

Same issue here with tfx.orchestration.kubeflow.kubeflow_dag_runner.KubeflowDagRunner