tfx: Evaluator fails with ValueError in Airflow when caching is turned off and retries are allowed
When running a pipeline in Airflow, if enable_cache is set to False, but retries are allowed, a component can run more than once and generate more than one artifact. A downstream component that expects only one artifact will fail with a ValueError because it expects only one artifact.
{{base_task_runner.py:115}} INFO - Job 750: Subtask Evaluator (len(input_dict[constants.MODEL_KEY])))
{{base_task_runner.py:115}} INFO - Job 750: Subtask Evaluator ValueError: There can be only one candidate model, there are 2.
As a workaround, I’ve been setting enable_cache to True to avoid unexpected behavior due to retries.
About this issue
- Original URL
- State: open
- Created 4 years ago
- Comments: 21 (10 by maintainers)
I tried to create a reproduction of Airflow retry and similar code pasted as @casassg but with a bit more determinism:
https://gist.github.com/zhitaoli/b2d92f8ad04d98d99974513563149d33
I was able to reproduce the error for once, but after upgrading to tfx 1.0.0 the issue was shadowed by the following error stack:
https://gist.github.com/zhitaoli/7dbaaa42abd8aa78cb54d52a266cd0ee
I’ll dig a bit more with @hughmiao to see whether this is fixable.
Original error might be fixable with https://github.com/tensorflow/tfx/pull/4093 but without fixing above I cannot promise yet.
Same issue here with tfx.orchestration.kubeflow.kubeflow_dag_runner.KubeflowDagRunner