airflow: Clearing of historic Task or DagRuns leads to failed DagRun

Apache Airflow version: 2.0.0

Environment:

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): Linux
  • Install tools:
  • Others:

What happened:

Clearing a DagRun whose execution_date is far enough in the past that the time elapsed since that execution_date exceeds the DagRun timeout causes the DagRun to fail immediately on clear instead of running.

What you expected to happen: The DagRun should enter the Running state

I suspect this is a bug where the existing DagRun's duration is not reset on clear, so the new DagRun times out as soon as it starts.

How to reproduce it:

Create a DAG with a DagRun timeout. Trigger said DAG. Wait for the DagRun timeout plus one minute, then clear the DagRun. The new DagRun will fail immediately.
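To make the reproduction concrete, here is a minimal sketch of such a DAG. This is an illustrative assumption, not code from the report: the dag_id, schedule, timeout value, and task are all placeholders.

```python
# Minimal sketch of a DAG with a DagRun timeout, for reproducing the report.
# All names and values here are illustrative, not taken from the original issue.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dagrun_timeout_clear_repro",    # hypothetical dag_id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,                  # trigger manually
    dagrun_timeout=timedelta(minutes=5),     # short timeout so the repro is quick
    catchup=False,
) as dag:
    # A trivial task that finishes well within the timeout.
    BashOperator(task_id="sleep_briefly", bash_command="sleep 10")
```

Trigger this DAG, wait more than five minutes past the run's execution_date, then clear the run; per the report, the cleared run fails immediately instead of entering the Running state.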

Anything else we need to know: This occurs regardless of whether the DAG/task succeeded or failed. As a result, any DagRun that has timed out can never be successfully cleared.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 7
  • Comments: 20 (7 by maintainers)

Most upvoted comments

We have this problem on our cluster. I found a workaround while we wait for a fix in a newer version: clear the failed tasks from the homepage/DagRun list instead of from the DAG tree view itself. Example in pictures: let's say that after clearing, this execution started failing. If you go to the homepage and search for the DAG, you will see that an execution failed (screenshot). If you click on it (marked in yellow in the screenshot) you will open the DagRun list, where you will be able to see the failed one (screenshot).

Then you can select it and clear the state. It will clear the execution and it will work as expected 🙏🏻 (screenshot)

Probably the workflow for this action (clearing) is different from the one we have in the tree view UI; hope this helps you guys fix it 😃

Looks like a bug, needs fixing - added to 2.1.1 milestone

Thanks for confirming it!

Finally got around to updating Airflow and it seems to be working as expected now.

I think the issue is that in 1.10.x, dagrun_timeout was essentially "how long has this DAG been executing". In 2.0.x it is now "how long has it been since the scheduled start time" (see the sketch after the list below).

I think the old behavior makes more sense, as it:

  1. Enforces a maximum DAG duration (the intended effect)
  2. Still allows a DAG to be re-executed
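For clarity, here is a simplified sketch of the two interpretations described above. This is not Airflow's actual scheduler code; the function names and timeout value are illustrative assumptions.

```python
# Simplified sketch of the two dagrun_timeout interpretations; not Airflow source.
from datetime import datetime, timedelta

dagrun_timeout = timedelta(hours=1)  # illustrative value

def timed_out_1_10_style(run_started_at: datetime, now: datetime) -> bool:
    # 1.10.x-style: "how long has this DAG been executing".
    # Clearing resets the run's start, so a re-executed run gets a fresh budget.
    return now - run_started_at > dagrun_timeout

def timed_out_2_0_style(scheduled_start: datetime, now: datetime) -> bool:
    # 2.0.x-style: "how long has it been since the scheduled start time".
    # A cleared run keeps its original execution_date / scheduled start, so an
    # old run is already past the timeout the moment it is re-created, which
    # matches the "fails immediately on clear" behaviour reported here.
    return now - scheduled_start > dagrun_timeout
```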