airflow: Clearing of historic Task or DagRuns leads to failed DagRun
Apache Airflow version: 2.0.0
Environment:
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release): Amazon Linux 2
- Kernel (e.g. `uname -a`): Linux
- Install tools:
- Others:
What happened:
Clearing a DagRun with an execution_date far enough in the past that the difference between now and that execution_date is greater than the DagRun timeout leads to the DagRun failing on clear, and not running.
What you expected to happen: The DagRun should enter the Running state
I anticipate this is a bug where the existing DagRun duration is not being reset, so the new DagRun times out as soon as it starts.
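If that hypothesis holds, the timeout check would behave roughly like the stdlib-only simulation below. This is an illustrative sketch, not Airflow's actual implementation: `is_timed_out`, and the way the reference point is chosen, are assumptions for demonstration.

```python
from datetime import datetime, timedelta, timezone

def is_timed_out(execution_date: datetime, now: datetime,
                 dagrun_timeout: timedelta) -> bool:
    """Suspected behavior: the timeout is measured from the original
    execution_date, not from when the (cleared) run actually restarts."""
    return now - execution_date > dagrun_timeout

timeout = timedelta(minutes=5)
execution_date = datetime(2021, 1, 1, 0, 0, tzinfo=timezone.utc)

# First run, 2 minutes in: not timed out yet.
first_check = is_timed_out(execution_date, execution_date + timedelta(minutes=2), timeout)

# Clear the run 6 minutes after execution_date: elapsed time is still
# measured from execution_date, so the "new" run is already past the
# timeout and fails immediately.
cleared_check = is_timed_out(execution_date, execution_date + timedelta(minutes=6), timeout)

print(first_check, cleared_check)  # False True
```

Under this model, any run cleared more than `dagrun_timeout` after its execution_date can never succeed, matching the behavior reported below.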
How to reproduce it:
Create a DAG with a DagRun timeout. Trigger said DAG. Wait for the DagRun timeout + 1 minute, then clear said DagRun. The new DagRun will immediately fail.
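A minimal DAG for reproducing this might look like the following sketch (the `dag_id`, task, and timeout value are arbitrary; this is a config-style fragment that requires a running Airflow 2.0 install to exercise):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Any DAG with a short dagrun_timeout should reproduce the issue:
# trigger it manually, wait past the timeout, then clear the run.
with DAG(
    dag_id="dagrun_timeout_repro",        # arbitrary id for this sketch
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,               # manual triggers only
    dagrun_timeout=timedelta(minutes=5),  # the timeout being exceeded
) as dag:
    BashOperator(task_id="sleep_briefly", bash_command="sleep 1")
```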
Anything else we need to know: This occurs regardless of whether the DAG/Task succeeded or failed. Also, any DagRun that has timed out can never be cleared.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 7
- Comments: 20 (7 by maintainers)
We have this problem on our cluster. I found a workaround until it is fixed in a newer version: clear the failed tasks from the homepage/DagRun list instead of from the DAG's tree view itself. Example in pictures: let's say that after clearing, this execution started failing. If you go to the homepage and search for the DAG, you will see that an execution failed. If you click on it (marked in yellow in the image), you will open the DagRun list, where you will be able to see the failed one:
Then you can select it and clear the state. It will clear the execution and it will work as expected 🙏🏻
Probably the workflow for this action (clearing) is different from the one in the tree view UI; hope this helps you guys fix it 😃
Looks like a bug, needs fixing - added to 2.1.1 milestone
Thanks for confirming it!
Finally got around to updating Airflow and it seems to be working as expected now.
Is this or #14265 a candidate for https://github.com/apache/airflow/milestone/21?
I think the issue is this: in 1.10.x, dagrun_timeout was essentially "how long has this DAG been executing". In 2.0.x it is now "how long has it been since the scheduled start time". I think the old behavior makes more sense, as it:
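The two interpretations of dagrun_timeout described above can be sketched as follows (stdlib-only; the variable names are illustrative, not Airflow internals):

```python
from datetime import datetime, timedelta, timezone

timeout = timedelta(minutes=5)
scheduled_start = datetime(2021, 1, 1, 0, 0, tzinfo=timezone.utc)  # execution_date
actual_start = scheduled_start + timedelta(minutes=30)  # when the cleared run restarts
now = actual_start + timedelta(minutes=1)

# 1.10.x interpretation: "how long has this DAG been executing"
old_timed_out = (now - actual_start) > timeout       # only 1 minute running

# 2.0.x interpretation: "how long since the scheduled start time"
new_timed_out = (now - scheduled_start) > timeout    # 31 minutes since schedule

print(old_timed_out, new_timed_out)  # False True
```

Under the old semantics the cleared run gets a fresh timeout window; under the new semantics it is dead on arrival whenever the clear happens more than `dagrun_timeout` after the scheduled start.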