airflow: LocalTaskJob heartbeat race condition with finishing task causing SIGTERM
Apache Airflow version: 2.0.2
Environment:
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release): Ubuntu 18.04.2 LTS
- Kernel (e.g. uname -a): Linux datadumpprod2 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: Docker
What happened:
After the task has finished executing but before its process has exited, the heartbeat callback kills the process because it falsely detects an external change of state.
[2021-06-02 20:40:55,273] {{taskinstance.py:1532}} INFO - Marking task as FAILED. dag_id=<dag_id>, task_id=<task_id>, execution_date=20210602T104000, start_date=20210602T184050, end_date=20210602T184055
[2021-06-02 20:40:55,768] {{local_task_job.py:188}} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2021-06-02 20:40:55,770] {{process_utils.py:100}} INFO - Sending Signals.SIGTERM to GPID 2055
[2021-06-02 20:40:55,770] {{taskinstance.py:1265}} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-06-02 20:40:56,104] {{process_utils.py:66}} INFO - Process psutil.Process(pid=2055, status='terminated', exitcode=1, started='20:40:49') (2055) terminated with exit code 1
This happens more often when the mini scheduler is enabled, because the window for the race condition is then larger (it spans the time the mini scheduler takes to run).
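The timing can be illustrated with a small deterministic model. This is only a sketch of the race described above, not Airflow code: the function name and its parameters are hypothetical, and times are plain integers in milliseconds. It assumes the task writes its final state at one moment, keeps its process alive for some extra time (for example while the mini scheduler runs), and that a heartbeat fires at fixed intervals and kills the process if it sees a non-running state while the process is still alive.

```python
from typing import Optional


def first_false_kill_ms(
    state_written_ms: int, post_work_ms: int, heartbeat_ms: int
) -> Optional[int]:
    """Return the time of the first heartbeat that would kill the task, or None.

    state_written_ms: when the task records its final state (hypothetical model).
    post_work_ms:     how long the process stays alive afterwards, e.g. the
                      duration of the mini scheduler run.
    heartbeat_ms:     interval between heartbeat checks.
    """
    if heartbeat_ms <= 0:
        raise ValueError("heartbeat_ms must be positive")

    process_exit_ms = state_written_ms + post_work_ms
    tick = heartbeat_ms
    # Only heartbeats that fire while the process is still alive can kill it.
    while tick < process_exit_ms:
        if tick >= state_written_ms:
            # Final state is already visible, but the process has not exited:
            # the heartbeat misreads this as an external state change.
            return tick
        tick += heartbeat_ms
    return None


if __name__ == "__main__":
    # Longer post-task work widens the window in which a heartbeat can land.
    for post_work in (0, 100, 1000):
        print(post_work, first_false_kill_ms(4900, post_work, 5000))
```

In this model, a heartbeat can only hit the window if the post-task work lasts long enough to overlap a heartbeat tick, which matches the observation that a slower mini scheduler makes the failure more likely.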
What you expected to happen:
The heartbeat should allow the task to finish and shouldn’t kill it.
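One way to express this expectation is a guard that skips termination once the task process has already returned. The following is a hypothetical sketch, not Airflow's implementation: `should_terminate`, the state strings, and the `return_code` argument are all assumptions made for illustration.

```python
from typing import Optional


def should_terminate(db_state: str, return_code: Optional[int]) -> bool:
    """Decide whether a heartbeat should send SIGTERM (hypothetical guard).

    db_state:    state of the task instance as read from the database.
    return_code: exit code of the task process, or None if it is still alive.
    """
    if return_code is not None:
        # The process has already finished on its own; nothing to kill.
        return False
    # Process still alive: terminate only if the state is no longer "running".
    return db_state != "running"
```

Note that this guard alone does not cover a process that has recorded its final state but not yet exited, which is the situation in this report. Handling that case would require some additional signal that the task itself changed the state.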
How to reproduce it:
Because this is a race condition, it happens randomly. To make it occur more often, enable the mini scheduler and use a database large enough that the mini scheduler run takes as long as possible. You can also reduce the heartbeat interval to the minimum.
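For a Docker-based install, the settings described above could be passed as environment variables using Airflow's `AIRFLOW__{SECTION}__{KEY}` naming convention. The sketch below builds such a mapping. The option names `schedule_after_task_execution` (mini scheduler) and `job_heartbeat_sec` (heartbeat interval), both in the `[scheduler]` section, reflect my understanding of the Airflow 2.0 configuration and should be checked against the configuration reference for your version. The value `1` is only an illustrative choice, and the helper names are hypothetical.

```python
from typing import Dict


def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name for an Airflow config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Assumed option names and illustrative values; verify before use.
REPRO_SETTINGS = {
    ("scheduler", "schedule_after_task_execution"): "True",  # mini scheduler on
    ("scheduler", "job_heartbeat_sec"): "1",  # short heartbeat interval
}


def repro_environment() -> Dict[str, str]:
    """Return the environment variables for the reproduction setup."""
    return {
        airflow_env_var(section, key): value
        for (section, key), value in REPRO_SETTINGS.items()
    }


if __name__ == "__main__":
    # Print in KEY=VALUE form, e.g. for a Docker env file.
    for name, value in repro_environment().items():
        print(f"{name}={value}")
```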
About this issue
- State: closed
- Created 3 years ago
- Comments: 32 (18 by maintainers)
I see the issue has been closed, but I am still experiencing it.
@ephraimbuddy I have no dagrun_timeout