airflow: Duplicate tasks invoked for a single task_id when a task is manually invoked from the task details modal.

Apache Airflow version: 1.10.11

Kubernetes version (if you are using kubernetes) (use kubectl version): NA

Environment:

  • Cloud provider or hardware configuration: AWS (EC2 instances)
  • OS (e.g. from /etc/os-release):
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Kernel (e.g. uname -a): Linux airflow-scheduler-10-229-13-220 4.14.165-131.185.amzn2.x86_64 #1 SMP Wed Jan 15 14:19:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:

  • Others:

What happened:

When we manually invoke a task from the task details dialog, the task runs for approximately 22 seconds before the following appears in the log:

[2020-07-28 01:25:14,726] {local_task_job.py:150} WARNING - Recorded pid 26940 does not match the current pid 26751
[2020-07-28 01:25:14,728] {helpers.py:325} INFO - Sending Signals.SIGTERM to GPID 26757

The task is then killed. We notice this is accompanied by a second failure shortly afterwards that correlates with the new pid that has been written to the task_instance table.

It is interesting to note that if the task is scheduled as part of a normal DAG run, or by clearing its state and allowing the scheduler to schedule its execution, then we do not experience any issue.

We have attempted to set task_concurrency on our operators, with no effect.

What you expected to happen: We expected a single process to be spawned for the manually executed task.

How to reproduce it: Manually invoke a task via the task details dialog, where that task's execution will take longer than the heartbeat interval that has been set.

The heartbeat callback checks the recorded pid, sees that it does not match the current pid, and so kills the task.
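As an illustration only (not the actual Airflow source), the pid check performed by the heartbeat in local_task_job.py behaves roughly like this minimal sketch; the function name and return values here are simplifications:

```python
import os


def heartbeat_pid_check(recorded_pid, current_pid=None):
    """Simplified sketch of the pid comparison LocalTaskJob performs on
    each heartbeat (based on the log lines above; not Airflow's code).

    recorded_pid: the pid stored in the task_instance table.
    current_pid:  the pid of the process actually running the task.
    """
    if current_pid is None:
        current_pid = os.getpid()
    if recorded_pid is not None and recorded_pid != current_pid:
        # In Airflow this emits the "Recorded pid ... does not match the
        # current pid ..." WARNING and the task's process group is then
        # sent SIGTERM, which is the kill described in this report.
        return ("kill",
                "Recorded pid %s does not match the current pid %s"
                % (recorded_pid, current_pid))
    return ("ok", None)
```

The bug described here is that the manual invocation spawns a second process, so a fresh pid is written to the task_instance table and this check fires against the original process.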

Anything else we need to know:

We can reproduce this reliably whenever the task execution time is longer than the heartbeat interval.
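For reference, the interval referred to here is presumably the task-job heartbeat, configured via job_heartbeat_sec in airflow.cfg. Raising it only widens the window before the pid check runs; it does not address the duplicate process itself:

```ini
[scheduler]
# How often (in seconds) each running task job heartbeats and performs
# the pid check shown in the logs above. The default is 5.
job_heartbeat_sec = 30
```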

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 37 (17 by maintainers)

Most upvoted comments

We are also facing the same issue using Composer 1.17.3 and Airflow 2.1.2.

On 2.1.3 I saw that a DAG containing a sequential task flow works fine. The SIGTERM happens if the DAG branches into parallel tasks. I rolled back to 2.1.2.

Thank you for this thread! I am having the same pid-mismatch issue running Airflow 2.1.2 on Google Cloud in Composer 1.17.1:

[2021-10-01 20:10:27,295] {local_task_job.py:194} WARNING - Recorded pid 13335 does not match the current pid 834
[2021-10-01 20:10:27,304] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 834
[2021-10-01 20:10:27,316] {process_utils.py:66} INFO - Process psutil.Process(pid=834, status='terminated', exitcode=<Negsignal.SIGTERM: -15>, started='20:10:10') (834) terminated with exit code Negsignal.SIGTERM

I’m facing the same error on Airflow 2.1.3, which I reached while looking for a solution to the zombie-detection mechanism killing trivial tasks at random times, so I cannot try 2.1.2. I have already tried increasing the heartbeat frequency in the config for both workers and scheduler. These are trivial DAGs with linear workflows, no parallel tasks, and no particular configuration. I think this validation check could at least be a warning rather than killing the task outright; blocking any further progress makes usage impossible.

Same issue here, running Airflow 2.1.1:

[2021-07-27 19:21:52,578] {local_task_job.py:195} WARNING - Recorded pid 656 does not match the current pid 76
[2021-07-27 19:21:52,581] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 76
[2021-07-27 19:21:52,588] {process_utils.py:66} INFO - Process psutil.Process(pid=76, status='terminated', exitcode=<Negsignal.SIGTERM: -15>, started='19:21:41') (76) terminated with exit code `Negsignal.SIGTERM`

Also, I tried clearing the task state and the same issue happens.

+1, encountered the same issue on 2.1.2, but mine is worse… the supposedly non-matching pids actually match:

[2021-07-16 16:56:03,719] {pod_launcher.py:128} WARNING - Pod not yet started: scraper.2af2f07714944939b31cebb1473710b2
[2021-07-16 16:56:04,236] {local_task_job.py:194} WARNING - Recorded pid 9 does not match the current pid 9
[2021-07-16 16:56:04,237] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 9