airflow: Scheduler fails to remove pods in Completed state

Apache Airflow version

2.6.3

What happened

This is a new problem that started overnight for us, and only in one of our environments. The only change we introduced was raising job_heartbeat_sec from its default of 5 to 15.
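
For reference, the change amounts to the following scheduler setting, shown here as an airflow.cfg snippet (when configured through the Helm chart this is the equivalent AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC environment variable):

[scheduler]
job_heartbeat_sec = 15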

Tasks execute normally, but the worker pods are left behind in a Completed state regardless of whether the task succeeded or failed. Here is an example of the scheduler log for one of these tasks:

<TaskInstance: WINRM_TESTER.WINRM_ECHO manual__2023-08-15T08:01:31.206777+00:00 [scheduled]>
[2023-08-15T08:01:34.110+0000] {scheduler_job_runner.py:476} INFO - DAG WINRM_TESTER has 0/16 running and queued tasks
<TaskInstance: WINRM_TESTER.WINRM_ECHO manual__2023-08-15T08:01:31.206777+00:00 [scheduled]>
[2023-08-15T08:01:34.161+0000] {scheduler_job_runner.py:625} INFO - Sending TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO', run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1) to executor with priority 1 and queue default
[2023-08-15T08:01:34.161+0000] {base_executor.py:147} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'WINRM_TESTER', 'WINRM_ECHO', 'manual__2023-08-15T08:01:31.206777+00:00', '--local', '--subdir', 'DAGS_FOLDER/health_check.py']
[2023-08-15T08:01:34.219+0000] {kubernetes_executor.py:637} INFO - Add task TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO', run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1) with command ['airflow', 'tasks', 'run', 'WINRM_TESTER', 'WINRM_ECHO', 'manual__2023-08-15T08:01:31.206777+00:00', '--local', '--subdir', 'DAGS_FOLDER/health_check.py']
[2023-08-15T08:01:34.251+0000] {kubernetes_executor.py:379} INFO - Creating kubernetes pod for job is TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO', run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1), with pod name winrm-tester-winrm-echo-d3xusoqk
[2023-08-15T08:01:34.306+0000] {scheduler_job_runner.py:677} INFO - Received executor event with state queued for task instance TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO', run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1)
[2023-08-15T08:01:34.575+0000] {scheduler_job_runner.py:703} INFO - Setting external_id for <TaskInstance: WINRM_TESTER.WINRM_ECHO manual__2023-08-15T08:01:31.206777+00:00 [queued]> to 219297
[2023-08-15T08:05:26.575+0000] {kubernetes_executor.py:951} INFO - Attempting to adopt pod winrm-tester-winrm-echo-d3xusoqk

The last entry of substance I see is this “Attempting to adopt pod” message, and the same pattern holds for every task that has finished but whose worker pod remains in the Completed state.
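
As far as I can tell, adoption works by the scheduler re-labelling orphaned worker pods with its own job id (the airflow-worker label), so the leftover pods can be inspected with something along these lines (namespace and label name are assumptions, adjust to your deployment; pods from failed tasks end up in phase Failed rather than Succeeded):

kubectl get pods -n airflow --field-selector=status.phase=Succeeded -L airflow-worker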

Following the activity above, the scheduler logs are filled with:

[2023-08-15T08:05:24.816+0000] {kubernetes_executor.py:552} INFO - Found 0 queued task instances
[2023-08-15T08:05:25.121+0000] {scheduler_job_runner.py:1553} INFO - Resetting orphaned tasks for active dag runs
[2023-08-15T08:05:25.265+0000] {scheduler_job_runner.py:1576} INFO - Marked 1 SchedulerJob instances as failed
[2023-08-15T08:05:25.970+0000] {kubernetes_executor.py:912} INFO - attempting to adopt pod it-erpbut-grac-action-usage-sync-h-it-erpbut-grac-action-usage-sync-h-prbfigj4
[2023-08-15T08:05:26.027+0000] {kubernetes_executor.py:912} INFO - attempting to adopt pod sap-wmd-agwi-rsrvd-timeout-h-sap-wmd-agwi-rsrvd-timeout-h-4diptrah
[2023-08-15T08:05:26.085+0000] {kubernetes_executor.py:912} INFO - attempting to adopt pod sap-wmd-send-mail-to-releaser-sap-wmd-send-mail-to-releaser-yusyanm6
[2023-08-15T08:05:26.136+0000] {kubernetes_executor.py:912} INFO - attempting to adopt pod tableau-backup-backup-95vpqqsj
[2023-08-15T08:05:26.397+0000] {kubernetes_executor.py:951} INFO - Attempting to adopt pod pow-eagle-to-popplan-export-pow-eagle-to-popplan-export-1bz9c733
[2023-08-15T08:05:26.445+0000] {kubernetes_executor.py:951} INFO - Attempting to adopt pod pow-eagle-to-sap-isu-export-pow-eagle-to-sap-isu-export-ly4l1tae
[2023-08-15T08:05:26.513+0000] {kubernetes_executor.py:951} INFO - Attempting to adopt pod sap-wmd-send-mail-to-releaser-sap-wmd-send-mail-to-releaser-ghqmaqfc
[2023-08-15T08:05:26.575+0000] {kubernetes_executor.py:951} INFO - Attempting to adopt pod winrm-tester-winrm-echo-d3xusoqk

I do not see any failures logged for these attempts (no “Failed to adopt pod …” messages). Any idea what is going on here? New tasks continue to run, but we are left with this pile of Completed pods.
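
As a stopgap, the leftover pods can be cleaned up by hand with something like the following (namespace assumed; widen or narrow the selector as needed, since failed tasks leave pods in phase Failed rather than Succeeded):

kubectl delete pods -n airflow --field-selector=status.phase=Succeeded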

The strange part is that killing the scheduler pod and letting it respawn solves the problem, i.e. the Completed pods get deleted. So far this is only happening in one of our environments, even though the configuration is practically identical in both, each running 2 scheduler instances.
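
For completeness, the “restart” that clears things up is simply deleting the scheduler pods and letting the Deployment recreate them, along the lines of (label selector assumed from the official chart):

kubectl delete pod -n airflow -l component=scheduler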

What you think should happen instead

No response

How to reproduce

N/A

Operating System

Debian GNU/Linux 11

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

Deployed on AKS with the official Helm chart.

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 24 (18 by maintainers)

Most upvoted comments

Hi @hussein-awala, did you have any luck reproducing this problem?

Unfortunately I couldn’t reproduce it with minikube; I’ll test it on an EKS cluster this weekend.