airflow: Scheduler fails to remove pods in Completed state
Apache Airflow version
2.6.3
What happened
This is a new problem that started happening overnight for us on only one of our environments. The only change introduced was to update job_heartbeat_sec to 15 from the default of 5.
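For reference, job_heartbeat_sec lives in the [scheduler] section, so the change is equivalent to an override like the one below (the exact mechanism we used to set it shouldn't matter for the symptom):
# assumption: override applied via the standard AIRFLOW__SECTION__KEY env-var convention
export AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC=15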
What happens is that tasks are executed normally, but the worker pods remain behind in a Completed state regardless of whether the task was successful or not. This is an example of the log from the scheduler pertaining to one of the tasks:
<TaskInstance: WINRM_TESTER.WINRM_ECHO manual__2023-08-15T08:01:31.206777+00:00 [scheduled]>
[2023-08-15T08:01:34.110+0000] {scheduler_job_runner.py:476} INFO - DAG WINRM_TESTER has 0/16 running and queued tasks
<TaskInstance: WINRM_TESTER.WINRM_ECHO manual__2023-08-15T08:01:31.206777+00:00 [scheduled]>
[2023-08-15T08:01:34.161+0000] {scheduler_job_runner.py:625} INFO - Sending TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO', run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1) to executor with priority 1 and queue default
[2023-08-15T08:01:34.161+0000] {base_executor.py:147} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'WINRM_TESTER', 'WINRM_ECHO', 'manual__2023-08-15T08:01:31.206777+00:00', '--local', '--subdir', 'DAGS_FOLDER/health_check.py']
[2023-08-15T08:01:34.219+0000] {kubernetes_executor.py:637} INFO - Add task TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO', run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1) with command ['airflow', 'tasks', 'run', 'WINRM_TESTER', 'WINRM_ECHO', 'manual__2023-08-15T08:01:31.206777+00:00', '--local', '--subdir', 'DAGS_FOLDER/health_check.py']
[2023-08-15T08:01:34.251+0000] {kubernetes_executor.py:379} INFO - Creating kubernetes pod for job is TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO', run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1), with pod name winrm-tester-winrm-echo-d3xusoqk
[2023-08-15T08:01:34.306+0000] {scheduler_job_runner.py:677} INFO - Received executor event with state queued for task instance TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO', run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1)
[2023-08-15T08:01:34.575+0000] {scheduler_job_runner.py:703} INFO - Setting external_id for <TaskInstance: WINRM_TESTER.WINRM_ECHO manual__2023-08-15T08:01:31.206777+00:00 [queued]> to 219297
[2023-08-15T08:05:26.575+0000] {kubernetes_executor.py:951} INFO - Attempting to adopt pod winrm-tester-winrm-echo-d3xusoqk
The last thing of substance I see is this “Attempting to adopt pod”, and that goes for all the tasks that are done but whose worker pod remains in a Completed state.
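To make the symptom concrete, the leftover workers can be listed with something like this (sketch; “airflow” stands in for whatever namespace the worker pods run in):
# pods shown as Completed by kubectl have phase Succeeded
kubectl get pods -n airflow --field-selector=status.phase=Succeeded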
Following the activity above, the scheduler logs are filled with:
[2023-08-15T08:05:24.816+0000] {kubernetes_executor.py:552} INFO - Found 0 queued task instances
[2023-08-15T08:05:25.121+0000] {scheduler_job_runner.py:1553} INFO - Resetting orphaned tasks for active dag runs
[2023-08-15T08:05:25.265+0000] {scheduler_job_runner.py:1576} INFO - Marked 1 SchedulerJob instances as failed
[2023-08-15T08:05:25.970+0000] {kubernetes_executor.py:912} INFO - attempting to adopt pod it-erpbut-grac-action-usage-sync-h-it-erpbut-grac-action-usage-sync-h-prbfigj4
[2023-08-15T08:05:26.027+0000] {kubernetes_executor.py:912} INFO - attempting to adopt pod sap-wmd-agwi-rsrvd-timeout-h-sap-wmd-agwi-rsrvd-timeout-h-4diptrah
[2023-08-15T08:05:26.085+0000] {kubernetes_executor.py:912} INFO - attempting to adopt pod sap-wmd-send-mail-to-releaser-sap-wmd-send-mail-to-releaser-yusyanm6
[2023-08-15T08:05:26.136+0000] {kubernetes_executor.py:912} INFO - attempting to adopt pod tableau-backup-backup-95vpqqsj
[2023-08-15T08:05:26.397+0000] {kubernetes_executor.py:951} INFO - Attempting to adopt pod pow-eagle-to-popplan-export-pow-eagle-to-popplan-export-1bz9c733
[2023-08-15T08:05:26.445+0000] {kubernetes_executor.py:951} INFO - Attempting to adopt pod pow-eagle-to-sap-isu-export-pow-eagle-to-sap-isu-export-ly4l1tae
[2023-08-15T08:05:26.513+0000] {kubernetes_executor.py:951} INFO - Attempting to adopt pod sap-wmd-send-mail-to-releaser-sap-wmd-send-mail-to-releaser-ghqmaqfc
[2023-08-15T08:05:26.575+0000] {kubernetes_executor.py:951} INFO - Attempting to adopt pod winrm-tester-winrm-echo-d3xusoqk
And I do not see any exceptions to this (no “Failed to adopt pod …” errors). Any idea what is going on here? New tasks continue to be executed, but we are left with this mess of Completed pods.
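Since the logs above also show “Marked 1 SchedulerJob instances as failed”, one thing worth cross-checking is whether both scheduler jobs are still heartbeating; a sketch of that check from inside one of the pods, using the 2.x jobs check CLI:
# sketch: succeeds only if alive SchedulerJob heartbeats are visible in the metadata DB
airflow jobs check --job-type SchedulerJob --limit 100 --allow-multiple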
The strange part about this is that if I kill the scheduler pod and have it spawn again, the problem goes away, i.e. the leftover pods get deleted. So far this is only happening on one of our environments, and the configuration is practically identical on both, each running 2 scheduler instances.
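For clarity, “kill the scheduler pod” just means forcing a restart, e.g. (deployment name depends on the Helm release; “airflow-scheduler” is illustrative):
# after a scheduler restart the leftover Completed pods get deleted, as described above
kubectl -n airflow rollout restart deployment/airflow-scheduler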
What you think should happen instead
No response
How to reproduce
N/A
Operating System
Debian GNU/Linux 11
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
Deployed on AKS with official helm chart.
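If relevant: with the official chart, an override like our job_heartbeat_sec change would typically go through the chart's config values; a sketch (repo alias, release name and namespace are examples, not necessarily what we use):
# example only: config.scheduler.job_heartbeat_sec maps to [scheduler] job_heartbeat_sec in airflow.cfg
helm upgrade airflow apache-airflow/airflow -n airflow --reuse-values \
  --set config.scheduler.job_heartbeat_sec=15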
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 24 (18 by maintainers)
Unfortunately I couldn’t reproduce it with minikube; I’ll test it on an EKS cluster this weekend.