airflow: Pod logs from KubernetesPodOperator occasionally get replaced with "Task is not able to be run"

Apache Airflow version: 1.10.10
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.17.2

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): CentOS Linux
  • Kernel (e.g. uname -a): Linux airflow-worker-0 5.6.13-1.el7.elrepo.x86_64 #1 SMP Thu May 14 08:05:24 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:

What happened:

I run Airflow as a way to orchestrate jobs on Kubernetes using the KubernetesPodOperator. While most of the time logs appear correctly in the Airflow webserver, I have noticed increasingly that the logs do not appear and instead only show the message “Task is not able to be run”, as in the snippet below:

*** Reading remote log from s3://*****/****/******/2020-06-30T10:00:00+00:00/1.log.
[2020-06-30 23:07:40,362] {taskinstance.py:663} INFO - Dependencies not met for <TaskInstance: ****.***** 2020-06-30T10:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2020-06-30 23:07:40,363] {logging_mixin.py:112} INFO - [2020-06-30 23:07:40,363] {local_task_job.py:91} INFO - Task is not able to be run

Curiously, when I go and check what is happening on the Kubernetes cluster, the pod is actually running and emitting logs when I run a kubectl logs command. When the pod completes, Airflow reflects that the task has completed as well.
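
For reference, here is a minimal sketch of the kind of DAG involved; the DAG id, image, command, and schedule are illustrative placeholders rather than my actual configuration:

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    # Illustrative only -- names, image, and command are placeholders.
    with DAG(
        dag_id="example_k8s_pod",
        start_date=datetime(2020, 6, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        run_job = KubernetesPodOperator(
            task_id="run_job",
            name="run-job",
            namespace="default",
            image="python:3.7-slim",
            cmds=["python", "-c"],
            arguments=["import time; time.sleep(3600)"],
            get_logs=True,  # stream pod logs back into the Airflow task log
            is_delete_operator_pod=True,
        )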

What you expected to happen: I expected pod logs to be printed out.

How to reproduce it: Unfortunately, I am unsure what circumstances cause this error and am currently trying to gather evidence to reproduce it.

Anything else we need to know:

  • I have remote logging set to an S3 bucket (see the configuration sketch after this list).
  • I’ve noticed this issue increasingly since the 1.10.10 update, and I now see this error daily.
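
For completeness, here is a sketch of the remote-logging settings in play, written as the environment-variable equivalents of the airflow.cfg [core] options in 1.10.x; the bucket name and connection id are placeholders, not my real values:

    import os

    # Equivalent to remote_logging / remote_base_log_folder / remote_log_conn_id
    # under [core] in airflow.cfg (Airflow 1.10.x). These would normally be set
    # in the scheduler/webserver/worker environment before Airflow starts.
    os.environ["AIRFLOW__CORE__REMOTE_LOGGING"] = "True"
    os.environ["AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER"] = "s3://my-airflow-logs"
    os.environ["AIRFLOW__CORE__REMOTE_LOG_CONN_ID"] = "aws_default"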

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 13
  • Comments: 45 (26 by maintainers)

Most upvoted comments

A few new findings:

  • I was able to reproduce this issue for a task running longer than a minute by setting visibility_timeout = 60 under [celery_broker_transport_options] in the airflow.cfg file (see the environment-variable sketch after this list).

  • Therefore, this appears to happen because Celery expects the task to complete within an hour and, if it does not, assigns the task to another worker; during this transition, that worker uploads the logs containing “Task is not able to be run” to S3.

  • I can see another worker receiving the same task in the logs (Received task: airflow.executors.celery_executor.execute_command[b40cacbb-9dd3-4681-8454-0e1df2dbc910] with the same id), which supports the conclusion that Celery is assigning this task to a second worker.

  • Setting “visibility_timeout = 86400 # 1 day” in airflow.cfg does not resolve the issue, and the logs in the UI are still corrupted after an hour.

  • I also tried “visibility_timeout = 7200 # 2 hours” in airflow.cfg but can still see the issue after an hour.

  • The issue seems similar to https://github.com/celery/celery/issues/5935; according to that thread it should be resolved in Celery 4.4.5, yet we still see the same behaviour even though Airflow 1.10.10 uses Celery 4.4.6.
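
As referenced in the first bullet, the same setting can also be expressed through Airflow's AIRFLOW__{SECTION}__{KEY} environment-variable convention; this is only a sketch, and the value of 60 is purely to trigger the re-delivery quickly:

    import os

    # Equivalent to:
    #   [celery_broker_transport_options]
    #   visibility_timeout = 60
    # in airflow.cfg. Must be present in the worker/scheduler environment before
    # Airflow starts. 60 seconds makes Celery re-deliver the still-running task
    # almost immediately, mimicking the hour-long delay seen with the Redis
    # broker's default visibility timeout.
    os.environ[
        "AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__VISIBILITY_TIMEOUT"
    ] = "60"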

( CC: @chrismclennon @dimberman )

I’m receiving the same message when running a long task (duration of more than 1 day) with the KubernetesPodOperator.

{taskinstance.py:624} INFO - Dependencies not met for <TaskInstance: **** 2020-07-23T11:23:35+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.

The task finishes successfully, but the Airflow task status is marked as FAILED and downstream tasks are not run.

@dennyac @rmanvar-indeed @chrismclennon Could y’all please +1 that ticket? Hopefully that will catch the attention of a Celery maintainer. Note that one potential solution for this in 2.0 will be the ability to launch individual KubernetesExecutor tasks using the CeleryKubernetesExecutor.
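
To illustrate what that would look like, here is a minimal sketch against Airflow 2.0, assuming executor = CeleryKubernetesExecutor is set in [core] and the default kubernetes queue name is used; the DAG and task names are made up:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # With the CeleryKubernetesExecutor, tasks sent to the configured kubernetes
    # queue (default "kubernetes") are run by the KubernetesExecutor instead of
    # Celery, so the broker visibility timeout no longer applies to them.
    with DAG(
        dag_id="example_celery_kubernetes",
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
    ) as dag:
        long_running = BashOperator(
            task_id="long_running_job",
            bash_command="sleep 7200",
            queue="kubernetes",  # routed to the KubernetesExecutor
        )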

Thanks @raj-manvar. I’m actually more concerned about tasks being stuck in running state for prolonged periods (beyond the specified task execution timeout), which has caused delays and cascading failures. As of now, we just manually restart the task whenever we encounter this.

I verified using 2.0.1 and was not able to reproduce this. Can you try it with Airflow 2.0.1? Thanks.

The current workaround we have is a view plugin which fetches logs from the workers instead of S3.
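
Not the plugin itself, but a sketch of the underlying idea: Celery workers already serve their local log files over HTTP on [celery] worker_log_server_port (8793 by default), which is the same endpoint the webserver's FileTaskHandler falls back to, so a view plugin can proxy that instead of reading S3. The helper below is hypothetical and assumes the default log_filename_template layout:

    import requests

    def fetch_worker_log(hostname, dag_id, task_id, execution_date_iso,
                         try_number, port=8793, timeout=10):
        """Fetch a task attempt's log directly from the worker that ran it."""
        url = (
            f"http://{hostname}:{port}/log/"
            f"{dag_id}/{task_id}/{execution_date_iso}/{try_number}.log"
        )
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.text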

We’re experiencing the same issue across Operators/Sensors (Airflow 1.10.11, CeleryExecutor with Redis backend)

When two jobs for the same task get enqueued an hour apart, the first job continues to run but its logs don’t appear in the UI. The second job completes immediately because its dependencies are not met (the first job, which is still running, keeps the task in the running state), and you see “Task is not able to be run” in the UI logs.

If the first job completes (success or failure), the task status gets updated accordingly and its logs are then added to the Airflow UI.

If the first job doesn’t complete (I’ve noticed cases where the job just hangs, and I’m not sure how worker restarts affect this), the task attempt remains in the running state. In this scenario the task execution timeout isn’t honored, so the task can run for a very long time.

Unknowns -

  • Why is the job being enqueued twice?
  • Why isn’t Airflow honoring the task execution timeout in this scenario?

The latter unknown is causing issues for us as tasks end up running for hours, and we have to manually intervene and restart the task.