airflow: Tasks stuck indefinitely when following container logs

Apache Airflow version

2.2.4

What happened

I observed that some workers hung randomly while running, and their logs stopped being reported. After some time, the pod status was “Completed” when inspected through the Kubernetes API, but not in Airflow, which still showed “status: running” for the pod. After some investigation, the issue turns out to be in the new KubernetesPodOperator and depends on a current issue in the Kubernetes API.

When a log rotation event occurs in Kubernetes, the stream we consume in fetch_container_logs(follow=True, …) is no longer fed.

As a result, the KubernetesPodOperator hangs indefinitely in the middle of the log. Only a SIGTERM can terminate it, since log consumption blocks execute() from finishing.

Reference to the Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/59902

Linking to https://github.com/apache/airflow/issues/12103 for reference, as the result is more or less the same for the end user (although the root cause is different).

What you think should happen instead

The pod operator should not hang. It could also follow the new logs from the container, but that is out of scope for Airflow, as ideally the Kubernetes API would do this automatically.

Solution proposal

I think there are several ways to work around this on Airflow's side so it does not hang indefinitely, such as making fetch_container_logs non-blocking for execute and instead always blocking until the pod reaches the Completed phase, as is currently done when get_logs is not true. A rough sketch of that idea follows below.
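
A minimal sketch of that approach, assuming fetch_container_logs and await_pod_completion behave roughly as they do in the provider's pod_manager.py; the helper name and the exact call signatures here are illustrative, not the actual operator code:

import threading


def execute_with_nonblocking_logs(pod_manager, pod, container_name="base"):
    # Consume logs in a daemon thread so a stalled stream cannot block execute().
    log_thread = threading.Thread(
        target=pod_manager.fetch_container_logs,
        kwargs={"pod": pod, "container_name": container_name, "follow": True},
        daemon=True,
    )
    log_thread.start()

    # Block on the pod phase instead of the log stream, as is already done
    # when get_logs is not true.
    remote_pod = pod_manager.await_pod_completion(pod)

    # Give the log thread a short grace period to drain, then move on even if
    # the stream never returns (e.g. after a log rotation).
    log_thread.join(timeout=60)
    return remote_pod

The real fix may look different; the point is only that execute() should wait on the pod state rather than on the log stream.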

How to reproduce

Running multiple tasks will sooner or later trigger this. One can also configure more aggressive log rotation in Kubernetes (e.g. by lowering the kubelet's containerLogMaxSize) so that the race is triggered more often.

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

apache-airflow==2.2.4
apache-airflow-providers-google==6.4.0
apache-airflow-providers-cncf-kubernetes==3.0.2

However, this should be reproducible with master.

Deployment

Official Apache Airflow Helm Chart

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Most upvoted comments

I’ve finally managed to reproduce this bug with the following DAG on composer-2.0.29-airflow-2.3.3:

import datetime

from airflow import models
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

with models.DAG(
    dag_id="composer_sample_kubernetes_pod",
    schedule_interval=datetime.timedelta(days=1),
    start_date=YESTERDAY,
) as dag:
    timeout = 240
    iterations = 600 * 1000
    # Bash loop that prints 600K short log lines and exits almost immediately,
    # so the container terminates while its logs are still being streamed.
    arguments = \
        'for i in {1..%(iterations)s}; do echo "$i of %(iterations)s"; done' % {'iterations': iterations}

    kubernetes_min_pod_0 = KubernetesPodOperator(
        task_id="pod-ex-minimum-0",
        name="pod-ex-minimum-0",
        cmds=["/bin/bash", "-c"],
        arguments=[arguments],
        namespace="default",
        image="gcr.io/gcp-runtimes/ubuntu_18_0_4",
        startup_timeout_seconds=timeout
    )

With this example, a container prints 600K log messages and terminates very quickly. Meanwhile, the Kubernetes API is pulling chunks of container logs from a stream. The pulling is much slower, so eventually we end up in a situation where the container has terminated but we are still pulling logs. The pulling continues for about 2–3 minutes after the container terminates. It looks to me like the logs are cached somewhere at a lower level, and once this cache is exhausted, the stream hangs. Perhaps it should check the socket or connection status, but in practice it just hangs.

Here is the line of code that hangs on Airflow’s side: https://github.com/apache/airflow/blob/395a34b960c73118a732d371e93aeab8dcd76275/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L232

And here’s the underlying line of code that hangs on urllib3’s side: https://github.com/urllib3/urllib3/blob/d393b4a5091c27d2e158074f81feb264c5c175af/src/urllib3/response.py#L999

If I’m right, then the source of the issue lies in third-party libraries (the Kubernetes API client or urllib3). In that case, the easiest solution would be to check the container status before pulling each chunk of logs from the urllib3.response.HTTPResponse. A more robust solution would be to cache the logs in temporary storage and fetch them from that new source independently of the container life cycle, but I’m not sure that’s possible. I’ll raise a PR with an implementation of the first option; a rough sketch is below.
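
A rough sketch of that first option, assuming container_is_running() and read_pod() behave like the helpers in pod_manager.py; the generator wrapper itself and its name are only illustrative:

from airflow.providers.cncf.kubernetes.utils.pod_manager import container_is_running


def logs_until_container_stops(pod_manager, pod, container_name, response, chunk_size=4096):
    # Yield log chunks only while the container is still running, so we never
    # block forever on a stream that will not be fed again.
    while container_is_running(pod_manager.read_pod(pod), container_name):
        chunk = response.read(chunk_size)
        if not chunk:
            return  # the stream was closed normally
        yield chunk
    # The container has terminated: stop following here (draining any buffered
    # tail of the stream is left out of this sketch).

In practice the status check would need to be cached or rate-limited rather than issued before every chunk, but the gist is to trust the container state instead of the stream.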

Cool. Assigned you 😃 !

@potiuk sure, I will submit a PR one of these days.