airflow: Reduce irrelevant ERROR level logs from KubernertesPodOperator on pod runtime failure

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

Using KPO that fails on runtime turning on log_events_on_failure, using a trivial example,

KubernetesPodOperator(
  task_id=...,
  image=...,
  cmds=["/bin/bash", "-c", "-x"],
  arguments=["eccho"],  # typo
  log_events_on_failure=True,
)

returns various lines of logs

[2023-12-05, 02:53:08 UTC] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 1st time calling it.
[2023-12-05, 02:53:08 UTC] {pod_manager.py:381} INFO - + eccho
[2023-12-05, 02:53:08 UTC] {pod_manager.py:381} INFO - bash: eccho: command not found
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Scheduled - Successfully assigned airflow/redacted_pod_name
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Pulling - Pulling image "redacted:latest"
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Pulled - Successfully pulled image "redacted:latest" in 142.533921ms (142.546182ms including waiting)
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Created - Created container base
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Started - Started container base
[2023-12-05, 02:53:10 UTC] {pod.py:726} INFO - Deleting pod: redacted_pod_name
[2023-12-05, 02:53:10 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
remote_pod: {'api_version': 'v1',
 'kind': 'Pod',
 'metadata': {'annotations': None,
  ... # VERY LONG JSON
}
...

that are irrelevant for the reason of failure (i.e., the runtime container exited nonzero with $STDERR)

What you think should happen instead

  • Use reasonable log level for each pod event type & reason
    • Normal->logging.DEBUG
    • (Warning, Failed)->logging.ERROR
  • Print pod runtime stderr to logging.ERROR

How to reproduce

See above.

Operating System

Debian Bookworm

Versions of Apache Airflow Providers

apache-airflow==2.6.2
apache-airflow-providers-amazon==8.2.0
apache-airflow-providers-cncf-kubernetes==7.1.0
apache-airflow-providers-common-sql==1.5.1
apache-airflow-providers-ftp==3.4.1
apache-airflow-providers-http==4.4.1
apache-airflow-providers-imap==3.2.1
apache-airflow-providers-postgres==5.5.1
apache-airflow-providers-sqlite==3.4.1

Deployment

Docker-Compose

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

About this issue

  • Original URL
  • State: closed
  • Created 7 months ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

@ketozhang Thank you so much. There were problem in RBAC, I updated it and now I can get Pod events

[2024-02-15, 23:26:26 UTC] {pod.py:980} ERROR - (403)
[2024-02-15, 23:26:26 UTC] {taskinstance.py:2698} ERROR - Task failed with exception
[2024-02-15, 23:26:26 UTC] {standard_task_runner.py:107} ERROR - Failed to execute job 2 for task minimal_test_fail_task_without_cr (Pod minimal-test-fail-task-without-cr-rp9u952i returned a failure.
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Scheduled - Successfully assigned default/minimal-test-fail-task-without-cr-kfnmnrcd to kind-worker
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Pulling - Pulling image "alpine:latest"
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Pulled - Successfully pulled image "alpine:latest" in 566ms (566ms including waiting)
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Created - Created container base
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Started - Started container base
[2024-02-16, 00:45:26 UTC] {taskinstance.py:2698} ERROR - Task failed with exception
[2024-02-16, 00:45:26 UTC] {standard_task_runner.py:107} ERROR - Failed to execute job 2 for task minimal_test_fail_task_without_cr (Pod minimal-test-fail-task-without-cr-kfnmnrcd returned a failure.

Hi @ketozhang Implemented a solution here https://github.com/apache/airflow/pull/37944, could you please take a look.

@ketozhang Thanks! will try this shortly

Thanks 🚀 I was also thinking about that but wasn’t sure.

A pod event failure happens on the Kubernetes side when a Pod fails. An example of a failure is when the pod requested resource is much higher than what’s available in the cluster.

In your test, your pod succeeded to launch, successfully ran the Python code despite Python raising an Exception. This is not a Pod failure since it did what it’s suppose to do (run the Python code).

You can try various realistic scenarios like requesting a large number of CPU and RAM.

Perhaps others here can help point you to any existing testing cases that demonstrate a Pod failure either with real or mocked scenarios.

Assigned you 😃

Hi, I am an MLH fellowship intern at Airflow. I would like to work on this issue.