airflow: Reduce irrelevant ERROR level logs from KubernertesPodOperator on pod runtime failure

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

Using KPO that fails on runtime turning on log_events_on_failure, using a trivial example,

KubernetesPodOperator(
  task_id=...,
  image=...,
  cmds=["/bin/bash", "-c", "-x"],
  arguments=["eccho"],  # typo
  log_events_on_failure=True,
)

returns various lines of logs

[2023-12-05, 02:53:08 UTC] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 1st time calling it.
[2023-12-05, 02:53:08 UTC] {pod_manager.py:381} INFO - + eccho
[2023-12-05, 02:53:08 UTC] {pod_manager.py:381} INFO - bash: eccho: command not found
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Scheduled - Successfully assigned airflow/redacted_pod_name
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Pulling - Pulling image "redacted:latest"
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Pulled - Successfully pulled image "redacted:latest" in 142.533921ms (142.546182ms including waiting)
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Created - Created container base
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Started - Started container base
[2023-12-05, 02:53:10 UTC] {pod.py:726} INFO - Deleting pod: redacted_pod_name
[2023-12-05, 02:53:10 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
remote_pod: {'api_version': 'v1',
 'kind': 'Pod',
 'metadata': {'annotations': None,
  ... # VERY LONG JSON
}
...

that are irrelevant for the reason of failure (i.e., the runtime container exited nonzero with $STDERR)

What you think should happen instead

Use reasonable log level for each pod event type & reason
- Normal->logging.DEBUG
- (Warning, Failed)->logging.ERROR
Print pod runtime stderr to logging.ERROR

How to reproduce

See above.

Operating System

Debian Bookworm

Versions of Apache Airflow Providers

apache-airflow==2.6.2
apache-airflow-providers-amazon==8.2.0
apache-airflow-providers-cncf-kubernetes==7.1.0
apache-airflow-providers-common-sql==1.5.1
apache-airflow-providers-ftp==3.4.1
apache-airflow-providers-http==4.4.1
apache-airflow-providers-imap==3.2.1
apache-airflow-providers-postgres==5.5.1
apache-airflow-providers-sqlite==3.4.1

Deployment

Docker-Compose

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project’s Code of Conduct

About this issue

Original URL
State: closed
Created 7 months ago
Comments: 16 (16 by maintainers)

Most upvoted comments

@ketozhang Thank you so much. There were problem in RBAC, I updated it and now I can get Pod events

[2024-02-15, 23:26:26 UTC] {pod.py:980} ERROR - (403)
[2024-02-15, 23:26:26 UTC] {taskinstance.py:2698} ERROR - Task failed with exception
[2024-02-15, 23:26:26 UTC] {standard_task_runner.py:107} ERROR - Failed to execute job 2 for task minimal_test_fail_task_without_cr (Pod minimal-test-fail-task-without-cr-rp9u952i returned a failure.
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Scheduled - Successfully assigned default/minimal-test-fail-task-without-cr-kfnmnrcd to kind-worker
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Pulling - Pulling image "alpine:latest"
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Pulled - Successfully pulled image "alpine:latest" in 566ms (566ms including waiting)
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Created - Created container base
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Started - Started container base
[2024-02-16, 00:45:26 UTC] {taskinstance.py:2698} ERROR - Task failed with exception
[2024-02-16, 00:45:26 UTC] {standard_task_runner.py:107} ERROR - Failed to execute job 2 for task minimal_test_fail_task_without_cr (Pod minimal-test-fail-task-without-cr-kfnmnrcd returned a failure.

sudiptob2 on Feb 16, 2024

Hi @ketozhang Implemented a solution here https://github.com/apache/airflow/pull/37944, could you please take a look.

sudiptob2 on Mar 6, 2024

@ketozhang Thanks! will try this shortly

sudiptob2 on Feb 15, 2024

Thanks 🚀 I was also thinking about that but wasn’t sure.

sudiptob2 on Feb 15, 2024

A pod event failure happens on the Kubernetes side when a Pod fails. An example of a failure is when the pod requested resource is much higher than what’s available in the cluster.

In your test, your pod succeeded to launch, successfully ran the Python code despite Python raising an Exception. This is not a Pod failure since it did what it’s suppose to do (run the Python code).

You can try various realistic scenarios like requesting a large number of CPU and RAM.

Perhaps others here can help point you to any existing testing cases that demonstrate a Pod failure either with real or mocked scenarios.

ketozhang on Feb 15, 2024

Assigned you 😃

potiuk on Feb 10, 2024

Hi, I am an MLH fellowship intern at Airflow. I would like to work on this issue.

sudiptob2 on Feb 10, 2024