airflow: Reduce irrelevant ERROR-level logs from KubernetesPodOperator on pod runtime failure
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
Using a KubernetesPodOperator (KPO) that fails at runtime with log_events_on_failure turned on, for example:
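from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

KubernetesPodOperator(
    task_id=...,
    image=...,
    cmds=["/bin/bash", "-c", "-x"],
    arguments=["eccho"],  # deliberate typo so the container fails at runtime
    log_events_on_failure=True,  # dump pod events when the task fails
)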
produces the following log lines:
[2023-12-05, 02:53:08 UTC] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 1st time calling it.
[2023-12-05, 02:53:08 UTC] {pod_manager.py:381} INFO - + eccho
[2023-12-05, 02:53:08 UTC] {pod_manager.py:381} INFO - bash: eccho: command not found
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Scheduled - Successfully assigned airflow/redacted_pod_name
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Pulling - Pulling image "redacted:latest"
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Pulled - Successfully pulled image "redacted:latest" in 142.533921ms (142.546182ms including waiting)
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Created - Created container base
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Started - Started container base
[2023-12-05, 02:53:10 UTC] {pod.py:726} INFO - Deleting pod: redacted_pod_name
[2023-12-05, 02:53:10 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
remote_pod: {'api_version': 'v1',
'kind': 'Pod',
'metadata': {'annotations': None,
... # VERY LONG JSON
}
...
Most of these lines are irrelevant to the actual cause of the failure (i.e., the container exited with a nonzero status after writing the error to stderr).
What you think should happen instead
- Use a reasonable log level for each pod event type & reason
  - Normal -> logging.DEBUG
  - (Warning, Failed) -> logging.ERROR
  - Normal ->
- Print the pod's runtime stderr at logging.ERROR
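The mapping proposed above could be sketched roughly as follows. This is only an illustration, not the provider's actual code: `PodEvent`, `level_for_event`, and `log_pod_events` are hypothetical names, `PodEvent` is a stand-in for the event objects returned by the Kubernetes API (which carry `type`, `reason`, and `message`), and "(Warning, Failed)" is read here as "type Warning or reason Failed", which is an assumption.

```python
import logging
from dataclasses import dataclass
from typing import Iterable

logger = logging.getLogger("kpo_events_sketch")


@dataclass
class PodEvent:
    """Hypothetical stand-in for a Kubernetes event (type, reason, message)."""

    type: str
    reason: str
    message: str


def level_for_event(event: PodEvent) -> int:
    """Pick a log level for a pod event, following the proposal above."""
    # Assumption: any Warning-type event, or any event whose reason is
    # "Failed", is worth surfacing at ERROR.
    if event.type == "Warning" or event.reason == "Failed":
        return logging.ERROR
    # Normal events (Scheduled, Pulling, Pulled, Created, Started, ...) are
    # routine lifecycle noise and go to DEBUG.
    if event.type == "Normal":
        return logging.DEBUG
    # Assumption: unrecognized event types fall back to INFO.
    return logging.INFO


def log_pod_events(events: Iterable[PodEvent]) -> None:
    """Log each event at the level chosen by level_for_event."""
    for event in events:
        logger.log(
            level_for_event(event),
            "Pod Event: %s - %s",
            event.reason,
            event.message,
        )
```

With a mapping like this, the five "Pod Event" lines in the log above would drop to DEBUG, leaving ERROR for events that actually point at a problem.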
How to reproduce
See above.
Operating System
Debian Bookworm
Versions of Apache Airflow Providers
apache-airflow==2.6.2
apache-airflow-providers-amazon==8.2.0
apache-airflow-providers-cncf-kubernetes==7.1.0
apache-airflow-providers-common-sql==1.5.1
apache-airflow-providers-ftp==3.4.1
apache-airflow-providers-http==4.4.1
apache-airflow-providers-imap==3.2.1
apache-airflow-providers-postgres==5.5.1
apache-airflow-providers-sqlite==3.4.1
Deployment
Docker-Compose
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: closed
- Created 7 months ago
- Comments: 16 (16 by maintainers)
@ketozhang Thank you so much. There was a problem with RBAC; I updated it and can now get Pod events.
Hi @ketozhang, I implemented a solution in https://github.com/apache/airflow/pull/37944. Could you please take a look?
@ketozhang Thanks! Will try this shortly.
Thanks 🚀 I was also thinking about that but wasn’t sure.
A pod event failure happens on the Kubernetes side when a Pod fails. One example is a pod requesting far more resources than are available in the cluster.
In your test, the pod launched successfully and ran the Python code, even though that code raised an exception. This is not a Pod failure, since the pod did what it is supposed to do (run the Python code).
You can try various realistic scenarios, such as requesting a very large amount of CPU and RAM.
Perhaps others here can point you to existing test cases that demonstrate a Pod failure, with either real or mocked scenarios.
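As a rough illustration of the resource scenario mentioned above, a Pod manifest with deliberately oversized requests might look like the sketch below. The helper name `oversized_pod_manifest`, the pod name, the `busybox` image, and the request values are all placeholders chosen for this example. On most clusters such a pod would be expected to stay Pending, with the scheduler emitting a Warning event with reason FailedScheduling.

```python
import json


def oversized_pod_manifest(name: str, image: str) -> dict:
    """Build a Pod manifest whose requests exceed what a typical node offers."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "base",
                    "image": image,
                    "command": ["/bin/sh", "-c", "echo hello"],
                    "resources": {
                        # Assumed values, picked only to be larger than
                        # any node in the test cluster can satisfy.
                        "requests": {"cpu": "1000", "memory": "1000Gi"},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Print the manifest as JSON so it can be inspected or applied by hand.
    print(json.dumps(oversized_pod_manifest("unschedulable-demo", "busybox"), indent=2))
```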
Assigned you 😃
Hi, I am an MLH fellowship intern at Airflow. I would like to work on this issue.