airflow: KubernetesPodOperator breaks with active log-collection for long running tasks
I’m encountering the same bug reported in https://issues.apache.org/jira/browse/AIRFLOW-3534, with airflow 1.10.12.
[2020-11-06 13:03:29,672] {pod_launcher.py:173} INFO - Event: fetcher-56104d81c54946a88ce3cd1cf4273477 had an event of type Pending
[2020-11-06 13:03:29,673] {pod_launcher.py:139} WARNING - Pod not yet started: fetcher-56104d81c54946a88ce3cd1cf4273477
[2020-11-06 13:03:30,681] {pod_launcher.py:173} INFO - Event: fetcher-56104d81c54946a88ce3cd1cf4273477 had an event of type Pending
[2020-11-06 13:03:30,681] {pod_launcher.py:139} WARNING - Pod not yet started: fetcher-56104d81c54946a88ce3cd1cf4273477
[2020-11-06 13:03:31,692] {pod_launcher.py:173} INFO - Event: fetcher-56104d81c54946a88ce3cd1cf4273477 had an event of type Pending
[2020-11-06 13:03:31,692] {pod_launcher.py:139} WARNING - Pod not yet started: fetcher-56104d81c54946a88ce3cd1cf4273477
[2020-11-06 13:03:32,702] {pod_launcher.py:173} INFO - Event: fetcher-56104d81c54946a88ce3cd1cf4273477 had an event of type Running
[2020-11-06 13:04:32,740] {taskinstance.py:1150} ERROR - ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
Traceback (most recent call last):
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/urllib3/response.py", line 696, in _update_chunk_length
self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/urllib3/response.py", line 436, in _error_catcher
yield
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/urllib3/response.py", line 763, in read_chunked
self._update_chunk_length()
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/urllib3/response.py", line 700, in _update_chunk_length
raise httplib.IncompleteRead(line)
http.client.IncompleteRead: IncompleteRead(0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 979, in _run_raw_task
result = task_copy.execute(context=context)
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 284, in execute
final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 403, in create_new_pod_for_operator
final_state, result = launcher.monitor_pod(pod=pod, get_logs=self.get_logs)
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/airflow/kubernetes/pod_launcher.py", line 155, in monitor_pod
for line in logs:
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/urllib3/response.py", line 807, in __iter__
for chunk in self.stream(decode_content=True):
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/urllib3/response.py", line 571, in stream
for line in self.read_chunked(amt, decode_content=decode_content):
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/urllib3/response.py", line 792, in read_chunked
self._original_response.close()
File "/opt/bitnami/python/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/urllib3/response.py", line 454, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
[2020-11-06 13:04:32,743] {taskinstance.py:1194} INFO - Marking task as UP_FOR_RETRY. dag_id=..., task_id=..., execution_date=20201106T120000, start_date=20201106T130329, end_date=20201106T130432
[2020-11-06 13:04:34,641] {local_task_job.py:102} INFO - Task exited with return code 1
The bug goes away when setting `get_logs=False` on the KubernetesPodOperator. Reproduced with multiple DAGs and tasks.
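For reference, the workaround amounts to disabling log streaming on the operator. A minimal sketch, assuming the Airflow 1.10.x contrib import path shown in the traceback; the `task_id`, `namespace`, and `image` values are placeholders:

```python
# Workaround sketch: disable log streaming so the operator does not hold a
# long-lived chunked HTTP connection to the Kubernetes API for pod logs.
# task_id, namespace, and image are placeholders for illustration only.
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

fetcher = KubernetesPodOperator(
    task_id="fetcher",
    name="fetcher",
    namespace="default",
    image="my-registry/fetcher:latest",  # placeholder image
    get_logs=False,  # the workaround: skip streaming pod logs into Airflow
)
```

The trade-off is that the container's stdout no longer shows up in the Airflow task log; you would have to rely on `kubectl logs` or a log aggregator instead.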
About this issue
- State: closed
- Created 4 years ago
- Reactions: 4
- Comments: 32 (16 by maintainers)
Commits related to this issue
- Temporarily remove logging from this DAG: You can see precedent for this solution in this issue record https://github.com/apache/airflow/issues/12136 As well as this similar change https://github.co... — committed to mozilla/telemetry-airflow by chelseatroy a year ago
- Temporarily remove logging from this DAG: (#1748) You can see precedent for this solution in this issue record https://github.com/apache/airflow/issues/12136 As well as this similar change https:/... — committed to mozilla/telemetry-airflow by chelseatroy a year ago
Can you please give me an example?
@dmateusp I found an issue with the above fix. When running with the following arguments, only the first `print()` is logged; the later ones are discarded. Log output:

Expected log output, when using the following arguments (reduced sleep time to avoid `IncompleteRead`):

@ImNiceMan
`Earnest` is the name of the company that Daniel (@dmateusp) works for. He made a subclass of `KubernetesPodOperator` to patch the issue, and he put the company name at the front of the subclass name to indicate "this is a custom patch that we are using, not part of the actual Kubernetes library." So you can't import it, but you can make your own subclass in your code that fixes the issue the same way he did.
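In case it helps anyone writing such a subclass: the essence of the patch is to retry the log stream when the chunked read breaks mid-iteration. A minimal, hypothetical sketch of that retry logic — `stream_logs_with_retry` and its parameters are my own names for illustration, not anything from the (non-public) `EarnestKubernetesPodOperator`:

```python
import time


def stream_logs_with_retry(fetch_stream, retry_exceptions, max_retries=5, backoff_s=1.0):
    """Yield log lines from the iterable returned by fetch_stream(),
    re-opening the stream when one of retry_exceptions is raised
    mid-iteration (e.g. urllib3's ProtocolError in the traceback above)."""
    retries = 0
    while True:
        try:
            for line in fetch_stream():
                yield line
            return  # stream ended cleanly
        except retry_exceptions:
            retries += 1
            if retries > max_retries:
                raise  # give up: too many broken streams
            time.sleep(backoff_s)  # brief pause before re-requesting logs
```

A real subclass would pass something like `lambda: launcher.read_pod_logs(pod)` as `fetch_stream` with `urllib3.exceptions.ProtocolError` in `retry_exceptions`. Note that naively re-opening the stream can replay lines already seen unless you track timestamps or use the log API's `since_seconds` parameter.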
What? `EarnestKubernetesPodOperator` does not exist in the Airflow codebase. You must have some local modifications in your code.

The version of the Airflow CNCF provider package ->
https://pypi.org/project/apache-airflow-providers-cncf-kubernetes/3.0.0/
Hi @mikaeld, we have 2 clusters and 3 instances; this is the latest setup where I ran into it:
cluster 1 (prod): airflow-2.2.2, apache-airflow-providers-cncf-kubernetes 2.1.0 -> 5 times in the last month
cluster 2 (test): airflow-2.2.3, apache-airflow-providers-cncf-kubernetes 2.2.0 -> not noticed so far
cluster 2 (dev): airflow-2.2.3, apache-airflow-providers-cncf-kubernetes 2.2.0 -> one time in the last month
Yes, I’m having this issue running Airflow 1.10.12 with KubernetesPodOperator, which runs queries packaged in a Docker image. These are long-running queries that produce logs in Airflow for 2-4 hours; then, at 3 hrs 59 minutes of query logs, we would just get the error above, repeated every 2 seconds for 2.5 to 3.5 hours, until the task reached a failed state.
I was able to reproduce this issue after a few days of letting Bash containers run from KubernetesPodOperator, and implemented a fix:
I ran two bash tasks, one with the fix (EarnestKubernetesPodOperator) and one without (KubernetesPodOperator):
I let both of these run for a few days and the one without the fix failed a few times while the one with the fix never failed:
^ First row is `bash_echo_1` and second row is `bash_echo_2`.

In the logs of `bash_echo_1` we can see that the issue was reproduced after more than a day of running, while the same issue happened in `bash_echo_2` but was handled by the change in this PR.

The only change was to add error handling around `self.read_pod_logs` in the KubernetesPodOperator. I’ll open a PR asap 🏃
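For anyone following along before the PR lands, the shape of that error handling is roughly the following. This is a stand-in sketch, not the actual patch: `monitor_pod_logs`, `BrokenStreamError`, and the callback parameters are all illustrative (in real code the exception would be `urllib3.exceptions.ProtocolError` and the reads would go through `PodLauncher.read_pod_logs`):

```python
class BrokenStreamError(Exception):
    """Stand-in for urllib3.exceptions.ProtocolError in this sketch."""


def monitor_pod_logs(read_pod_logs, pod_is_running, on_line):
    """Consume log lines from read_pod_logs(); if the chunked stream dies
    mid-read, re-open it as long as pod_is_running() says the pod is alive,
    instead of letting the exception fail the whole task."""
    while pod_is_running():
        try:
            for line in read_pod_logs():
                on_line(line)
            return  # stream ended cleanly
        except BrokenStreamError:
            continue  # stream broke mid-read: re-request the logs
```

The key point is that the `for line in logs:` loop from `pod_launcher.py` (frame at line 155 in the traceback) sits inside a try/except, so a mid-stream disconnect becomes a reconnect rather than a task failure.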