airflow: KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state
Apache Airflow version
2.2.3 (latest released)
What happened
After upgrading Airflow to 2.2.3 (from 2.2.2) and cncf.kubernetes provider to 3.0.1 (from 2.0.3) we started to see these errors in the logs:
{"asctime": "2022-01-25 08:19:39", "levelname": "ERROR", "process": 565811, "name": "airflow.executors.kubernetes_executor.KubernetesJobWatcher", "funcName": "run", "lineno": 111, "message": "Unknown error in KubernetesJobWatcher. Failing", "exc_info": "Traceback (most recent call last):\n File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 102, in run\n self.resource_version = self._run(\n File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 145, in _run\n for event in list_worker_pods():\n File \"/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py\", line 182, in stream\n raise client.rest.ApiException(\nkubernetes.client.exceptions.ApiException: (410)\nReason: Expired: too old resource version: 655595751 (655818065)\n"}
Process KubernetesJobWatcher-6571:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 102, in run
    self.resource_version = self._run(
  File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
    for event in list_worker_pods():
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 655595751 (655818065)
Pods are created and run to completion, but it seems the KubernetesJobWatcher cannot see that they completed. From there, Airflow comes to a complete halt.
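To illustrate the failure mode, here is a minimal sketch using the plain kubernetes Python client (not Airflow's actual executor code; the namespace and label selector are placeholders): a watch that resumes from a stored resource_version gets HTTP 410 once that version has been compacted out of the API server's watch history, and retrying with the same stale version just fails again, which matches what we see in the logs.

```python
from kubernetes import client, config, watch

config.load_incluster_config()  # running inside the cluster
v1 = client.CoreV1Api()

resource_version = "655595751"  # stale version, as in the log above
w = watch.Watch()
try:
    for event in w.stream(
        v1.list_namespaced_pod,
        namespace="airflow",              # placeholder
        label_selector="airflow-worker",  # placeholder
        resource_version=resource_version,
    ):
        # the watcher normally records the latest version from each event
        resource_version = event["object"].metadata.resource_version
except client.exceptions.ApiException as e:
    if e.status == 410:
        # "Expired: too old resource version" -- recovering requires a fresh
        # list (resource_version=None) rather than resuming from the stale one
        resource_version = None
```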
What you expected to happen
No errors in the logs and the job watcher does it’s job of collecting completed jobs.
How to reproduce
I wish I knew. Trying to downgrade the cncf.kubernetes provider to previous versions to see if it helps.
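For reference, the downgrade is just a matter of pinning the provider back in our image, e.g. `pip install "apache-airflow-providers-cncf-kubernetes==2.0.3"`.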
Operating System
k8s (Airflow images are Debian based)
Versions of Apache Airflow Providers
apache-airflow-providers-amazon 2.6.0
apache-airflow-providers-cncf-kubernetes 3.0.1
apache-airflow-providers-ftp 2.0.1
apache-airflow-providers-http 2.0.2
apache-airflow-providers-imap 2.1.0
apache-airflow-providers-postgres 2.4.0
apache-airflow-providers-sqlite 2.0.1
Deployment
Other
Deployment details
The deployment is on k8s v1.19.16, deployed with Helm 3.
Anything else
The symptoms look a lot like #17629, but the error happens in a different place. Redeploying as suggested in that issue seemed to help, but most jobs that were supposed to run last night got stuck again. All jobs use the same pod template, without any customization.
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 35 (26 by maintainers)
Commits related to this issue
- Allow deploying latest build again The current deployable check does not allow to redeploy the same deployment once again. Meaning if your deployment gets corrupted somehow you cannot delete it and d... — committed to jobteaser/circleci by cansjt 2 years ago
- Allow deploying latest build again (#114) The current deployable check does not allow to redeploy the same deployment once again. Meaning if your deployment gets corrupted somehow you cannot delete... — committed to jobteaser/circleci by cansjt 2 years ago
@snjypl
Like I said before, my only concern is that someone reading this thread may get the false impression that watch bookmarks are needed to stop the scheduler from getting stuck in an infinite loop, and some may even get the idea that we need a new kubernetes Python client before we can solve this, which is not the case.
I just tried to explain that watch bookmarks are not needed to solve the infinite loop (which is what #21087 is about, IMHO), and that watch bookmarks alone will not prevent the 410, at least in the scenario that I'm personally experiencing (which I already explained). See the sketch below for what I mean by the two points.
One of the root causes would be the one that I explained, which I don't think can be prevented from recurring.
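To make that concrete, here is roughly what the bookmark side looks like with the plain kubernetes Python client (a sketch, not Airflow's code, and assuming a client/cluster combination that actually supports watch bookmarks; the namespace is a placeholder). Bookmarks only keep the stored resource_version fresher, so the 410 happens less often; they do not remove the need to fall back to a full re-list when a 410 does occur.

```python
from kubernetes import client, config, watch

config.load_incluster_config()
v1 = client.CoreV1Api()

resource_version = None  # start with a fresh list
w = watch.Watch()
while True:
    try:
        for event in w.stream(
            v1.list_namespaced_pod,
            namespace="airflow",             # placeholder
            allow_watch_bookmarks=True,      # ask the API server for bookmarks
            resource_version=resource_version,
        ):
            if event["type"] == "BOOKMARK":
                # bookmarks refresh the version even when no pods change,
                # which makes a later 410 less likely -- but not impossible
                resource_version = event["object"].metadata.resource_version
                continue
            resource_version = event["object"].metadata.resource_version
            # ... process ADDED / MODIFIED / DELETED events here ...
    except client.exceptions.ApiException as e:
        if e.status != 410:
            raise
        # the only safe recovery from a 410 is to drop the stale version and
        # re-list; without this the watcher just loops on the same error
        resource_version = None
```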
Well, I think a) is more important, and that's why I'm so adamant about making sure that a) is done and not delayed waiting for Kubernetes client updates, etc. Since just
b) ...reduce the frequency of the error.
does not make a big difference: if my scheduler gets into an infinite loop every hour or every day, it does not matter that much. Both cases are unacceptable for me. I totally agree to leave it here; I think with these last two posts it is clear what you mean, and what I mean.
@potiuk Please keep in mind that, as soon as
this ticket will indeed receive a flurry of comments 😃