airflow: KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state

Apache Airflow version

2.2.3 (latest released)

What happened

After upgrading Airflow to 2.2.3 (from 2.2.2) and the cncf.kubernetes provider to 3.0.1 (from 2.0.3), we started to see these errors in the logs:

{"asctime": "2022-01-25 08:19:39", "levelname": "ERROR", "process": 565811, "name": "airflow.executors.kubernetes_executor.KubernetesJobWatcher", "funcName": "run", "lineno": 111, "message": "Unknown error in KubernetesJobWatcher. Failing", "exc_info": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 102, in run\n    self.resource_version = self._run(\n  File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 145, in _run\n    for event in list_worker_pods():\n  File \"/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py\", line 182, in stream\n    raise client.rest.ApiException(\nkubernetes.client.exceptions.ApiException: (410)\nReason: Expired: too old resource version: 655595751 (655818065)\n"}
Process KubernetesJobWatcher-6571:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 102, in run
    self.resource_version = self._run(
  File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
    for event in list_worker_pods():
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 655595751 (655818065)

Pods are created and run to completion, but the KubernetesJobWatcher seems incapable of seeing that they have completed. From there Airflow comes to a complete halt.
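For context, the usual way a watch consumer recovers from an HTTP 410 is to discard the stale resource_version, re-list, and resume watching from the current state; otherwise it keeps retrying with the expired version and hits the 410 again. The sketch below is not Airflow's actual executor code, just a minimal illustration of that pattern with the kubernetes Python client (the namespace and label selector are made-up examples):

```python
# Hypothetical sketch, not Airflow's executor code: survive an HTTP 410 by
# dropping the stale resource_version and re-watching from the current state.
from kubernetes import client, config, watch

config.load_incluster_config()      # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()
namespace = "airflow"               # assumption: example namespace
resource_version = None             # None means "start from a fresh list"

while True:
    try:
        stream = watch.Watch().stream(
            v1.list_namespaced_pod,
            namespace,
            label_selector="airflow-worker",   # assumption: example selector
            resource_version=resource_version,
        )
        for event in stream:
            pod = event["object"]
            resource_version = pod.metadata.resource_version
            # ... report pod phase changes back to the executor ...
    except client.exceptions.ApiException as exc:
        if exc.status == 410:
            # "Expired: too old resource version": forget the stored version
            # so the next watch starts from the API server's current state.
            resource_version = None
        else:
            raise
```

The infinite-loop discussion further down in this thread revolves around exactly this resume step: if the expired version is never reset, the watcher fails with the same 410 every time it is restarted.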

What you expected to happen

No errors in the logs, and the job watcher does its job of collecting completed jobs.

How to reproduce

I wish I knew. I am trying to downgrade the cncf.kubernetes provider to previous versions to see if that helps.

Operating System

k8s (Airflow images are Debian-based)

Versions of Apache Airflow Providers

apache-airflow-providers-amazon 2.6.0
apache-airflow-providers-cncf-kubernetes 3.0.1
apache-airflow-providers-ftp 2.0.1
apache-airflow-providers-http 2.0.2
apache-airflow-providers-imap 2.1.0
apache-airflow-providers-postgres 2.4.0
apache-airflow-providers-sqlite 2.0.1

Deployment

Other

Deployment details

The deployment is on k8s v1.19.16, made with Helm 3.

Anything else

The symptoms look a lot like #17629, but the failure happens in a different place. Redeploying as suggested in that issue seemed to help, but most jobs that were supposed to run last night got stuck again. All jobs use the same pod template, without any customization.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 35 (26 by maintainers)

Most upvoted comments

@snjypl

Like I said before, my only concern is that someone who reads this thread may get the false impression that watch bookmarks are needed to stop the scheduler from getting stuck in an infinite loop, and some people may even get the idea that we need a new kubernetes Python client before we can solve this, which is not the case.

I just tried to explain that watch bookmarks are not needed to solve the infinite loop (which is what #21087 is about, IMHO), and that watch bookmarks alone will not prevent the 410, at least in the scenario that I'm personally experiencing (which I already explained).
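For readers unfamiliar with the term: watch bookmarks are periodic BOOKMARK events that carry only an up-to-date resourceVersion, so a client that stores it has a fresher version to resume from and hits 410 less often. A rough sketch of what consuming them looks like, assuming a kubernetes Python client recent enough to expose allow_watch_bookmarks (the namespace is an example):

```python
# Hedged sketch, not Airflow code: consuming watch bookmarks, assuming a
# kubernetes Python client new enough to accept allow_watch_bookmarks.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

resource_version = None
for event in watch.Watch().stream(
    v1.list_namespaced_pod,
    "airflow",                      # assumption: example namespace
    allow_watch_bookmarks=True,     # ask the API server to send BOOKMARK events
    timeout_seconds=300,
):
    if event["type"] == "BOOKMARK":
        # A bookmark carries no pod change, only a fresh resource_version
        # to resume from after the watch connection expires.
        resource_version = event["object"].metadata.resource_version
        continue
    # ... normal ADDED/MODIFIED/DELETED handling, also updating resource_version ...
```

As argued above, this only reduces how often the stored version goes stale; it does not remove the need to handle the 410 and re-list when it does happen.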

I am just trying to help understand the root cause of the 410 and possible ways to prevent it from recurring.

One of the root causes would be the one that I explained, which I don't think can be prevented from recurring.

Both parts are equally important.

Well, I think a) is more important, and that's why I'm so adamant about making sure that a) is done and not delayed waiting for kubernetes client updates, etc. Just b) ("…reduce the frequency of the error") does not make a big difference: whether my scheduler gets into an infinite loop every hour or every day does not matter that much, since both cases are unacceptable for me.

I totally agree to leave it here. I think with these last two posts it is clear what you mean and what I mean.

That's good to know. Sadly, we had to choose: stick to 2.2.3 and be able to use many of the nice features SQLAlchemy 1.4 brings (with a few quirks, but we managed 😅), or upgrade (2.2.4 comes with #21235). We chose the former. But that's an entirely different issue 😁.

Ah yeah. The “quirks” are the reason we put < 1.4 in. It's easy to handle the quirks when you are an individual user who is dedicated to handling them, but when you need to handle a flurry of issues from 1000s of users who expect it to “just work” - we chose the < 1.4 😃.

But worry not - 2.3.0 is already >= 1.4 😃. And soon(ish) it will be out.

@potiuk Please keep in mind that, as soon as

  • 2.3.0 is out - with kubernetes-python pin lifted
  • and people start upgrading their kubernetes-python (there are very good reasons to)

this ticket will indeed receive a flurry of comments 😃