kubernetes-client: watch should handle etcd old version exception

I am running Spark on Kubernetes. The full issue description is at https://issues.apache.org/jira/browse/SPARK-24266

I think the exception "too old resource version: 21648111 (21653211)" should be handled inside kubernetes-client rather than simply thrown to the caller, because the resource version is cached by kubernetes-client, not by the caller. https://github.com/fabric8io/kubernetes-client/blob/5b1a57b64c7dcc7ebeba3a7024e8615c91afaedb/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java#L259-L266

About this issue

State: closed
Created 6 years ago
Comments: 31 (11 by maintainers)

Commits related to this issue

[SPARK-24266][K8S] Restart the watcher when we receive a version changed from k8s ### What changes were proposed in this pull request? Restart the watcher when it failed with a HTTP_GONE code from t... — committed to apache/spark by stijndehaes 4 years ago
[SPARK-24266][K8S] Restart the watcher when we receive a version changed from k8s ### What changes were proposed in this pull request? Restart the watcher when it failed with a HTTP_GONE code from t... — committed to jkleckner/spark by stijndehaes 4 years ago
[SPARK-24266][K8S] Restart the watcher when we receive a version changed from k8s Restart the watcher when it failed with a HTTP_GONE code from the kubernetes api. Which means a resource version has ... — committed to jkleckner/spark by stijndehaes 4 years ago
Update our decommissioning logic to the current upstream. (#673) [SPARK-21040][CORE] Speculate tasks which are running on decommission executors This PR adds functionality to consider the running ... — committed to holdenk/spark by holdenk 4 years ago
[SPARK-24266][K8S][3.0] Restart the watcher when we receive a version changed from k8s ### What changes were proposed in this pull request? This is a straight application of #28423 onto branch-3.0 ... — committed to apache/spark by stijndehaes 4 years ago

Most upvoted comments

@manusa one big difference is that with a watcher we can watch a single pod. This is what spark-submit does when watching the driver; with a SharedInformer I am watching all the pods. Is there a way to watch a single pod? Otherwise I guess this will use more resources than needed, unless I am mistaken and the overhead is negligible.

stijndehaes on May 5, 2020

We implemented SharedInformers (#1384) a while back to mimic client-go’s behavior and provide an extra level of abstraction for Watch operations (see "Kubernetes client-go: watch.Interface vs. cache.NewInformer vs. cache.NewSharedIndexInformer?" and "Writing Controllers/SharedInformers").

Our implementation of SharedInformers already takes care of the HTTP_GONE scenario.

If you are looking for this reconnect behavior, I would encourage using SharedInformers instead of Watch, or else use watch with your own reconnect implementation. I think providing this behavior for watch too would be duplicating a feature that’s already available in Informers.
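The "own reconnect implementation" option could look roughly like the Java sketch below. It deliberately does not call the kubernetes-client API: `RewatchSketch`, `watchWithRestart`, and `startWatch` are hypothetical names, and treating a `null` resource version as "start from the current state" is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical caller-side wrapper; none of these names come from kubernetes-client.
final class RewatchSketch {
    static final int HTTP_GONE = 410; // status the API server returns for a too-old resource version

    /**
     * Starts watch sessions until one closes with a code other than 410,
     * or until maxRestarts restarts have been used.
     *
     * startWatch receives the resource version to resume from (null is assumed
     * to mean "no cached version") and returns the status code the session closed with.
     * The returned list records the resource version passed to each session.
     */
    static List<String> watchWithRestart(Function<String, Integer> startWatch,
                                         String initialVersion,
                                         int maxRestarts) {
        List<String> attempts = new ArrayList<>();
        String version = initialVersion;
        for (int i = 0; i <= maxRestarts; i++) {
            attempts.add(version);
            int code = startWatch.apply(version);
            if (code != HTTP_GONE) {
                break; // any other close reason is left to the caller
            }
            version = null; // the cached version is too old, so drop it before restarting
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Simulated server: rejects any cached version with 410, accepts a fresh watch.
        Function<String, Integer> fakeWatch = v -> v != null ? HTTP_GONE : 200;
        System.out.println(watchWithRestart(fakeWatch, "21648111", 3));
    }
}
```

In a real caller, `startWatch` would open the actual watch and block until it closes; a production version would probably also want a backoff between restarts.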

@rohanKanojia maybe we can use this issue to provide some additional examples and documentation on different use-cases for SharedInformers. I think it’s unclear that they should be the default approach to watch resources.

manusa on May 4, 2020

@stijndehaes I took a look at #1800. Would it be better to add a boolean flag that controls whether the client re-watches automatically when it receives a version change? That way we won’t break the contract of sending HTTP_GONE when the resource version is too old, and it also makes things easier for people who don’t care about the problem.
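As a rough illustration of the flag idea, the sketch below shows only the decision itself. `WatchClosePolicy`, `rewatchOnGone`, and the `Action` values are hypothetical and are not part of kubernetes-client; the default of `false` is an assumption chosen so that existing callers would keep receiving HTTP_GONE.

```java
// Hypothetical opt-in policy; not an existing kubernetes-client class.
final class WatchClosePolicy {
    static final int HTTP_GONE = 410;

    enum Action { REWATCH_FROM_LATEST, PROPAGATE_TO_CALLER }

    // Hypothetical flag: false keeps the current contract of surfacing HTTP_GONE.
    private final boolean rewatchOnGone;

    WatchClosePolicy(boolean rewatchOnGone) {
        this.rewatchOnGone = rewatchOnGone;
    }

    /** Decides what to do when a watch fails with the given HTTP status code. */
    Action onError(int statusCode) {
        if (rewatchOnGone && statusCode == HTTP_GONE) {
            return Action.REWATCH_FROM_LATEST;
        }
        return Action.PROPAGATE_TO_CALLER;
    }
}
```

Only a 410 with the flag enabled triggers a restart; every other failure is still reported to the caller.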

chenchun on May 2, 2020

@manusa found it! You can do it like this I think:

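// Assumes `informers` is a SharedInformerFactory (for example from client.informers())
// and that NAMESPACE and PODNAME are defined by the caller.
val podInformer = informers.sharedIndexInformerFor(
  classOf[Pod],
  classOf[PodList],
  // Intended to restrict the informer to a single pod instead of all pods.
  new OperationContext().withNamespace(NAMESPACE).withName(PODNAME),
  60000) // resync period in milliseconds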

stijndehaes on May 6, 2020

@yujiantao For a simple fix, you can try commenting out these lines: https://github.com/fabric8io/kubernetes-client/blob/v4.0.5/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java#L141-L143 We’ve been running with that change for a long time and everything is fine.

@chenchun Is this something we could maybe put into the client, for watches where you don’t care about version problems?

stijndehaes on Apr 30, 2020