kubernetes: Endpoints doesn't reconcile in some cases from instance that is shutting down

What happened?

Hi,

During the shutdown process of a control-plane instance, the apiserver attempts to remove its master lease object from etcd. It may not find the lease object in storage if the lease has already expired by the time the apiserver attempts to delete it here; in such cases it errors out here with the following error (the exact message depends on the k8s version it's on):

controller.go:184] StorageError: key not found, Code: 1, Key: /registry/masterleases//172.18.100.37, 

This happens because the storage delete errors out at this line here.

As a result, the endpoint is not reconciled during the shutdown process and the terminating instance is not removed from the Kubernetes Service backend. Some clients may keep attempting to connect to the old instance for a period of time, because kube-proxy won't update the endpoints until the reconciler loop on another instance kicks in and removes the stale endpoint. A minimal sketch of this code path follows.
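
For context, here is a minimal, self-contained sketch of the behavior described above. The names (`Leases`, `RemoveLease`, `RemoveEndpoints`, `doReconcile`) are approximations of the lease-based endpoint reconciler, not the exact upstream code; the point is only that a lease-deletion error short-circuits the endpoint reconciliation.

```go
package main

import (
	"errors"
	"fmt"
)

// Leases abstracts the master-lease storage in etcd.
type Leases interface {
	RemoveLease(ip string) error
}

type reconciler struct {
	masterLeases Leases
}

// RemoveEndpoints is called during apiserver shutdown. An error while
// deleting the lease (e.g. "key not found" because the lease already
// expired) aborts the whole call, so doReconcile never runs and the
// stale endpoint stays behind the kubernetes Service.
func (r *reconciler) RemoveEndpoints(serviceName, ip string) error {
	if err := r.masterLeases.RemoveLease(ip); err != nil {
		return err // reconciliation is skipped on storage errors
	}
	return r.doReconcile(serviceName)
}

func (r *reconciler) doReconcile(serviceName string) error {
	fmt.Println("reconciling endpoints for service", serviceName)
	return nil
}

// expiredLeases simulates etcd having already expired the lease key.
type expiredLeases struct{}

func (expiredLeases) RemoveLease(ip string) error {
	return errors.New("StorageError: key not found")
}

func main() {
	r := &reconciler{masterLeases: expiredLeases{}}
	if err := r.RemoveEndpoints("kubernetes", "172.18.100.37"); err != nil {
		// The stale endpoint is never removed during shutdown.
		fmt.Println("shutdown reconciliation failed:", err)
	}
}
```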

Example of the connection-refused errors that client applications experience because of this issue:

no more retries error: unable to recognize "/tmp/manifest.yaml": Get "https://10.100.0.1:443/api?timeout=32s": dial tcp 10.100.0.1:443: connect: connection refused","object":{"apiVersion":"v1","count":1,"eventTime":null,"firstTimestamp":"2022-11-09T22:25:21Z","involvedObject":

What did you expect to happen?

Given that there are cases where shutdown takes longer than expected or doesn't end gracefully, the shutdown process could be more resilient and tolerate errors from storage, continuing with reconciliation anyway (IIUC). So I propose a fix here: swallow/log the error when deleting the lease object from storage fails, and continue reconciling to update the Endpoints object.

What this gives us: when the master lease object has not expired in etcd, the change is a no-op and things continue to work the way they do today, i.e. the lease is removed from storage successfully and then the endpoints are reconciled. But when the master lease object has already expired in etcd, the delete fails to find the key, as in this case; with this potential fix (swallowing and logging the error), reconciliation continues, and this code here reads the up-to-date set of master endpoints from etcd at reconciliation time anyway. That keeps the endpoints behind the Kubernetes Service up to date during the shutdown process, without having to wait for the next periodic reconciler or for another instance's controller loop to kick in. A sketch of the proposed change is shown below.
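
To make the proposal concrete, here is a sketch of the change under the same assumed names as the snippet above (again, not the exact upstream code): log the lease-deletion error instead of returning it, so the endpoint reconciliation still runs.

```go
// Proposed behavior: a failure to delete the lease (e.g. it already
// expired) no longer blocks removing this instance's endpoint from the
// kubernetes Service. Names are the illustrative ones from the sketch above.
func (r *reconciler) RemoveEndpoints(serviceName, ip string) error {
	if err := r.masterLeases.RemoveLease(ip); err != nil {
		// Swallow and log the storage error, then continue so the
		// stale endpoint is still removed during shutdown.
		fmt.Printf("removing master lease for %s failed: %v; continuing to reconcile endpoints\n", ip, err)
	}
	return r.doReconcile(serviceName)
}
```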

If you like the proposed solution above, I can submit a PR for it. Please let me know. Thank you.

How can we reproduce it (as minimally and precisely as possible)?

  1. Let clients continue to make new connections every second to the instance you are about to terminate (see the client sketch after this list).
  2. Delay the shutdown process until the master lease object in etcd expires for that instance.
  3. Let the apiserver continue with the shutdown process so it fails with the error
controller.go:184] StorageError: key not found, Code: 1, Key: /registry/masterleases//172.18.100.37, 
  4. Check the client logs for connection refused errors caused by clients still trying to connect to the old instance (a tcpdump may help here to confirm the connections were going to the IP/endpoint of the old instance and failing with connection refused at the network layer).
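
For step 1, a minimal client loop such as the following can be used to keep new connections flowing and to surface the connection-refused errors from step 4. The address is only an example (it matches the Service IP in the error above); substitute your cluster's `kubernetes` Service ClusterIP.

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Example ClusterIP of the kubernetes Service; replace with yours
	// (e.g. from `kubectl get svc kubernetes`).
	const addr = "10.100.0.1:443"
	for {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err != nil {
			// With a stale endpoint still in place, this surfaces as
			// "dial tcp 10.100.0.1:443: connect: connection refused".
			log.Printf("dial failed: %v", err)
		} else {
			log.Printf("connected to %s", conn.RemoteAddr())
			conn.Close()
		}
		time.Sleep(1 * time.Second)
	}
}
```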

Anything else we need to know?

No response

Kubernetes version

I have seen this issue happening on 1.20 and 1.23.

```console
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.15-eks-6d3986b", GitCommit:"d4be14f563712c4e1964fe8a4171ca353b6e7e1a", GitTreeState:"clean", BuildDate:"2022-07-20T22:07:15Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
```

```console
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.12-eks-1558457", GitCommit:"5e45ab4a299ac2f7c64cbc46f285b535ee32bdb8", GitTreeState:"clean", BuildDate:"2022-10-06T21:29:14Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}
```

But I think this may happen on any version, as this part of the code hasn't been updated on master IIUC.

Cloud provider

AWS, but it can likely happen anywhere.

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, …) and versions (if applicable)

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

Thanks @aojea for effectively debugging this case. Yes - it’s exactly this PR that broke that.

Actually, all the credit must go to @hakuna-matatah: great report, and he was the one who identified the PR and the double-slash problem.

I feel like we should not block an apiserver instance from removing its endpoint from the Kubernetes Service because of storage failures like a missing key, being unable to connect to etcd due to a network glitch, etc. Do you agree with that?

🤔 the other apiservers will recycle the endpoints too, and 5 seconds to expire the lease while you clean the endpoints sounds like too much… I don't know, I don't like the idea of having an apiserver with network problems rewriting the endpoints…

Maybe I'm very cautious with these things, but I would not be in favor of making changes like this without a test reproducing the problem or showing the improvement…

I will defer to others; I've expressed my opinion, but it will be good to hear from others.

/assign @wojtek-t @lavalamp

Thoughts ?