kubernetes: Last apiserver to shut down doesn't remove its IP from k8s svc endpoints

What happened?

As part of the apiserver’s pre-shutdown sequence, we stop the lease controller from renewing the apiserver’s etcd lease and then delete the lease:

https://github.com/kubernetes/kubernetes/blob/125e38c0872521bec49351415cc9fb624b046659/pkg/controlplane/controller.go#L211-L215

In the RemoveEndpoints call, we expect that once the lease is deleted, the apiserver’s IP is also removed from the kubernetes service endpoints object:

https://github.com/kubernetes/kubernetes/blob/125e38c0872521bec49351415cc9fb624b046659/pkg/controlplane/reconcilers/lease.go#L315-L321
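The expected flow can be sketched as follows. This is a simplified, hypothetical model (the `cluster` type, `shutdown`, and `removeEndpoint` names are illustrative, not the real kubernetes types), showing the ordering the issue describes: lease deleted first, then the IP dropped from the endpoints object.

```go
package main

import "fmt"

// cluster is a toy model of the state relevant to this issue:
// live apiserver leases and the IPs in the kubernetes svc endpoints object.
type cluster struct {
	leases    map[string]bool // apiserver IP -> has a live etcd lease
	endpoints []string        // IPs listed in the kubernetes svc endpoints
}

// shutdown models the pre-shutdown sequence: stop renewing and delete the
// lease, then expect the endpoint reconciler to remove the IP as well.
func (c *cluster) shutdown(ip string) {
	delete(c.leases, ip) // lease controller stopped, lease deleted
	c.removeEndpoint(ip) // expected: IP disappears from endpoints too
}

// removeEndpoint drops ip from the endpoints list (the behavior
// RemoveEndpoints is expected to produce).
func (c *cluster) removeEndpoint(ip string) {
	kept := c.endpoints[:0]
	for _, e := range c.endpoints {
		if e != ip {
			kept = append(kept, e)
		}
	}
	c.endpoints = kept
}

func main() {
	c := &cluster{
		leases:    map[string]bool{"10.0.0.1": true},
		endpoints: []string{"10.0.0.1"},
	}
	c.shutdown("10.0.0.1")
	fmt.Println(len(c.endpoints)) // expected behavior: 0 IPs remain
}
```

With the expected behavior, shutting down even the last apiserver leaves the endpoints list empty; the bug below is that the real reconciler never gets there.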

However, doReconcile doesn’t remove the IP if it’s the last remaining one:

https://github.com/kubernetes/kubernetes/blob/125e38c0872521bec49351415cc9fb624b046659/pkg/controlplane/reconcilers/lease.go#L210-L215

The comment there assumes that doReconcile is only ever called after renewing the lease, which is incorrect: it can also be called as part of the shutdown workflow.
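A minimal sketch of the problematic guard (not the actual kubernetes code; `reconcileEndpoints` and its signature are illustrative): when the set of master IPs read from storage is empty, the reconciler refuses to touch the endpoints object at all, so the last apiserver’s IP is never removed on shutdown.

```go
package main

import "fmt"

// reconcileEndpoints is a simplified stand-in for doReconcile: it returns
// the endpoints list that should be stored. The guard below assumes an
// empty master-IP list means a transient storage problem rather than a
// deliberate shutdown, and bails out instead of emptying the endpoints.
func reconcileEndpoints(storedMasterIPs, currentEndpoints []string) ([]string, error) {
	if len(storedMasterIPs) == 0 {
		// Mirrors the "refusing to erase all endpoints" error seen in
		// the logs below: the stale endpoints are left untouched.
		return currentEndpoints, fmt.Errorf("no master IPs were listed in storage, refusing to erase all endpoints for the kubernetes service")
	}
	return storedMasterIPs, nil
}

func main() {
	// During shutdown of the last apiserver, its lease has already been
	// deleted, so storage lists no master IPs -- and the stale IP survives.
	eps, err := reconcileEndpoints(nil, []string{"10.0.0.1"})
	fmt.Println(eps, err != nil) // prints: [10.0.0.1] true
}
```

The guard is reasonable for the renew path (an empty read from storage there really would indicate a problem), but on the shutdown path an empty list is exactly what a correct removal of the last IP looks like.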

What did you expect to happen?

The apiserver’s IP (even if it’s the last one remaining) should be removed from the k8s svc endpoints object on shutdown.

How can we reproduce it (as minimally and precisely as possible)?

Shut down the last apiserver of the cluster and you’ll see logs like these:

I1031 17:34:49.517758      12 controller.go:181] Shutting down kubernetes service endpoint reconciler
E1031 17:34:49.534880      12 controller.go:184] no master IPs were listed in storage, refusing to erase all endpoints for the kubernetes service

And the audit logs also confirm that the IP isn’t removed from k8s endpoints object.

Anything else we need to know?

Why is the above an issue?

If the IP isn’t removed from the k8s endpoints object, clients (using in-cluster API access) will keep trying to connect to that instance even after the apiserver is dead. There are at least two problems with this:

  • It causes confusing i/o timeout and connection refused errors for clients, instead of a cleaner no route to host error
  • If the apiserver pod is restarted by kubelet (say, due to a healthcheck failure), clients may start talking to it prematurely, before healthz/readyz pass. This is not recommended behavior (xref), as post-startup hooks such as informer sync, internal controllers, etc. may not have initialized yet

/sig api-machinery

About this issue

  • State: closed
  • Created a year ago
  • Comments: 24 (21 by maintainers)

Most upvoted comments

Yes, I’ll send out the fix (or find someone who wants to do it) by end of this week. Also gives some time for folks to chime in with any other concerns.