kubernetes: a pod with a readinessProbe temporarily reports ContainersNotReady when the kubelet restarts, which may make the service unavailable

What happened:

When a pod is configured with a readinessProbe and the kubelet restarts, the pod temporarily reports ContainersNotReady. This may make the service unavailable.

What you expected to happen:

The pod's ready status should not change just because the kubelet restarted.

How to reproduce it (as minimally and precisely as possible):

  1. Create a pod with a readinessProbe and wait for the pod to become Ready.
  2. Restart the kubelet on the pod's node and watch the pod with kubectl (see the sketch after this list).
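
A minimal way to reproduce (sketch; the pod name, image, and probe values below are illustrative assumptions, any pod with a readinessProbe should do):

    apiVersion: v1
    kind: Pod
    metadata:
      name: readiness-demo
    spec:
      containers:
      - name: app
        image: nginx
        readinessProbe:
          httpGet:
            path: /
            port: 80
          periodSeconds: 5

Watch the pod from one terminal and restart the kubelet on the pod's node from another (on systemd-based nodes):

    kubectl get pod readiness-demo -w
    systemctl restart kubelet    # run on the node hosting the pod

Per the report above, the READY column briefly drops to 0/1 after the kubelet restart and returns to 1/1 once the probe has run again.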

Anything else we need to know?:

When the kubelet builds the pod status in generateAPIPodStatus, it checks the cached probe result and the worker state:

	var ready bool
	if c.State.Running == nil {
		ready = false
	} else if result, ok := m.readinessManager.Get(kubecontainer.ParseContainerID(c.ContainerID)); ok {
		ready = result == results.Success
	} else {
		// The check whether there is a probe which hasn't run yet.
		_, exists := m.getWorker(podUID, c.Name, readiness)
		ready = !exists
	}
	podStatus.ContainerStatuses[i].Ready = ready

After a kubelet restart the probe result cache is empty, so the container's ready status falls back to whether a probe worker exists. This can cause a container that is actually ready to be reported as not ready:

	if w.containerID.String() != c.ContainerID {
		if !w.containerID.IsEmpty() {
			w.resultsManager.Remove(w.containerID)
		}
		w.containerID = kubecontainer.ParseContainerID(c.ContainerID)
		w.resultsManager.Set(w.containerID, w.initialValue, w.pod)
		// We've got a new container; resume probing.
		w.onHold = false
	}

In addition, on the first probe after the restart the worker's containerID is empty, so the worker seeds the results manager with its initialValue before any probe has actually run, which again reports the container as not ready.
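
For reference, the initialValue mentioned above depends on the probe type, and for readiness probes it is Failure. A paraphrased sketch of the assignment in the prober worker constructor (pkg/kubelet/prober/worker.go; exact code may differ between releases):

    switch probeType {
    case readiness:
        w.spec = container.ReadinessProbe
        w.resultsManager = m.readinessManager
        w.initialValue = results.Failure
    case liveness:
        w.spec = container.LivenessProbe
        w.resultsManager = m.livenessManager
        w.initialValue = results.Success
    case startup:
        w.spec = container.StartupProbe
        w.resultsManager = m.startupManager
        w.initialValue = results.Unknown
    }

So until the first real probe result is cached after the restart, a container with a readinessProbe is reported as not ready even if it was Ready before.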

Environment:

  • Kubernetes version (use kubectl version): 1.19.8
  • Cloud provider or hardware configuration: HuaweiCloud
  • OS (e.g: cat /etc/os-release): CentOS 7.6
  • Kernel (e.g. uname -a): 3.10
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 3
  • Comments: 45 (37 by maintainers)

Most upvoted comments

I agree it’s a tradeoff, but the line is more between:

  • do I want a slightly slower detection of a non-ready pod in very rare cases (“very rare” because this only helps pods that were ready but got unready just around the moment that kubelet restarted, “slightly slower” because kubelet will anyway after restart probe the containers asap and report the real/current state)
  • vs. every kubelet restart will unavoidably break every singleton-pod that was so far serving happily

Sure, I cannot speak for others, but in my observation (managing ~6k k8s clusters) that tradeoff is only a theoretical one: in 99.99% of cases the current behaviour is more harmful than helpful and makes updating/restarting kubelets a real pain.

I suggest we pick up a few probe-related changes, including this one, for 1.28. I posted issues from my notes to the google doc: https://docs.google.com/document/d/1G5nGH97s3UTANbA5IyQ7nVIHnrLKfgVZssSYnvp_qX4/edit#heading=h.8nq06lbzy2x

This one was on the list. I will push for it in 1.28

I got here from https://github.com/kubernetes/kubernetes/issues/102367 which is similar but not identical.

I think the real winning comment here is https://github.com/kubernetes/kubernetes/issues/100277#issuecomment-929549329

To emphasize: One of four cases happens

  1. The pod was not-ready; kubelet restarted; probes proved the pod was not-ready
  2. The pod was not-ready; kubelet restarted; probes proved the pod was now ready; kubelet asserts ready
  3. The pod was ready; kubelet restarted; probes proved the pod was now not-ready; kubelet asserts not-ready
  4. The pod was ready; kubelet restarted; probes proved the pod was ready;

Given today’s behavior (kubelet asserts pod not-ready upon kubelet startup), cases 1-3 are no big deal. At worst, it took a little extra time to legitimately change state (cases 2, 3).

But case 4 is bad. Let’s look at it in more detail:

t1. The pod was ready
t2. Kubelet restarted and asserted pod not-ready
t3. Kubelet does some amount of work - O(seconds or more)
t4. Probes prove the pod is ready
t5. Kubelet finally asserts pod ready.

During the t2-t5 period, the pod will be removed from service endpoints. EndpointSlice updates will be propagated to all nodes, where iptables and ebpf programs and IPVS and so on will be changed. Cloud LBs may be reprogrammed. Services with a single endpoint (not uncommon!!!) will be disrupted. That’s a lot of work which will be undone again momentarily. For each pod on that node. Consider 100 pods on a node. That’s 200 kube API operations, plus all the watches and 2nd order work, to arrive back where we were before. It’s especially bad in large clusters where endpoint propagation is amplified by the number of nodes AND the fact that, given enough nodes, it’s fairly likely that at least one is being updated or rebooted at any moment in time.

Now let’s consider what happens if we change the behavior to “let it ride” until we have confirmed probe results.

Cases 1 and 4 are actually ideal. The state before kubelet restarted was correct. In case 2 it took a little longer than usual for the pod to be marked ready. In case 3 the pod stayed in service a little longer than it should have. This is not a huge deal and happens any time there is a “surprise” change (e.g. the node crashed).

We save a TON of wasted work and we trade the possibility of serving some errors (only if the pod legitimately became unready during the kubelet restart) for the possibility of serving some errors (if any service has only 1 pod). That seems like a good trade to me.

The counter-argument is that these single-pod services are ALWAYS susceptible to outage any time they update. Yep. That doesn’t mean we should waste all the effort to force that to happen.

So I’m asking that we reconsider this one - it seems wrong to me.

@matthyx

I think it is reasonable from the kubelet point of view to not assume the container is ready without having run a probe itself

Yes, but without probing, the kubelet also shouldn’t assume that the container is not ready. As far as I understood, the kubelet currently assumes that the container is not ready although it has no knowledge, since it didn’t probe it. It’s pretty disturbing to have interruptions of healthy containers just because the kubelet restarted (sure, you can have replicas, but you can also have workloads that will still face short interruptions).

I keep snoozing this for two weeks at a time, hoping SOMEONE will come back to it.

I feel very strongly that https://github.com/kubernetes/kubernetes/issues/100277#issuecomment-929522504 is an incorrect assessment. It’s not “the possibility that the cached status is wrong and a pod will be marked as ready when it’s not” vs. “the delay in marking the pod as ready”. It’s about the act of changing a pod from ready to unready without a justified reason. “I don’t know” is not more trustworthy than “let it ride”. In fact, it’s less material.

https://github.com/kubernetes/kubernetes/issues/100277#issuecomment-929549329 is correct to me.

I verified that this is still the case (not quite HEAD, but a few weeks stale build).

I don’t know what the “breaking change” alluded to would be, but I keep hearing about this case from people. IMO we need to fix it.

/cc @mrunalp @SergeyKanzhelev @bobbypage @derekwaynecarr

I don’t think I see how it is a breaking change. Can you help me understand?

The worst that I see is that a pod which ACTUALLY changes from “ready” to “unready” WHILE the kubelet is restarting would be kept in an LB slightly longer. It doesn’t hold up for me.

  1. This is a fairly tight race condition
  2. Probes which set a failureThreshold > 1 (the default is 3, so “most of them”) have an extended period already
  3. If kubelet doesn’t come back RIGHT AWAY, the pod coasts in the previous state (ready) anyway

So, either I don’t know what the breaking change is (please hit me with a clue stick) or it’s (IMO) exceedingly unlikely to matter, vs the ACTUAL harm done by pulling endpoints out of and back into load balancers for no reason.

/triage accepted

@Dingshujie do you mean that pods with containers that have a startupProbe definition, and that were ready and running for some time, will be probed by the startupProbe again after a kubelet restart?

@szuecs I did a test on k8s 1.18: the startupProbe runs again after a kubelet restart, and the readiness probe waits for it to finish.

https://github.com/kubernetes/kubernetes/issues/118245

Is there any update on this? The target is 1.28, but it looks like there is no PR for it yet.

I did a test in my k8s 1.18 env.
Both the pod ready status and the container ready status change during a kubelet restart.

  • kubelet restarts

  • kubelet marks the node status as NotReady

  • the node lifecycle controller immediately marks the pod as not ready

  • kubelet marks the pod’s container that has a readinessProbe as not ready

  • kubelet marks the node as Ready

  • kubelet marks the pod and container as ready

So even if we change the behavior so that the container status does not default to false, things do not get better, because the node becomes NotReady and then the pods on that node are marked as not ready anyway.

Which is worse for your application: the possibility that the cached status is wrong and a pod will be marked as ready when it’s not, which will cause a failed request, or the delay in marking the pod as ready while ensuring it is running as expected? That is the tradeoff.

Sure, I guess no one is asking to flip the default. The ask is instead to NOT flip anything on kubelet restart but to stick to the existing container status as reported/written down in the respective pod resource.

I think this is a feature request to check the cached status from the CRI in cases where avoiding disruption matters more than strict correctness. Anything else would be a breaking change to the default behaviour. (Feel free to file that as a separate feature request; it will probably need a small KEP.)