kubernetes: Node lifecycle controller does not `markPodsNotReady` when the node `Ready` state changes from `false` to `unknown`

What happened?

When the kubelet loses its connection, the node's Ready condition becomes unknown. Because the health check status of the pods can no longer be obtained through the kubelet, the node lifecycle controller marks the pods as not ready via the markPodsNotReady function. However, this only happens when the node's Ready state transitions from true to unknown.

If the node's Ready state is already false (for example, because of a containerd failure) and the node then loses its connection, markPodsNotReady does not take effect.

https://github.com/kubernetes/kubernetes/blob/cac53883f4714452f3084a22e4be20d042a9df33/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L883-L888

In this case, the pods may incorrectly remain ready, which can cause network traffic to still be forwarded to this node.

What did you expect to happen?

Whenever the node has lost its connection for longer than the grace period, MarkPodsNotReady should always take effect.

How can we reproduce it (as minimally and precisely as possible)?

  1. Stop containerd and wait for the node's Ready state to become false.
  2. Stop the kubelet or shut down the node, and wait for the node's Ready state to become unknown.
  3. Pods on this node that have not been evicted remain ready indefinitely.

Anything else we need to know?

In the node lifecycle controller logic, MarkPodsNotReady is triggered only when a node goes from the true state to the unknown state. The correct behavior is to trigger it whenever the node enters the unknown state, regardless of whether the node's state was previously true.
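
The difference between the reported and the expected behavior can be sketched as a small self-contained Go model. This is only an illustration under stated assumptions, not the controller's actual code: the type `ConditionStatus` and the functions `reportedTrigger` and `proposedTrigger` are hypothetical names introduced here, and the reported condition is modeled from the description in this issue rather than copied from node_lifecycle_controller.go.

```go
package main

import "fmt"

// ConditionStatus is a hypothetical stand-in for the three values a node's
// Ready condition can take.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// reportedTrigger models the behavior described in this issue (assumption):
// pods are marked not ready only if the previously observed status was True.
func reportedTrigger(observed, current ConditionStatus) bool {
	return observed == ConditionTrue && current != ConditionTrue
}

// proposedTrigger models the expected behavior: in addition to the reported
// case, it fires whenever the node newly enters the Unknown state, whatever
// the previously observed status was.
func proposedTrigger(observed, current ConditionStatus) bool {
	if reportedTrigger(observed, current) {
		return true
	}
	return observed != ConditionUnknown && current == ConditionUnknown
}

func main() {
	// Each pair is (previously observed status, current status).
	transitions := [][2]ConditionStatus{
		{ConditionTrue, ConditionUnknown},
		{ConditionFalse, ConditionUnknown},
		{ConditionUnknown, ConditionUnknown},
	}
	for _, t := range transitions {
		fmt.Printf("%s -> %s: reported=%t proposed=%t\n",
			t[0], t[1], reportedTrigger(t[0], t[1]), proposedTrigger(t[0], t[1]))
	}
}
```

Of the three transitions listed, only false to unknown is handled differently by the two functions, which matches the reproduction steps above. A real fix would also need to respect the controller's retry handling, which this sketch leaves out.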

Kubernetes version

$ kubectl version
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.15", GitCommit:"1d79bc3bcccfba7466c44cc2055d6e7442e140ea", GitTreeState:"clean", BuildDate:"2022-09-22T06:03:36Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

OS version

# On Linux:
$ cat /etc/os-release

$ uname -a
5.4.119-1-tlinux4-0008 #1 SMP Fri Nov 26 11:17:45 CST 2021 x86_64 x86_64 x86_64 GNU/Linux

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 17 (11 by maintainers)

Most upvoted comments

Is this a good first issue? I would like to work on it.

Yes, I believe the change required is very localized.

There are 3 PRs targeting this bug. Please coordinate with the reviewers to avoid duplicating efforts.