kubernetes: Pods fail with "NodeAffinity failed" after kubelet restarts

What happened:

The issue is basically the same as https://github.com/kubernetes/kubernetes/issues/92067.

With the fix https://github.com/kubernetes/kubernetes/pull/94087 in place, the kubelet waits for the node lister to sync in GetNode().

However, in the case of a kubelet restart, pods scheduled on the node before the restart might still fail with “NodeAffinity failed” after the restart. Looking at the code, this is probably because the pod admission check (canAdmitPod()) can happen before GetNode().
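
For illustration only, here is a minimal Go sketch of the suspected race. The node type, getNodeAnyway, and admitPod below are hypothetical stand-ins, not the actual kubelet code: if the affinity check runs against a cached node object whose labels have not been re-synced yet, a pod that matches the real node is still rejected.

package main

import (
	"fmt"
	"sync"
	"time"
)

// node is a stand-in for the kubelet's cached view of its own Node object.
type node struct {
	mu     sync.Mutex
	labels map[string]string
}

// getNodeAnyway returns whatever labels are currently cached, even if the
// node informer has not finished syncing after a restart.
func (n *node) getNodeAnyway() map[string]string {
	n.mu.Lock()
	defer n.mu.Unlock()
	cp := make(map[string]string, len(n.labels))
	for k, v := range n.labels {
		cp[k] = v
	}
	return cp
}

// sync simulates the node informer catching up shortly after a kubelet restart.
func (n *node) sync() {
	time.Sleep(50 * time.Millisecond) // informer lag
	n.mu.Lock()
	n.labels["topology.kubernetes.io/zone"] = "us-central1-a"
	n.mu.Unlock()
}

// admitPod mimics a NodeAffinity predicate: the pod requires the zone label.
func admitPod(nodeLabels map[string]string) error {
	if nodeLabels["topology.kubernetes.io/zone"] == "" {
		return fmt.Errorf("Predicate NodeAffinity failed")
	}
	return nil
}

func main() {
	n := &node{labels: map[string]string{}}
	go n.sync()

	// Re-admission of previously running pods happens right after the restart,
	// potentially before the labels are back, so the check can fail even though
	// the real node satisfies the pod's affinity.
	if err := admitPod(n.getNodeAnyway()); err != nil {
		fmt.Println("pod rejected:", err)
	} else {
		fmt.Println("pod admitted")
	}
}

Waiting for the lister to sync before admitting (which is what GetNode() does after #94087) would avoid the rejection; the suspicion above is that the admission path may not go through that wait.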

What you expected to happen:

After a kubelet restart, old pods (pods scheduled on the node before the restart) should not fail with “NodeAffinity failed”.

How to reproduce it (as minimally and precisely as possible):

This issue does not happen all the time. To reproduce it, you will need to keep restarting the kubelet, and you might see a previously running Pod start to fail with “Predicate NodeAffinity failed”.
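
A rough reproduction loop, written as a Go helper; it assumes a systemd-managed kubelet and kubectl access on the node (both are assumptions about the environment; adjust as needed):

package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func main() {
	for i := 1; i <= 50; i++ {
		// Restart the kubelet (assumes a systemd-managed kubelet).
		if err := exec.Command("systemctl", "restart", "kubelet").Run(); err != nil {
			fmt.Println("restart failed:", err)
			return
		}
		// Give the kubelet time to re-admit the pods that were already running.
		time.Sleep(30 * time.Second)

		// Check whether any previously running pod now reports a NodeAffinity failure.
		out, err := exec.Command("kubectl", "get", "pods", "-A", "-o", "wide").Output()
		if err != nil {
			fmt.Println("kubectl failed:", err)
			return
		}
		for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
			if strings.Contains(line, "NodeAffinity") {
				fmt.Printf("reproduced on iteration %d: %s\n", i, line)
				return
			}
		}
		fmt.Printf("iteration %d: no NodeAffinity failures yet\n", i)
	}
}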

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 10
  • Comments: 28 (16 by maintainers)

Most upvoted comments

This is reproducible on GKE v1.19.11-gke.2101 as well; @phantooom, please consider reopening.

The issue still reproduces in GKE 1.19.8-gke.1600.

This sounds like something that regressed after the node sync changes, but the second change (the one I made) did not fix it.

The change I made: https://github.com/kubernetes/kubernetes/pull/99336

It was technically a refactor of what was already established by the previous change: https://github.com/kubernetes/kubernetes/pull/94087

However, in the case of a kubelet restart, pods scheduled on the node before the restart might still fail with “NodeAffinity failed” after the restart. Looking at the code, this is probably because the pod admission check (canAdmitPod()) can happen before GetNode().

This seems racy and should be brought up for discussion at the SIG Node meeting: https://github.com/kubernetes/community/tree/master/sig-node#meetings

Kubelet maintainers who are more familiar with this code path should be able to reproduce it:

This issue does not happen all the time. To reproduce it, you will need to keep restarting the kubelet, and you might see a previously running Pod start to fail with “Predicate NodeAffinity failed”.

We have a lot of GKE reporters in this ticket. Has anyone seen the problem on non-GKE clusters?

After upgrading GKE to v1.18.19-gke.1700, I experienced the same issue: some of the pods moved to NodeAffinity status after node preemption.

kubectl get pods -o wide --all-namespaces | grep NodeAffinity
app              app-cd5d5595f-tkw9p                          0/5     NodeAffinity

It should be fixed in v1.18.19, v1.19.10, v1.20.7, and v1.21.1.

For GKE upgrades, I think this should be raised with GKE support. /triage duplicate /close

Same here; as far as I know, this is fixed in 1.18.19.

The fix in https://github.com/kubernetes/kubernetes/pull/99336#issuecomment-824441152 was cherry-picked to 1.18 in https://github.com/kubernetes/kubernetes/pull/101343.

This also affects versions up to 1.21, by the way; check that PR to see the commit for each version.

We were first affected by this issue after upgrading our GKE cluster from v1.17.17-gke.2800 to 1.18.17-gke.700, for pods running on preemptible nodes. Is this specific to k8s 1.18+?

FYI, this is also present in GKE 1.18.17-gke.700. I had hoped they would backport the patch, since the .700 release reached the stable channel yesterday, but that is not the case.

Luckily for us, this is only an issue with preemptible nodes, since preemption is effectively a node restart.

I will wait impatiently for 1.18.19. 🤞

The issue still reproduces in GKE 1.19.8-gke.1600.

same