kubernetes: Pods stuck in Pending status - kube-scheduler 1.19.13
What happened:
After upgrading the core k8s components from 1.18 to 1.19.13, we ran into a problem with Pending pods: from time to time, pods from different deployments get stuck in Pending status. The kube-scheduler logs show a new message that started appearing after the upgrade, for example: Pod kube-system/prometheus-thanos-frontend-86f4f8df77-h52tl **doesn't exist in informer cache**: pod "prometheus-thanos-frontend-86f4f8df77-h52tl" not found.
However, ClusterAutoscaler still checks whether there is an appropriate node for the pod: Pod kube-system.prometheus-thanos-frontend-86f4f8df77-h52tl marked as unschedulable can be scheduled on node ***.eu-west-1.compute.internal (based on hinting). Ignoring in scale up.
It looks like kube-scheduler simply forgot about the pod: after a couple of attempts it stops trying to schedule the pod, and we see no logs or events for hours. CA, on the other hand, still tries to get it scheduled and even spins up new nodes.
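To spot pods in this state without scanning scheduler logs, something like the following rough sketch could help. It filters the JSON produced by kubectl get pods --all-namespaces -o json for pods that are still in the Pending phase, have no spec.nodeName assigned, and are older than a threshold. The function names and the 30-minute default are assumptions made for illustration, not part of any Kubernetes tooling, and pod age is only a proxy for "the scheduler stopped retrying".

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a Kubernetes timestamp such as 2021-09-15T10:00:00Z into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def long_pending_pods(pod_list, now, threshold=timedelta(minutes=30)):
    """Return (namespace/name, age) for unassigned Pending pods older than threshold.

    pod_list is the parsed output of `kubectl get pods --all-namespaces -o json`.
    The threshold default is an arbitrary assumption; tune it for your cluster.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        spec = pod.get("spec", {})
        status = pod.get("status", {})
        # Skip pods that are not Pending, or that are already bound to a node
        # (those are Pending only because their containers have not started yet).
        if status.get("phase") != "Pending" or spec.get("nodeName"):
            continue
        age = now - parse_k8s_time(meta["creationTimestamp"])
        if age > threshold:
            stuck.append((f'{meta["namespace"]}/{meta["name"]}', age))
    # Oldest pods first.
    return sorted(stuck, key=lambda item: item[1], reverse=True)
```

One possible way to use it is to load the kubectl output with json.load(sys.stdin) and pass it in together with datetime.now(timezone.utc). Any pod it reports still needs to be cross-checked against the scheduler logs and events to confirm it is this issue rather than a genuinely unschedulable pod.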
What you expected to happen:
Pending pods are assigned to nodes.
How to reproduce it (as minimally and precisely as possible):
We don’t see any pattern. It happens from time to time for completely different deployments.
Anything else we need to know?:
If we restart kube-scheduler so that the leader changes, the Pending pods are scheduled successfully.
Environment:
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.13", GitCommit:"53c7b65d4531a749cd3a7004c5212d23daa044a9", GitTreeState:"clean", BuildDate:"2021-07-15T20:53:19Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: AWS
- OS (e.g: cat /etc/os-release): CentOS Linux 7 (Core)
- Kernel (e.g. uname -a): Linux ***.eu-west-1.compute.internal 3.10.0-1160.25.1.el7.x86_64 #1 SMP Wed Apr 28 21:49:45 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Network plugin and version (if this is a network-related bug):
- Others: ClusterAutoscaler 1.19.1, kube-scheduler 1.19.13
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 40 (34 by maintainers)
I’m on the same team as OP. The cherry-pick fixed it. Thank you all for the quick resolution on a very tight deadline.
The fix is now in 1.19.15 😃
FYI, we are a bit tight on the deadline for 1.19 patch releases. I’m asking sig-release if we can squeeze this cherry-pick in.
Yeah, I considered that, but it’s also theoretically more risky, as we don’t have soak time for such a change. See https://github.com/kubernetes/kubernetes/pull/105015#issuecomment-919574618 for context.