kubernetes: Static pods never start with Kubelet v1.19.0-alpha.1/beta.2 on OSes with SMT disabled
What happened:
Kubelets between v1.19.0-alpha.1 and v1.19.0-beta.1 (the latest at the time of writing) cannot start static pods from manifests in --pod-manifest-path=/etc/kubernetes/manifests. Rolling back to v1.18.3 (or building v1.19.0-alpha.0) restores the ability to create static pods. Observed on Fedora CoreOS nodes, but not on Flatcar Linux nodes.
With Kubelet -v=10, these messages look suspect:
Reading config file "/etc/kubernetes/manifests/kube-apiserver.yaml"
Generated UID "3cea85470d942aa9e23a9df789f659d8" pod "kube-apiserver" from /etc/kubernetes/manifests/kube-apiserver.yaml
Generated Name "kube-apiserver-ip-10-0-14-234" for UID "3cea85470d942aa9e23a9df789f659d8" from URL /etc/kubernetes/manifests/kube-apiserver.yaml
Using namespace "kube-system" for pod "kube-apiserver-ip-10-0-14-234" from /etc/kubernetes/manifests/kube-apiserver.yaml
Receiving a new pod "kube-apiserver-ip-10-0-14-234_kube-system(3cea85470d942aa9e23a9df789f659d8)"
Write status for kube-apiserver-ip-10-0-14-234/kube-system: &container.PodStatus{ID:"3cea85470d942aa9e23a9df789f659d8", Name:"kube-apiserver-ip-10-0-14-234", Namespace:"kube-system", IPs:[]string{}, ContainerStatuses:[]*container.ContainerStatus{(*container.ContainerStatus)(0xc000ca42a0)}, SandboxStatuses:[]*v1alpha2.PodSandboxStatus{(*v1alpha2.PodSandboxStatus)(0xc00058ec60)}} (err: <nil>)
Failed to admit pod kube-apiserver-ip-10-0-14-234_kube-system(3cea85470d942aa9e23a9df789f659d8) - Unexpected error while attempting to recover from admission failure: preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: cpu, q: 150), ]
no set of running pods found to reclaim resources: [(res: cpu, q: 150), ]
What you expected to happen:
Kubelet should create static pods as containers via the Docker runtime (visible with sudo docker ps).
How to reproduce it (as minimally and precisely as possible):
Run Kubelet on Fedora CoreOS with static pod manifests in --pod-manifest-path, using the default Docker runtime (a minimal invocation is sketched below). Check docker ps -a to see that no containers are created.
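For reference, a minimal sketch of the Kubelet invocation involved (flags are illustrative; the actual systemd unit differs per install):

```sh
# Illustrative only: approximate Kubelet flags for this setup; the real unit file differs.
kubelet \
  --pod-manifest-path=/etc/kubernetes/manifests \
  --container-runtime=docker \
  --cgroup-driver=systemd \
  -v=10

# On the same host, no static pod containers ever appear:
sudo docker ps -a
```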
Anything else we need to know?:
Rolling the Kubelet back to v1.18.3 immediately allows static pods to be created from the same manifests (same host, no other changes), hinting this is a Kubelet regression. Fedora CoreOS nodes are consistently affected, while Flatcar Linux nodes are not, which suggests the issue relates in some way to interactions with, or assumptions about, the host.
Binary searching and building Kubelets (roughly the workflow sketched after the list below) reveals the issue began with https://github.com/kubernetes/kubernetes/pull/86975:
BAD v1.19.0-alpha.1 and beyond
BAD 7555985346c48b20d2b6662ebbce93827b513be2
BAD 54967fe39367c1ada4c9c4b5c2146263f85a41e4
BAD 3e43b0722a0812c7d333a4557a4c09c32e2d86c3
BAD 4274ea2c89dee24e4c188a71e8164b2a40d1e181
OK a6d0f8e3dc33d897f0fa6cc6ec325a2c333b5bda
OK d00f9c7c1091e31c75c6636500095c4e490b8db8
OK a1ae67d691d514d859fce68299d7bd3830686b38
OK v1.19.0-alpha.0
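Roughly the manual bisect workflow used (commands approximate):

```sh
# Approximate manual bisect: build the Kubelet at each candidate commit and retest on the node.
git clone https://github.com/kubernetes/kubernetes && cd kubernetes
git checkout <commit>
make WHAT=cmd/kubelet
# copy the built kubelet (under _output/) to the node, restart the kubelet service,
# then check `sudo docker ps -a` for static pod containers
```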
Environment:
- Kubernetes version (use kubectl version): v1.19.0-alpha.1 to v1.19.0-beta.1
- Cloud provider or hardware configuration: Any platform / NA
- OS (e.g: cat /etc/os-release): Fedora CoreOS 31.20200517.3.0
- Kernel (e.g. uname -a): Linux ip-10-0-14-234 5.6.11-200.fc31.x86_64
- Install tools: Typhoon
What actually differs between the Fedora CoreOS and Flatcar Linux hosts that's plausibly relevant here:
| OS | Kernel | Docker | Cgroup driver | Problem? |
|---|---|---|---|---|
| Fedora CoreOS 31.20200517.3 | 5.6.11-200 | 18.09.8 | systemd | yes |
| Flatcar Linux 2512.2.0 | 4.19.124-flatcar | 18.06.3-ce | cgroupfs | no |
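One more host-level difference worth checking, given the SMT angle in the title (these checks are my own suggestion, using standard sysfs interfaces):

```sh
# Standard sysfs checks for SMT state and offline logical CPUs (run on each node):
cat /sys/devices/system/cpu/smt/control   # "on", "off", "forceoff", or "notsupported"
cat /sys/devices/system/cpu/online        # logical CPUs currently online, e.g. "0"
cat /sys/devices/system/cpu/offline       # logical CPUs currently offline, e.g. "1"
```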
About this issue
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 22 (17 by maintainers)
@dghubble Thank you for the information, I’ll try to provide a fix for this case today.
Ah, looks much better. @iwankgb thanks!
AWS t3.small
I have the same failure mode for processor 10 on my corp (“gLinux” ~= Debian testing) workstation with the patch.
Great context over there, thanks! Here, I’m not able to kubectl get nodes since this prevents nodes from ever registering with the kube-apiserver (which doesn’t come up, since it’s a static pod).
But it does seem cpu1 is missing the topology directory, which cAdvisor seems to now want according to this comment.
Fedora CoreOS (cpu1 missing topology)
Flatcar Linux (ok)
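A sketch of the kind of check behind the two summaries above (my own commands, not verbatim from the original comment):

```sh
# List which logical CPUs expose a topology directory; on the affected Fedora CoreOS
# node, cpu1 lacks it (consistent with cpu1 being offline / SMT disabled).
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  if [ -d "$cpu/topology" ]; then
    echo "$cpu: topology present"
  else
    echo "$cpu: topology MISSING"
  fi
done
```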
Looks like https://github.com/google/cadvisor/pull/2567 is a candidate change to cAdvisor. I can test it somewhat crudely (running it on-host, not quite how it’s really used, but better than nothing) on the affected Fedora CoreOS node, roughly as sketched below.
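For completeness, roughly what that crude on-host test would look like (commands approximate; build targets and binary paths vary by cAdvisor version, and this isn't how the Kubelet actually consumes cAdvisor):

```sh
# Approximate standalone test of the candidate cAdvisor change on the affected node.
git clone https://github.com/google/cadvisor && cd cadvisor
# fetch and check out the branch from the candidate PR, then build (exact target varies by version)
make build
sudo ./cadvisor &
# Machine info should enumerate CPU topology without erroring on the offline cpu1:
curl -s http://localhost:8080/api/v1.3/machine
```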