kubernetes: kubelet (1.27.1) fails to mount a hostPath volume for a static pod
What happened?
This is a bit hard to reproduce, as it only happens in a very specific setup. The problem occurs with kubelet
1.27.1; downgrading to 1.26.4 fixes it.
In a single-node cluster, control-plane pods run as static pods, Rook/Ceph is deployed, and some PVs/PVCs are backed by Ceph.
After an unclean shutdown caused by some Ceph-backed PVs not being unmounted (this is not a Kubernetes issue), the kubelet
that comes up after the reboot doesn’t start the control-plane static pods, which in turn blocks every other pod from starting (as the API server is down).
After a reboot, kubelet is running, but it keeps reporting an error about static pods like:
172.20.0.2: {"ts":1683110056228.5916,"caller":"kubelet/kubelet.go:1875","msg":"Unable to attach or mount volumes for pod; skipping pod","pod":{"name":"kube-apiserver-talos-default-controlplane-1","namespace":"kube-system"},"err":"unmounted volumes=[audit secrets config], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition"}
All of the volumes are hostPath volumes in the static pod:
volumes:
  - hostPath:
      path: /system/secrets/kubernetes/kube-apiserver
    name: secrets
  - hostPath:
      path: /system/config/kubernetes/kube-apiserver
    name: config
  - hostPath:
      path: /var/log/audit/kube
    name: audit
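For context, the volumes above are consumed by a static pod manifest placed in the kubelet’s staticPodPath. A minimal sketch of such a manifest (pod name, image tag, and mount paths here are illustrative, not the actual Talos manifest):

```yaml
# Hypothetical minimal static pod using hostPath volumes, placed in the
# kubelet's staticPodPath (e.g. /etc/kubernetes/manifests on many distros).
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
    - name: kube-apiserver
      image: registry.k8s.io/kube-apiserver:v1.27.1  # illustrative image/tag
      volumeMounts:
        - name: secrets
          mountPath: /system/secrets/kubernetes/kube-apiserver
        - name: config
          mountPath: /system/config/kubernetes/kube-apiserver
        - name: audit
          mountPath: /var/log/audit/kube
  volumes:
    - name: secrets
      hostPath:
        path: /system/secrets/kubernetes/kube-apiserver
    - name: config
      hostPath:
        path: /system/config/kubernetes/kube-apiserver
    - name: audit
      hostPath:
        path: /var/log/audit/kube
```

Static pods are managed directly by the kubelet without the API server, which is why a kubelet that refuses to mount their volumes takes the whole control plane down with it.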
The kubelet is stuck in this state “forever”; I tested for at least 8 hours.
If I downgrade the kubelet (without a machine reboot) to 1.26.4, it immediately starts the static pods, bringing the node back online.
If I try the same scenario, with a reboot, on kubelet 1.26.4, the problem doesn’t occur either.
What did you expect to happen?
Kubelet should run static pods after a reboot.
How can we reproduce it (as minimally and precisely as possible)?
It’s a fairly complicated set of workloads, but the issue is 100% reproducible. I can provide more logs if needed.
Anything else we need to know?
/sig node
Kubernetes version
Cloud provider
OS version
Talos Linux v1.5.0-alpha
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 5
- Comments: 26 (20 by maintainers)
Commits related to this issue
- Add test for starting kubelet with a CSI volume mounted To test https://github.com/kubernetes/kubernetes/issues/117745, restart kubelet with a CSI volume mounted *and* the API server running as a sta... — committed to jsafrane/kubernetes by jsafrane a year ago
- fix: downgrade kubernetes due to NewVolumeManagerReconstruction issue https://sysdig.com/blog/kubernetes-1-27-whats-new/ https://github.com/kubernetes/kubernetes/issues/117745 — committed to cfergs/kubernetes-homelab by cfergs a year ago
Found a reasonably simple reproducer: restart the kubelet while a CSI volume is mounted (rook-ceph.rbd.csi.ceph.com in this report, csi-driver-hostpath in my lab).
Apparently, the CSI volume reconstruction tries to reach the API server, which is bad, especially if the API server is a static pod. Investigating further.
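Under the assumption that any CSI driver triggers the reconstruction path, the workload side of the reproducer can be sketched with objects like the following (names and the StorageClass are illustrative, loosely following the csi-driver-hostpath examples). With this pod running, restarting the kubelet while the API server runs as a static pod should hit the hang:

```yaml
# Hypothetical PVC + pod backed by a CSI driver; keep the pod running,
# then restart the kubelet to exercise CSI volume reconstruction.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-hostpath-sc  # assumption: SC name from the hostpath driver examples
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-test
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: csi-pvc
```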
I did some bisecting, and it turns out the breaking change was introduced between versions:
Which traces it down to commit 2c8f63f693d75059f03c5335394883c3349c39ce. I’m not quite sure how this commit is related (?)
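That commit is part of the NewVolumeManagerReconstruction work (the feature gate that went beta, i.e. enabled by default, in 1.27, and which one of the linked commits above names as the culprit). As a hedged workaround sketch, the gate can be switched off in the kubelet configuration; the gate name is from the linked commit history, the rest of the file layout is a standard KubeletConfiguration:

```yaml
# Sketch of a kubelet config disabling the new volume reconstruction path.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  NewVolumeManagerReconstruction: false
```

This only sidesteps the regression; the actual fix landed upstream in later releases.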
I checked with this fix: Kubernetes v1.27.1-31+33287eee36a861, and it doesn’t solve the problem.
@smira thanks for the report and the bisection… that’s super-helpful