kubernetes: 1.22 regression: removing and recreating static pod manifest leaves pod in error state
What happened:
- Started a 1.22 cluster
- Added a static pod manifest
- Observed the pod run successfully
- Removed and recreated the pod manifest
- Observed the pod enter an error state and never restart successfully
What you expected to happen:
For the pod to be re-run successfully (as in previous releases)
How to reproduce it (as minimally and precisely as possible):
# start a cluster
hack/local-up-cluster.sh
# create a static pod
echo '
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  terminationGracePeriodSeconds: 1
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "echo $RANDOM; sleep 100000"]
' > mypod.yaml
sudo cp mypod.yaml /var/run/kubernetes/static-pods/
# wait until the pod is running
watch kubectl get pods -n default
# remove and recreate the static pod
sudo rm /var/run/kubernetes/static-pods/mypod.yaml
sudo cp mypod.yaml /var/run/kubernetes/static-pods/
# observe the pod remain in an error state
watch kubectl get pods -n default
Anything else we need to know?:
Bisected to 3eadd1a9ead7a009a9abfbd603a5efd0560473cc (https://github.com/kubernetes/kubernetes/pull/102344)
Broken in 1.22, works in 1.21 and previous versions.
/sig node cc @smarterclayton @bobbypage @rphillips
About this issue
- State: closed
- Created 3 years ago
- Comments: 31 (31 by maintainers)
When moving a file away from a directory and moving it back in after some time (which is effectively a static pod restart), or when renaming it, the modification time remains the same. Thus modification time would not contribute to the uniqueness of the UUID. This is consistent on Linux and Windows.
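The modification-time behavior described above can be checked with a small Go program. This is only a sketch under stated assumptions: the function name modTimeSurvivesMove, the temporary directory layout, and the file contents are hypothetical, and both directories are assumed to live on the same filesystem so that os.Rename is a plain rename rather than a copy.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// modTimeSurvivesMove (hypothetical helper) writes a manifest into a
// "static-pods" directory, moves it out and back in, and reports whether
// the modification time seen through os.FileInfo is unchanged.
func modTimeSurvivesMove() (bool, error) {
	root, err := os.MkdirTemp("", "static-pod-demo")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(root)

	watched := filepath.Join(root, "static-pods")
	elsewhere := filepath.Join(root, "elsewhere")
	for _, dir := range []string{watched, elsewhere} {
		if err := os.Mkdir(dir, 0o755); err != nil {
			return false, err
		}
	}

	manifest := filepath.Join(watched, "mypod.yaml")
	if err := os.WriteFile(manifest, []byte("kind: Pod\n"), 0o644); err != nil {
		return false, err
	}
	before, err := os.Stat(manifest)
	if err != nil {
		return false, err
	}

	// Move the manifest away, wait briefly, then move it back.
	parked := filepath.Join(elsewhere, "mypod.yaml")
	if err := os.Rename(manifest, parked); err != nil {
		return false, err
	}
	time.Sleep(50 * time.Millisecond)
	if err := os.Rename(parked, manifest); err != nil {
		return false, err
	}

	after, err := os.Stat(manifest)
	if err != nil {
		return false, err
	}
	return before.ModTime().Equal(after.ModTime()), nil
}

func main() {
	same, err := modTimeSurvivesMove()
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("modification time unchanged after move:", same)
}
```

ModTime is the only timestamp os.FileInfo exposes portably, which is why the sketch compares nothing else.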
“Change time”, or st_ctime of the POSIX stat, can be considered, but it is a no-op on Windows as there is no “change time” concept there.
Another example of OS differences is that the kubelet periodically re-opening a file and checking its contents means “access time” (st_atime) should be changing on Linux. But on Windows/NTFS the property is disabled by default for performance reasons, meaning the system admin has to first enable a feature for “access time” to change.
Both “access time” and “change time” are missing in Go’s os.FileInfo because they are not portable concepts.
https://github.com/kubernetes/kubernetes/pull/104847 is now updated with what I believe is a more complete fix for this issue, but I’m still testing the assumptions in the Kubelet.
https://docs.google.com/document/d/1NJKYNgoXZKGS5la4MvGrB42-DNQSFf4JQkEcZRnrDzA/edit# is the doc that captures all of the implications here.
/assign @smarterclayton @rphillips
This seems likely to be caused by the pod cleanup being keyed by UID and not handling an overlapping delete and recreate of a pod with the same UID.
The calculated UID for static pods consists of:
https://github.com/kubernetes/kubernetes/blob/edb0a72cff0e43bab72a02cada8486d562ee1cd5/pkg/kubelet/config/common.go#L57-L70
Removing and then replacing the manifest resulted in a pod with an identical computed UID.
Changing any of the manifest inputs to the pod UID calculation resulted in the second pod starting successfully.
Also, waiting for the pod deletion to complete (typically takes a few seconds) and then re-adding the manifest, which results in an identical pod UID, worked.
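To illustrate why removing and recreating an unchanged manifest yields the same UID, here is a minimal Go sketch of a deterministic UID derived from a hash. It is not the kubelet's code: staticPodUID is a hypothetical helper, and the choice of inputs (raw manifest bytes, node name, manifest file path) and of MD5 are assumptions made for illustration. The linked common.go is the authoritative definition.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// staticPodUID is a hypothetical, simplified stand-in for a static pod UID
// calculation. It assumes the UID depends only on the manifest contents,
// the node name, and the manifest file path.
func staticPodUID(manifest []byte, nodeName, sourcePath string) string {
	hasher := md5.New()
	hasher.Write(manifest)
	fmt.Fprintf(hasher, "host:%s", nodeName)
	fmt.Fprintf(hasher, "file:%s", sourcePath)
	return hex.EncodeToString(hasher.Sum(nil))
}

func main() {
	manifest := []byte("kind: Pod\napiVersion: v1\nmetadata:\n  name: mypod\n")
	path := "/var/run/kubernetes/static-pods/mypod.yaml"

	first := staticPodUID(manifest, "node-a", path)
	// Removing and recreating the same file changes none of the inputs.
	recreated := staticPodUID(manifest, "node-a", path)
	// Changing any input, here the file name, produces a different UID.
	renamed := staticPodUID(manifest, "node-a", "/var/run/kubernetes/static-pods/mypod2.yaml")

	fmt.Println("recreated has same UID:", first == recreated)
	fmt.Println("renamed has same UID:  ", first == renamed)
}
```

Because none of the assumed inputs capture when the file was created, a recreated manifest cannot be told apart from the pod that is still being cleaned up, which matches the observation that changing an input or waiting for deletion to finish avoids the error.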