kubernetes: 1.22 regression: removing and recreating static pod manifest leaves pod in error state

What happened:

  • Started a 1.22 cluster
  • Added a static pod manifest
  • Observed the pod run successfully
  • Removed and recreated the pod manifest
  • Observed the pod enter an error state and never restart successfully

What you expected to happen:

For the pod to be re-run successfully (as in previous releases)

How to reproduce it (as minimally and precisely as possible):

# start a cluster
hack/local-up-cluster.sh

# create a static pod
echo '
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  terminationGracePeriodSeconds: 1
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "echo $RANDOM; sleep 100000"]
' > mypod.yaml

sudo cp mypod.yaml /var/run/kubernetes/static-pods/

# wait until the pod is running
watch kubectl get pods -n default

# remove and recreate the static pod
sudo rm /var/run/kubernetes/static-pods/mypod.yaml
sudo cp mypod.yaml /var/run/kubernetes/static-pods/

# observe the pod remain in an error state
watch kubectl get pods -n default

Anything else we need to know?:

Bisected to 3eadd1a9ead7a009a9abfbd603a5efd0560473cc (https://github.com/kubernetes/kubernetes/pull/102344)

Broken in 1.22, works in 1.21 and previous versions.

/sig node cc @smarterclayton @bobbypage @rphillips

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 31 (31 by maintainers)

Most upvoted comments

How about including the file modification time when calculating the static pod’s uid? It would prevent this issue, and the calculated uid would be stable across kubelet restarts.

When a file is moved out of a directory and moved back in after some time (which is effectively a static pod restart), or when it is renamed, its modification time remains the same. Thus the modification time would not contribute to the uniqueness of the uid. This is consistent on Linux and Windows.

“Change time” (st_ctime in POSIX stat) could be considered, but it is a no-op on Windows, as there is no “change time” concept there.

Another example of OS differences: the kubelet periodically re-opening a file and checking its contents means “access time” (st_atime) should be changing on Linux. On Windows/NTFS, however, updating it is disabled by default for performance reasons, so the system admin has to enable a feature before “access time” changes.

Both “access time” and “change time” are missing from Go’s os.FileInfo because they are not portable concepts.
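
The point about moves and renames preserving the modification time can be checked with a small script. This is a minimal sketch in Python rather than kubelet code; the directory names and the `mtime_survives_move` helper are made up for illustration, and it assumes both directories are on the same filesystem.

```python
import os
import tempfile


def mtime_survives_move() -> bool:
    """Return True if moving a file out of a directory and back keeps its mtime."""
    with tempfile.TemporaryDirectory() as root:
        manifests = os.path.join(root, "static-pods")  # hypothetical manifest dir
        parking = os.path.join(root, "elsewhere")      # hypothetical holding dir
        os.mkdir(manifests)
        os.mkdir(parking)

        path = os.path.join(manifests, "mypod.yaml")
        with open(path, "w") as f:
            f.write("kind: Pod\n")

        # Pin atime/mtime to a fixed point in the past, then read mtime back,
        # so the comparison does not depend on timestamp granularity.
        pinned_ns = 1_600_000_000 * 10**9
        os.utime(path, ns=(pinned_ns, pinned_ns))
        before = os.stat(path).st_mtime_ns

        moved = os.path.join(parking, "mypod.yaml")
        os.rename(path, moved)  # "remove" the static pod manifest
        os.rename(moved, path)  # "recreate" it by moving the same file back
        after = os.stat(path).st_mtime_ns

        return before == after


if __name__ == "__main__":
    print("mtime unchanged after move:", mtime_survives_move())
```

Note that the reproduction steps in this issue use `rm` followed by `cp`, which writes a new file; the sketch only covers the move/rename case raised in the comment.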

https://github.com/kubernetes/kubernetes/pull/104847 is now updated with what I believe is a more complete fix for this issue, but I’m still testing the assumptions in the Kubelet.

This seems likely to be caused by the pod cleanup being keyed by uid and not handling an overlapping delete and recreate of a pod with the same uid.

The calculated uid for static pods consists of:

  • node name
  • file name
  • pod content

https://github.com/kubernetes/kubernetes/blob/edb0a72cff0e43bab72a02cada8486d562ee1cd5/pkg/kubelet/config/common.go#L57-L70

Removing and then replacing the manifest resulted in a pod with an identical computed uid.
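
To illustrate why the uid comes out identical, here is a rough sketch of a uid derived only from the three inputs listed above. It is not the kubelet implementation: the hash function, the `host:`/`file:` prefixes, and hashing the raw manifest text (rather than the parsed pod object) are all assumptions made for the example, and `static_pod_uid` is a hypothetical name.

```python
import hashlib


def static_pod_uid(node_name: str, file_name: str, pod_content: str) -> str:
    """Derive a deterministic uid from node name, file name, and pod content.

    Illustrative only: the real kubelet hashes the decoded pod object, and the
    exact hash and field layout here are assumptions.
    """
    hasher = hashlib.md5()
    hasher.update(pod_content.encode("utf-8"))
    hasher.update(f"host:{node_name}".encode("utf-8"))
    hasher.update(f"file:{file_name}".encode("utf-8"))
    return hasher.hexdigest()


if __name__ == "__main__":
    manifest = "kind: Pod\napiVersion: v1\nmetadata:\n  name: mypod\n"
    path = "/var/run/kubernetes/static-pods/mypod.yaml"

    first = static_pod_uid("node-1", path, manifest)
    # Removing the file and copying the same manifest back changes none of
    # the inputs, so the recreated pod gets the same uid as the deleted one.
    second = static_pod_uid("node-1", path, manifest)
    print(first == second)
```

Nothing time-dependent or random feeds into the value, so a delete followed by an identical recreate is indistinguishable by uid from the pod that is still being cleaned up.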

Changing any of the manifest inputs to the pod uid calculation resulted in the second pod starting successfully:

  • removing the first manifest and creating an identical manifest with a different filename
  • removing the first manifest and creating a manifest with the same name with a slightly different pod (e.g. a random annotation inserted)

Waiting for the pod deletion to complete (typically a few seconds) before re-adding the manifest also worked, even though the resulting pod UID was identical.
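
The "wait for deletion, then re-add" observation could be scripted roughly as follows. This is a sketch under assumptions, not a recommended fix: `wait_until_gone` and `mirror_pod_exists` are hypothetical helpers, the check relies on `kubectl get pod` exiting non-zero when the pod is not found (it also exits non-zero on connection errors), and the pod disappearing from the API is used here only as a proxy for the kubelet having finished its cleanup.

```python
import subprocess
import time
from typing import Callable


def wait_until_gone(
    pod_exists: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
) -> bool:
    """Poll pod_exists() until it returns False. Return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not pod_exists():
            return True
        time.sleep(interval_s)
    # One last check once the deadline has passed.
    return not pod_exists()


def mirror_pod_exists(name: str, namespace: str = "default") -> bool:
    """Report whether kubectl can still find the named pod."""
    result = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

A caller would remove the manifest, call something like `wait_until_gone(lambda: mirror_pod_exists("mypod-<node name>"))` with the pod name shown by `kubectl get pods`, and copy the manifest back only once it returns `True`.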