kubernetes: Some static pods fail to start on K8S 1.22 and 1.23

What happened?

In Kubernetes 1.23 we are seeing some static pods fail to start. The failures occur when a static pod is removed and re-added in quick succession, which leaves the static pod not started.

We need to revert kubernetes/kubernetes#104743.

cc @gjkim42

What did you expect to happen?

All static pods should restart.

How can we reproduce it (as minimally and precisely as possible)?

diff --git a/pkg/kubelet/pod_workers_test.go b/pkg/kubelet/pod_workers_test.go
index 4028c06c292..0e852594a40 100644
--- a/pkg/kubelet/pod_workers_test.go
+++ b/pkg/kubelet/pod_workers_test.go
@@ -880,3 +880,34 @@ func Test_allowPodStart(t *testing.T) {
 		})
 	}
 }
+
+func TestUpdatePodWithQuickAddRemoveStaticPod(t *testing.T) {
+	podWorkers, _ := createPodWorkers()
+	staticPodA := newStaticPod("0000-0000-0000", "running-static-pod")
+	staticPodB := newStaticPod("0000-0000-0000", "running-static-pod")
+
+	podWorkers.UpdatePod(UpdatePodOptions{
+		Pod:        staticPodA,
+		UpdateType: kubetypes.SyncPodCreate,
+	})
+
+	podWorkers.UpdatePod(UpdatePodOptions{
+		Pod:        staticPodA,
+		UpdateType: kubetypes.SyncPodKill,
+	})
+
+	drainAllWorkers(podWorkers)
+
+	podWorkers.UpdatePod(UpdatePodOptions{
+		Pod:        staticPodB,
+		UpdateType: kubetypes.SyncPodCreate,
+	})
+
+	drainAllWorkers(podWorkers)
+
+	t.Logf("rphillips waitingToStartStaticPodsByFullName=%v", podWorkers.waitingToStartStaticPodsByFullname)
+	if status, ok := podWorkers.podSyncStatuses["0000-0000-0000"]; ok {
+		t.Logf("rphillips podSyncStatuses=%+v", status)
+	}
+	t.Logf("rphillips startedStaticPodsByFullname=%v", podWorkers.startedStaticPodsByFullname)
+}

Anything else we need to know?

No response

Kubernetes version

1.23

Cloud provider

All

OS version

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, …) and versions (if applicable)

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 25 (22 by maintainers)

Most upvoted comments

@rphillips

Let me align all PRs to be merged to fix this issue.

Thanks to @TeddyAndrieux, I can reproduce this bug quite easily.

  1. Create a static pod that does not terminate within terminationGracePeriodSeconds.
  2. Change the static pod manifest while keeping the same name (e.g. I simply change its .spec.containers[0].resources.requests.cpu).
  3. Change the static pod manifest several more times before the pod restarts (that is, while there is no running static pod with the same full name).

Then, the static pod with the same name never starts again.

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  terminationGracePeriodSeconds: 10
  containers:
  - args:
    - -c
    - trap "sleep inf" SIGINT SIGTERM SIGHUP &&  nc -k -l -p 4444
    command:
    - /bin/sh
    image: docker.io/library/busybox:latest
    imagePullPolicy: IfNotPresent
    name: test
    ports:
    - name: test
      containerPort: 4444
      hostPort: 4444
    resources:
      requests:
        cpu: 100m # change it simply
        memory: 10Mi
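
The repeated manifest edits in the steps above can be scripted. The following is a minimal sketch, not something from the thread: the helper names `setCPURequest` and `churn`, the number of rounds, and the pause length are all hypothetical. The manifest path is passed as an argument because the static pod directory depends on how the kubelet is configured.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
	"time"
)

// cpuRequest matches a "cpu: <n>m" entry such as the one in the manifest above.
var cpuRequest = regexp.MustCompile(`cpu: \d+m`)

// setCPURequest returns the manifest with every "cpu: <n>m" value set to milli.
// Anything else on the line (for example a trailing comment) is left untouched.
func setCPURequest(manifest string, milli int) string {
	return cpuRequest.ReplaceAllString(manifest, fmt.Sprintf("cpu: %dm", milli))
}

// churn rewrites the manifest at path the given number of times, using a new
// CPU request on each round and sleeping for pause between writes.
// Hypothetical helper: rounds and pause must be tuned by hand so that the
// edits land while no pod with the same full name is running.
func churn(path string, rounds int, pause time.Duration) error {
	for i := 1; i <= rounds; i++ {
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		updated := setCPURequest(string(data), 100+i)
		if err := os.WriteFile(path, []byte(updated), 0o600); err != nil {
			return err
		}
		time.Sleep(pause)
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		// No manifest given: only show what a single edit looks like.
		fmt.Print(setCPURequest("cpu: 100m # change it simply\n", 101))
		return
	}
	if err := churn(os.Args[1], 5, 2*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Run it as root on the node with the manifest path as the only argument. Whether the hang appears still depends on timing, so several attempts may be needed.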

@TeddyAndrieux Can you explain how we can reproduce the problem? It is hard for me to reproduce the issue.

It was a bit tricky, but I managed to reproduce it by deploying a single node, editing the kube-apiserver static pod manifest a first time, and waiting for the pod to be deleted. Once it is deleted and not yet re-created, I edit the manifest a second time; then, if you are (un)lucky, the pod never restarts.

Not a super easy way to reproduce, I agree 😄

Using this “method” I managed to reproduce the issue several times with kubelet 1.22.5; after I downgraded to 1.22.4, I could not.
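
The thread does not spell out the root cause, so the following is only a toy model of one way this timing could leave a pod stuck. It assumes that static pods with the same full name start strictly in queue order and that a pod killed before it ever started stays in the queue. It is not the kubelet code: `startGate`, its methods, the `skipKilled` flag, the `test_default` key, and the UIDs are all hypothetical.

```go
package main

import "fmt"

// startGate is a toy stand-in for per-full-name static pod bookkeeping.
// It is NOT the kubelet implementation; every name here is hypothetical.
type startGate struct {
	started map[string]string   // full name -> UID that currently holds the slot
	waiting map[string][]string // full name -> UIDs queued to start, oldest first
	killed  map[string]bool     // UIDs whose termination was requested
}

func newStartGate() *startGate {
	return &startGate{
		started: map[string]string{},
		waiting: map[string][]string{},
		killed:  map[string]bool{},
	}
}

// enqueue records that uid wants to start under fullname.
func (g *startGate) enqueue(fullname, uid string) {
	g.waiting[fullname] = append(g.waiting[fullname], uid)
}

// kill marks uid as terminating. It deliberately leaves the queue untouched.
func (g *startGate) kill(uid string) { g.killed[uid] = true }

// allowStart reports whether uid may start now. When skipKilled is false,
// a killed UID at the head of the queue blocks every UID behind it.
func (g *startGate) allowStart(fullname, uid string, skipKilled bool) bool {
	if cur, ok := g.started[fullname]; ok {
		return cur == uid
	}
	queue := g.waiting[fullname]
	for len(queue) > 0 {
		head := queue[0]
		if skipKilled && g.killed[head] {
			queue = queue[1:] // drop a stale entry that will never start
			continue
		}
		if head != uid {
			g.waiting[fullname] = queue
			return false // another UID is next in line
		}
		queue = queue[1:] // uid is next; take it off the queue
		break
	}
	g.waiting[fullname] = queue
	g.started[fullname] = uid
	return true
}

// scenario replays two quick manifest edits and reports whether the newest
// pod is allowed to start.
func scenario(skipKilled bool) bool {
	g := newStartGate()
	g.enqueue("test_default", "uid-b") // first edit: replacement pod is queued
	g.kill("uid-b")                    // second edit arrives before uid-b starts
	g.enqueue("test_default", "uid-c") // newest replacement pod is queued
	return g.allowStart("test_default", "uid-c", skipKilled)
}

func main() {
	fmt.Println("stale entries kept:   ", scenario(false))
	fmt.Println("stale entries skipped:", scenario(true))
}
```

In this model the newest pod is refused for as long as the killed entry stays at the head of the queue, and is admitted once killed entries are skipped. Under these assumptions, that would match a pod that never restarts after a second edit lands between deletion and re-creation.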

@TeddyAndrieux my mistake, I just saw the 1.22 backport at https://github.com/kubernetes/kubernetes/pull/106394 - you are correct. Was going to drop a line here but you beat me to it 😃

@gjkim42 We’ll merge the one-liner in our CI and see if it fixes the issue.