kubernetes: Pod graceful deletion hangs if the kubelet already rejected that pod at admission time

What happened?

We observed the following series of events:

  1. Pod is created with some attributes that later result in the pod being rejected at kubelet admission time
  2. The pod is scheduled to a node
  3. The pod is gracefully deleted by a controller, i.e. a deletion timestamp is set on the pod
  4. The pod is rejected at kubelet admission time
  5. The pod never gets deleted and is stuck, despite having the deletion timestamp set. It seems that the kubelet does not issue the final force delete for the pod.

This is problematic in many scenarios, but one specific case where we hit it was a pod backed by a daemonset controller. Because the pod never got deleted in step 5 above, the daemonset controller did not create a replacement pod on the node, so the node ended up not running a replica of the daemonset.

This seems to be a regression in k8s 1.27; I was not able to repro this behavior in 1.26.

What did you expect to happen?

It is expected that during the series of events above, once the pod fails kubelet admission and already has a deletion timestamp set, the kubelet should forcefully delete the pod. The pod should not hang in deletion forever.

How can we reproduce it (as minimally and precisely as possible)?

Create a 1.27 kind cluster:

kind delete cluster


kind_config="$(cat << EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  ipFamily: ipv4
nodes:
# the control plane node
- role: control-plane
- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        v: "4"
        read-only-port: "10255"
EOF
)"

kind create cluster --config <(printf '%s\n' "${kind_config}") --image  kindest/node:v1.27.1@sha256:b7d12ed662b873bd8510879c1846e87c7e676a79fefc93e17b2a52989d3ff42b
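
Optionally verify that both nodes registered before continuing (kind sets the current kubectl context to the new cluster by default):

kubectl get nodes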

Stop kubelet

$ docker exec -it kind-worker /bin/bash
root@kind-worker:/# systemctl stop kubelet
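
Optionally confirm the kubelet is actually stopped before creating the pod:

root@kind-worker:/# systemctl is-active kubelet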

Create a pod bound to the node that will trigger an admission error. The node selector label below does not exist, so we expect the pod to be rejected at admission.

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-sleep-2
spec:
  nodeSelector:
    this-label: does-not-exist
  nodeName: "kind-worker"
  containers:
  - name: ubuntu
    image:  gcr.io/google-containers/ubuntu:14.04
    command: ["/bin/sleep"]
    args: ["infinity"]
    resources:
      requests:
        cpu: "1"
        memory: "10Mi"
      limits:
        cpu: "1"
        memory: "10Mi"
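
Apply the manifest (saved here as pod.yaml; the file name is arbitrary):

kubectl apply -f pod.yaml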

While the kubelet is down, issue a graceful deletion to set a deletion timestamp on the pod

kubectl delete pod ubuntu-sleep-2
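
Because the kubelet is stopped, the pod cannot actually be removed, so the command above blocks. Either run it in a separate terminal or pass --wait=false so kubectl returns as soon as the deletion timestamp is set:

kubectl delete pod ubuntu-sleep-2 --wait=false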

Verify deletion timestamp is set (pod is pending)

$ k get pod ubuntu-sleep-2 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"ubuntu-sleep-2","namespace":"default"},"spec":{"containers":[{"args":["infinity"],"command":["/bin/sleep"],"image":"gcr.io/google-containers/ubuntu:14.04","name":"ubuntu","resources":{"limits":{"cpu":"1","memory":"10Mi"},"requests":{"cpu":"1","memory":"10Mi"}}}],"nodeName":"kind-worker","nodeSelector":{"cloud.google.com/gke-nodepool":"default-pool"}}}
  creationTimestamp: "2023-06-05T18:44:09Z"
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2023-06-05T18:44:58Z" <---- deletion timestamp is set
  name: ubuntu-sleep-2
  namespace: default
  resourceVersion: "570"
  uid: 5a8cbd76-f425-46a3-bffb-4cc4eec12557
spec:
  containers:
  - args:
    - infinity
    command:
    - /bin/sleep
    image: gcr.io/google-containers/ubuntu:14.04
    imagePullPolicy: IfNotPresent
    name: ubuntu
    resources:
      limits:
        cpu: "1"
        memory: 10Mi
      requests:
        cpu: "1"
        memory: 10Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-5c69j
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: kind-worker
  nodeSelector:
    cloud.google.com/gke-nodepool: default-pool
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-5c69j
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  phase: Pending
  qosClass: Guaranteed
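
Instead of dumping the whole object, the same check can be done with a jsonpath query, for example:

$ kubectl get pod ubuntu-sleep-2 -o jsonpath='{.metadata.deletionTimestamp}'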

Start kubelet

$ docker exec -it kind-worker /bin/bash
root@kind-worker:/# systemctl start kubelet

Observe the pod status; the pod is now in the Failed phase (as expected).


$ k get pod ubuntu-sleep-2 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"ubuntu-sleep-2","namespace":"default"},"spec":{"containers":[{"args":["infinity"],"command":["/bin/sleep"],"image":"gcr.io/google-containers/ubuntu:14.04","name":"ubuntu","resources":{"limits":{"cpu":"1","memory":"10Mi"},"requests":{"cpu":"1","memory":"10Mi"}}}],"nodeName":"kind-worker","nodeSelector":{"cloud.google.com/gke-nodepool":"default-pool"}}}
  creationTimestamp: "2023-06-05T18:44:09Z"
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2023-06-05T18:44:58Z"
  name: ubuntu-sleep-2
  namespace: default
  resourceVersion: "653"
  uid: 5a8cbd76-f425-46a3-bffb-4cc4eec12557
spec:
  containers:
  - args:
    - infinity
    command:
    - /bin/sleep
    image: gcr.io/google-containers/ubuntu:14.04
    imagePullPolicy: IfNotPresent
    name: ubuntu
    resources:
      limits:
        cpu: "1"
        memory: 10Mi
      requests:
        cpu: "1"
        memory: 10Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-5c69j
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: kind-worker
  nodeSelector:
    cloud.google.com/gke-nodepool: default-pool
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-5c69j
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  message: 'Pod was rejected: Predicate NodeAffinity failed'
  phase: Failed
  reason: NodeAffinity
  startTime: "2023-06-05T18:44:52Z"
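
The phase and rejection reason can also be pulled out directly, for example:

$ kubectl get pod ubuntu-sleep-2 -o jsonpath='{.status.phase} {.status.reason}'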

However, the kubectl delete pod command hangs and the pod is never force deleted by the kubelet.
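
A plain get keeps returning the pod, confirming it was never removed from the API server:

$ kubectl get pod ubuntu-sleep-2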

Anything else we need to know?

No response

Kubernetes version

1.27

Cloud provider

n/a

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, …) and versions (if applicable)

Most upvoted comments

I see, going to open a dedicated DS Issue to move the discussion on priority and the shape of the fix there. Let’s focus here on the Kubelet.

Hi, are there any workarounds in 1.27?

The fix was cherry-picked for the 1.27 branch in https://github.com/kubernetes/kubernetes/pull/118841, so it should be included in 1.27.4.
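
For clusters still running an affected 1.27 patch release, a generic manual workaround (standard kubectl behavior, not advice taken from this thread) is to force delete the stuck pod, which removes it from the API server without waiting for kubelet confirmation:

kubectl delete pod ubuntu-sleep-2 --force --grace-period=0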

/retitle Pod graceful deletion hangs if the kubelet already rejected that pod at admission time

There are 2 admission phases for Pods (API server admission, kubelet admission). This issue is about the second phase only.

Should it be? I think right now our answer is no - but if we moved admission closer to the pod worker, we'd say yes? Pros and cons? We start pod workers for terminating a pod - is rejecting a pod a type of termination?

Long term this might be the best option - having every path go through the pod worker and the regular pod termination flow (i.e. syncTerminating/syncTerminatedPod) instead of special-casing situations like admission rejection may be simpler to reason about.

For the short term, why wouldn't HandlePodCleanups manage this case? It's the "resync" action for the kubelet as controller, and so it's nominally responsible for forcing all other components (pod worker, status manager, runtime) back to the expected state. Should it dispatch a TerminatePod call for these rejected pods that are not yet deleted, not known to the pod worker, and not known to the status manager?

Adding this to HandlePodCleanups is definitely one option and I think we could expand it to cover this case.

The only issue I see with handling it there is that there could potentially be a large number of pods that are in a terminal phase, do not have a deletion timestamp set, and are unknown to the pod worker and the pod runtime. For each of these pods, we would have to call TerminatePod in every iteration of HandlePodCleanups until someone eventually sets the deletion timestamp and the pod is deleted. Looking at the code, it should short-circuit here, but we would need to confirm.

@mimowo and I were also considering two other options:

  1. Call TerminatePod during kubelet rejection. I think it should be fine overall, but it carries a risk: after a kubelet restart, if a previous pod is still present in the runtime and running, the pod could be deleted by the status manager before the orphaned pod is terminated. https://github.com/kubernetes/kubernetes/pull/118599
  2. Call syncPodKill from the pod worker during kubelet rejection. It goes through the pod worker and checks that all containers are terminated, so it should avoid the issue of the first approach. https://github.com/kubernetes/kubernetes/pull/118614

It sounds like option 2 or adding this to HandlePodCleanups are the optimal paths forward.

Curious to get your thoughts on the options here.