kubernetes: Pod graceful deletion hangs if the kubelet already rejected that pod at admission time
What happened?
We observed the following series of events:
- Pod is created with some attributes that later result in the pod being rejected at kubelet admission time
- The pod is scheduled to a node
- The pod is gracefully deleted by a controller, i.e. a deletion timestamp is set on the pod
- The pod is rejected at kubelet admission time
- The pod never gets deleted and is stuck, despite having the deletion timestamp set. It seems that the kubelet never issues the final force delete for the pod.
This is problematic in many scenarios, but one specific case where we hit it was a pod backed by a DaemonSet. Because the pod never got deleted in step 5 above, the DaemonSet controller did not create a replacement pod for the node, so the node ended up not running a replica of the DaemonSet.
This seems to be a regression in k8s 1.27; I was not able to reproduce this behavior in 1.26.
What did you expect to happen?
During the series of events above, after the pod fails kubelet admission, the kubelet is expected to force delete the pod, since the pod already has a deletion timestamp set. The pod should not hang in deletion forever.
How can we reproduce it (as minimally and precisely as possible)?
Create a 1.27 kind cluster:
kind delete cluster
kind_config="$(cat << EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  ipFamily: ipv4
nodes:
# the control plane node
- role: control-plane
- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        v: "4"
        read-only-port: "10255"
EOF
)"
kind create cluster --config <(printf '%s\n' "${kind_config}") --image  kindest/node:v1.27.1@sha256:b7d12ed662b873bd8510879c1846e87c7e676a79fefc93e17b2a52989d3ff42b
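Optionally, confirm both nodes are registered and Ready before continuing:
$ kubectl get nodes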
Stop the kubelet on the worker node:
$ docker exec -it kind-worker /bin/bash
root@kind-worker:/# systemctl stop kubelet
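Optionally, confirm the kubelet is really stopped before creating the pod (it should report inactive):
root@kind-worker:/# systemctl is-active kubelet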
Create a pod bound to the node that will trigger an admission error. The node selector label below does not exist on the node, so we expect the pod to be rejected at kubelet admission.
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-sleep-2
spec:
  nodeSelector:
    this-label: does-not-exist
  nodeName: "kind-worker"
  containers:
  - name: ubuntu
    image:  gcr.io/google-containers/ubuntu:14.04
    command: ["/bin/sleep"]
    args: ["infinity"]
    resources:
      requests:
        cpu: "1"
        memory: "10Mi"
      limits:
        cpu: "1"
        memory: "10Mi"
While the kubelet is down, issue a graceful deletion to set a deletion timestamp on the pod:
kubectl delete pod ubuntu-sleep-2
Verify the deletion timestamp is set (the pod is still Pending):
$ k get pod ubuntu-sleep-2 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"ubuntu-sleep-2","namespace":"default"},"spec":{"containers":[{"args":["infinity"],"command":["/bin/sleep"],"image":"gcr.io/google-containers/ubuntu:14.04","name":"ubuntu","resources":{"limits":{"cpu":"1","memory":"10Mi"},"requests":{"cpu":"1","memory":"10Mi"}}}],"nodeName":"kind-worker","nodeSelector":{"cloud.google.com/gke-nodepool":"default-pool"}}}
  creationTimestamp: "2023-06-05T18:44:09Z"
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2023-06-05T18:44:58Z" <---- deletion timestamp is set
  name: ubuntu-sleep-2
  namespace: default
  resourceVersion: "570"
  uid: 5a8cbd76-f425-46a3-bffb-4cc4eec12557
spec:
  containers:
  - args:
    - infinity
    command:
    - /bin/sleep
    image: gcr.io/google-containers/ubuntu:14.04
    imagePullPolicy: IfNotPresent
    name: ubuntu
    resources:
      limits:
        cpu: "1"
        memory: 10Mi
      requests:
        cpu: "1"
        memory: 10Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-5c69j
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: kind-worker
  nodeSelector:
    cloud.google.com/gke-nodepool: default-pool
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-5c69j
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  phase: Pending
  qosClass: Guaranteed
Start the kubelet again:
$ docker exec -it kind-worker /bin/bash
root@kind-worker:/# systemctl start kubelet
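Optionally, the admission rejection can also be seen in the kubelet logs on the worker node (the exact log wording may vary between versions):
root@kind-worker:/# journalctl -u kubelet | grep -i NodeAffinity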
Observe the pod status; the pod is now in the Failed phase (as expected).
$ k get pod ubuntu-sleep-2 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"ubuntu-sleep-2","namespace":"default"},"spec":{"containers":[{"args":["infinity"],"command":["/bin/sleep"],"image":"gcr.io/google-containers/ubuntu:14.04","name":"ubuntu","resources":{"limits":{"cpu":"1","memory":"10Mi"},"requests":{"cpu":"1","memory":"10Mi"}}}],"nodeName":"kind-worker","nodeSelector":{"cloud.google.com/gke-nodepool":"default-pool"}}}
  creationTimestamp: "2023-06-05T18:44:09Z"
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2023-06-05T18:44:58Z"
  name: ubuntu-sleep-2
  namespace: default
  resourceVersion: "653"
  uid: 5a8cbd76-f425-46a3-bffb-4cc4eec12557
spec:
  containers:
  - args:
    - infinity
    command:
    - /bin/sleep
    image: gcr.io/google-containers/ubuntu:14.04
    imagePullPolicy: IfNotPresent
    name: ubuntu
    resources:
      limits:
        cpu: "1"
        memory: 10Mi
      requests:
        cpu: "1"
        memory: 10Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-5c69j
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: kind-worker
  nodeSelector:
    cloud.google.com/gke-nodepool: default-pool
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-5c69j
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  message: 'Pod was rejected: Predicate NodeAffinity failed'
  phase: Failed
  reason: NodeAffinity
  startTime: "2023-06-05T18:44:52Z"
However, the kubectl delete pod command hangs, and the pod is never force deleted by the kubelet.
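One way to observe the stuck state without leaving a hanging kubectl delete in the foreground is to print the deletion timestamp and phase together; both stay set indefinitely:
$ kubectl get pod ubuntu-sleep-2 -o jsonpath='{.metadata.deletionTimestamp}{" "}{.status.phase}{"\n"}'
As a manual workaround (not the expected automatic behavior), the pod can still be removed with a force delete:
$ kubectl delete pod ubuntu-sleep-2 --grace-period=0 --force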
Anything else we need to know?
No response
Kubernetes version
1.27
Cloud provider
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 27 (24 by maintainers)
I see; I'm going to open a dedicated DaemonSet issue to move the discussion on priority and the shape of the fix there. Let's focus here on the kubelet.
The fix was cherry-picked for the 1.27 branch in https://github.com/kubernetes/kubernetes/pull/118841, so it should be included in 1.27.4.
/retitle Pod graceful deletion hangs if the kubelet already rejected that pod at admission time
There are 2 admission phases for Pods (API server admission, kubelet admission). This issue is about the second phase only.
Long term this might be the best option: having every path go through the pod worker and the regular pod termination (i.e. syncTerminatingPod/syncTerminatedPod) instead of special-casing situations like admission rejection may be simpler to reason about.
Adding this to HandlePodCleanups is definitely one option, and I think we could expand it to cover this case. The only issue I see with handling it there is that there could potentially be a large number of pods that are in a terminal phase, do not have a deletion timestamp set, and are unknown to the pod worker and the pod runtime. For each of these pods we would have to call TerminatePod in every iteration of HandlePodCleanups until someone eventually set the deletion timestamp and the pod was deleted. Looking at the code, it should short-circuit here, but we would need to confirm.
@mimowo and I were also considering two other options:
1. TerminatePod during kubelet rejection. I think it should be fine overall, but it has a risk that after a kubelet restart, if a previous pod was still present in the runtime and running, the pod could be deleted by the status manager before the orphaned pod is terminated. https://github.com/kubernetes/kubernetes/pull/118599
2. syncPodKill by the pod worker during kubelet rejection. It goes through the pod worker and checks that all containers are terminated, so it should avoid the issue of the first approach. https://github.com/kubernetes/kubernetes/pull/118614
It sounds like option 2 or adding this to HandlePodCleanups is the optimal path forward.
Curious to get your thoughts on the options here.