kubernetes: Pod graceful deletion hangs if the kubelet already rejected that pod at admission time
What happened?
We observed the following series of events:
- Pod is created with some attributes that later result in the pod being rejected at kubelet admission time
- The pod is scheduled to a node
- The pod is gracefully deleted by a controller, i.e. a deletion timestamp is set on the pod
- The pod is rejected at kubelet admission time
- The pod never gets deleted and is stuck, despite having the deletion timestamp set. It seems that the kubelet does not issue the final force delete for the pod.
This is problematic in many scenarios, but one specific case where we hit it was a pod backed by a DaemonSet controller. Because the pod never got deleted in step 5 above, the DaemonSet controller did not create a replacement pod on the node, so the node ended up not running a replica of the DaemonSet.
This seems to be a regression in k8s 1.27; I was not able to reproduce this behavior in 1.26.
What did you expect to happen?
It is expected that during the series of events above, once the pod is rejected at kubelet admission time, the kubelet should forcefully delete the pod since it already has a deletion timestamp set. The pod should not hang in deletion forever.
How can we reproduce it (as minimally and precisely as possible)?
Create a 1.27 kind cluster:
kind delete cluster
kind_config="$(cat << EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
ipFamily: ipv4
nodes:
# the control plane node
- role: control-plane
- role: worker
kubeadmConfigPatches:
- |
kind: JoinConfiguration
nodeRegistration:
kubeletExtraArgs:
v: "4"
read-only-port: "10255"
EOF
)"
kind create cluster --config <(printf '%s\n' "${kind_config}") --image kindest/node:v1.27.1@sha256:b7d12ed662b873bd8510879c1846e87c7e676a79fefc93e17b2a52989d3ff42b
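Before stopping the kubelet it is worth confirming that the cluster came up and both nodes registered (a sanity check, not part of the original report):
# both nodes should show up and report Ready
kubectl get nodes -o wide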
Stop kubelet
$ docker exec -it kind-worker /bin/bash
root@kind-worker:/# systemctl stop kubelet
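Optionally confirm the kubelet is really stopped before creating the pod (not in the original steps):
# should print "inactive" once the unit has stopped
root@kind-worker:/# systemctl is-active kubelet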
Create a pod bound to the node which will trigger an admission error. The node selector label below does not exist, so we expect the pod to be rejected at kubelet admission.
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-sleep-2
spec:
  nodeSelector:
    this-label: does-not-exist
  nodeName: "kind-worker"
  containers:
  - name: ubuntu
    image: gcr.io/google-containers/ubuntu:14.04
    command: ["/bin/sleep"]
    args: ["infinity"]
    resources:
      requests:
        cpu: "1"
        memory: "10Mi"
      limits:
        cpu: "1"
        memory: "10Mi"
While the kubelet is down, issue a graceful deletion to set a deletion timestamp on the pod:
kubectl delete pod ubuntu-sleep-2
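With the kubelet stopped this command blocks waiting for the pod to disappear; if you want it to return immediately, the same graceful deletion can be issued without waiting (an equivalent alternative, not what the report ran):
# sends the delete request and returns without waiting for removal
kubectl delete pod ubuntu-sleep-2 --wait=false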
Verify deletion timestamp is set (pod is pending)
$ k get pod ubuntu-sleep-2 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"ubuntu-sleep-2","namespace":"default"},"spec":{"containers":[{"args":["infinity"],"command":["/bin/sleep"],"image":"gcr.io/google-containers/ubuntu:14.04","name":"ubuntu","resources":{"limits":{"cpu":"1","memory":"10Mi"},"requests":{"cpu":"1","memory":"10Mi"}}}],"nodeName":"kind-worker","nodeSelector":{"cloud.google.com/gke-nodepool":"default-pool"}}}
  creationTimestamp: "2023-06-05T18:44:09Z"
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2023-06-05T18:44:58Z" <---- deletion timestamp is set
  name: ubuntu-sleep-2
  namespace: default
  resourceVersion: "570"
  uid: 5a8cbd76-f425-46a3-bffb-4cc4eec12557
spec:
  containers:
  - args:
    - infinity
    command:
    - /bin/sleep
    image: gcr.io/google-containers/ubuntu:14.04
    imagePullPolicy: IfNotPresent
    name: ubuntu
    resources:
      limits:
        cpu: "1"
        memory: 10Mi
      requests:
        cpu: "1"
        memory: 10Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-5c69j
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: kind-worker
  nodeSelector:
    cloud.google.com/gke-nodepool: default-pool
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-5c69j
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  phase: Pending
  qosClass: Guaranteed
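Instead of reading the full object, the deletion timestamp can be checked directly (a convenience, not part of the original repro):
# prints the deletion timestamp if set, nothing otherwise
kubectl get pod ubuntu-sleep-2 -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'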
Start kubelet
$ docker exec -it kind-worker /bin/bash
root@kind-worker:/# systemctl start kubelet
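Optionally watch the pod transition as the kubelet comes back and rejects it at admission:
# STATUS should change from Pending shortly after the kubelet restarts
kubectl get pod ubuntu-sleep-2 -w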
Observe the pod status: the pod is now in the Failed phase (as expected).
$ k get pod ubuntu-sleep-2 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"ubuntu-sleep-2","namespace":"default"},"spec":{"containers":[{"args":["infinity"],"command":["/bin/sleep"],"image":"gcr.io/google-containers/ubuntu:14.04","name":"ubuntu","resources":{"limits":{"cpu":"1","memory":"10Mi"},"requests":{"cpu":"1","memory":"10Mi"}}}],"nodeName":"kind-worker","nodeSelector":{"cloud.google.com/gke-nodepool":"default-pool"}}}
  creationTimestamp: "2023-06-05T18:44:09Z"
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2023-06-05T18:44:58Z"
  name: ubuntu-sleep-2
  namespace: default
  resourceVersion: "653"
  uid: 5a8cbd76-f425-46a3-bffb-4cc4eec12557
spec:
  containers:
  - args:
    - infinity
    command:
    - /bin/sleep
    image: gcr.io/google-containers/ubuntu:14.04
    imagePullPolicy: IfNotPresent
    name: ubuntu
    resources:
      limits:
        cpu: "1"
        memory: 10Mi
      requests:
        cpu: "1"
        memory: 10Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-5c69j
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: kind-worker
  nodeSelector:
    cloud.google.com/gke-nodepool: default-pool
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-5c69j
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  message: 'Pod was rejected: Predicate NodeAffinity failed'
  phase: Failed
  reason: NodeAffinity
  startTime: "2023-06-05T18:44:52Z"
However, the kubectl delete pod command hangs; the pod is never force deleted by the kubelet.
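As a workaround, the stuck pod can be removed by forcing the deletion from the API server; this bypasses the kubelet and is the usual escape hatch, not a fix for the underlying bug:
# immediately removes the pod object from the API server
kubectl delete pod ubuntu-sleep-2 --force --grace-period=0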
Anything else we need to know?
No response
Kubernetes version
1.27
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 27 (24 by maintainers)
I see, I am going to open a dedicated DaemonSet issue to move the discussion on priority and the shape of the fix there. Let's focus here on the kubelet.
The fix was cherry-picked for the 1.27 branch in https://github.com/kubernetes/kubernetes/pull/118841, so it should be included in 1.27.4.
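For anyone checking whether a cluster already runs a patched kubelet, the per-node kubelet version is the relevant one (assuming, as stated above, the fix shipped in v1.27.4):
# prints each node name with its kubelet version
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'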
/retitle Pod graceful deletion hangs if the kubelet already rejected that pod at admission time
There are 2 admission phases for Pods (API server admission, kubelet admission). This issue is about the second phase only.
Long term this might be the best option: having every path go through the pod worker and the regular pod termination (i.e. syncTerminating/syncTerminatedPod), instead of special-casing situations like admission rejection, may be simpler to reason about.
Adding this to HandlePodCleanups is definitely one option and I think we could expand it to cover this case. The only issue I see with handling it there is that there could potentially be a large number of pods that are in a terminal phase, do not have a deletion timestamp set, and are unknown to the pod worker and the pod runtime. For each of these pods we would have to call TerminatePod in every iteration of HandlePodCleanups until someone eventually set the deletion timestamp and the pod was deleted. Looking at the code, it should short-circuit here, but we would need to confirm.
@mimowo and I were also considering two other options:
1. Call TerminatePod during kubelet rejection. I think it should be fine overall, but it has a risk that after a kubelet restart, if a previous pod was still known to the runtime and running, the pod could be deleted by the status manager before the orphaned pod is terminated. https://github.com/kubernetes/kubernetes/pull/118599
2. Trigger syncPodKill via the pod worker during kubelet rejection. It goes through the pod worker and checks that all containers are terminated, so it should avoid the issue of the first approach. https://github.com/kubernetes/kubernetes/pull/118614
It sounds like option 2, or adding this to HandlePodCleanups, is the optimal path forward.
Curious to get your thoughts on the options here.