kubernetes: Pod is removed from store but the containers are not terminated
What happened: The Pod is removed from the store, but the associated containers can continue running on the Node for a very long time.
What you expected to happen: I would expect consistent behaviour: when a Pod is removed from the store, the associated containers should be terminated.
How to reproduce it (as minimally and precisely as possible):
- Apply the following Pod:
apiVersion: v1
kind: Pod
metadata:
  name: alpine
spec:
  activeDeadlineSeconds: 30
  containers:
  - command:
    - sh
    - -c
    - sleep 3600
    image: alpine:3.10.3
    imagePullPolicy: IfNotPresent
    name: alpine
  terminationGracePeriodSeconds: 600
- Ensure that after 30s (.spec.activeDeadlineSeconds) the Pod has .status.phase=Failed and .status.reason=DeadlineExceeded, and that the container receives the SIGTERM signal at this point (verification commands are sketched after the docker ps output below).
- Delete the Pod after it is DeadlineExceeded:
  $ k delete po alpine
- Ensure that the deletion completes right away and the Pod is removed from the store.
- Ensure that the associated containers continue to run on the Node until .spec.terminationGracePeriodSeconds has passed:
/ # docker ps | grep alpine
f2fdf243db1a alpine "sh -c 'sleep 3600'" 3 minutes ago Up 3 minutes k8s_alpine_alpine_default_c8aa37a1-d248-4831-a06f-e9ac4bac4a62_0
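For reference, steps 2 and 4 can be checked like this (the commands are standard kubectl; the output shown is illustrative rather than captured from this cluster):

$ kubectl get pod alpine -o jsonpath='{.status.phase} {.status.reason}'
Failed DeadlineExceeded

$ kubectl get pod alpine
Error from server (NotFound): pods "alpine" not found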
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): v1.15.10
  $ k version
  Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.10", GitCommit:"1bea6c00a7055edef03f1d4bb58b773fa8917f11", GitTreeState:"clean", BuildDate:"2020-02-11T20:05:26Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration:
- OS (e.g: cat /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Network plugin and version (if this is a network-related bug):
- Others:
About this issue
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 40 (30 by maintainers)
/triage accepted
/area kubelet
/area docker
https://github.com/kubernetes/kubernetes/pull/98507 may fix this. @gjkim42 could you help review the fix?
The problem is that a pod, after activeDeadlineSeconds, goes into the PodFailed phase before all of its containers are killed. The other parts of kubernetes assume that the containers of a pod in the PodFailed phase are already killed, so they conclude that the pod can be deleted immediately.
https://github.com/kubernetes/kubernetes/blob/525b8e5cd6d410034058397b282386f21cbc2f20/pkg/kubelet/kubelet_pods.go#L1469-L1476
https://kubernetes.io/docs/concepts/workloads/pods/_print/#pod-phase
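To illustrate the shortcut being described, here is a minimal Go sketch; it is not the actual kubelet source linked above, and the function name and exact checks are simplified assumptions:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// podIsTerminal is a simplified stand-in for the phase-based check:
// a pod whose .status.phase is Succeeded or Failed is treated as
// terminal, even if the runtime is still killing its containers.
func podIsTerminal(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed
}

func main() {
	pod := &v1.Pod{Status: v1.PodStatus{Phase: v1.PodFailed}}
	// Reports true as soon as activeDeadlineSeconds flips the phase,
	// even though the containers may still be in their grace period.
	fmt.Println(podIsTerminal(pod))
}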
Maybe we need to redefine the PodFailed phase, or make the pod go into the PodFailed phase only after all its containers have terminated.
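A minimal sketch of the second option, assuming (my reading, not necessarily the commenter's) that "all containers have terminated" can be derived from .status.containerStatuses:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// allContainersTerminated reports whether every container in the pod
// status has actually reached the Terminated state. The idea: only
// move the pod into PodFailed once this returns true, so that
// PodFailed really means "nothing is running any more".
func allContainersTerminated(status v1.PodStatus) bool {
	if len(status.ContainerStatuses) == 0 {
		return false // no status reported yet; be conservative
	}
	for _, cs := range status.ContainerStatuses {
		if cs.State.Terminated == nil {
			return false
		}
	}
	return true
}

func main() {
	running := v1.PodStatus{ContainerStatuses: []v1.ContainerStatus{
		{State: v1.ContainerState{Running: &v1.ContainerStateRunning{}}},
	}}
	// false: the container is still running out its grace period,
	// so the pod would not yet be allowed into PodFailed.
	fmt.Println(allContainersTerminated(running))
}

With an ordering like this, deleting a DeadlineExceeded pod could not race ahead of the termination grace period, because the terminal phase would only be reported after the runtime has finished killing the containers.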