kubernetes: Pods go into error state after kubelet-initiated eviction
**What happened**:

When a pod is evicted by the kubelet under resource pressure, it briefly shows as Evicted and then goes into an Error state.

**What you expected to happen**:

The pod terminates normally.
**How to reproduce it (as minimally and precisely as possible)**:

Note that the 130G file size is (90% of total disk space) minus current utilization, chosen so that allocating the file crosses the default disk-pressure eviction threshold.
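For reference, a rough Go sketch of that arithmetic is below. It is illustrative only and not part of the reproducer; the `/` mount point and the 90% figure (the default `nodefs.available<10%` threshold) are assumptions you may need to adjust for your node.

```go
// Rough sketch: derive an allocation size as (90% of total disk space)
// minus current utilization, so the file pushes nodefs usage past the
// assumed default eviction threshold.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var st syscall.Statfs_t
	// "/" is an assumption; use the filesystem backing the kubelet's nodefs.
	if err := syscall.Statfs("/", &st); err != nil {
		panic(err)
	}
	total := st.Blocks * uint64(st.Bsize)          // total bytes on the filesystem
	used := total - st.Bfree*uint64(st.Bsize)      // bytes already in use
	target := int64(float64(total)*0.9) - int64(used)
	fmt.Printf("allocate roughly %d bytes, e.g.: fallocate -l %d file\n", target, target)
}
```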
```console
ehashman@fedora:~/src/k8s$ cat ~/tmp/disk-pressure.yml
kind: Pod
apiVersion: v1
metadata:
  name: evictme
spec:
  terminationGracePeriodSeconds: 10
  containers:
  - name: busybox
    image: k8s.gcr.io/e2e-test-images/nginx:1.14-2
    command: ["sh", "-c", "fallocate -l 130G file; sleep 100000"]
ehashman@fedora:~/src/k8s$ cluster/kubectl.sh get pod
No resources found in default namespace.
ehashman@fedora:~/src/k8s$ cluster/kubectl.sh apply -f ~/tmp/disk-pressure.yml
pod/evictme created
ehashman@fedora:~/src/k8s$ cluster/kubectl.sh get pod
NAME      READY   STATUS    RESTARTS   AGE
evictme   1/1     Running   0          3s
ehashman@fedora:~/src/k8s$ cluster/kubectl.sh get pod
NAME      READY   STATUS    RESTARTS   AGE
evictme   1/1     Evicted   0          10s
ehashman@fedora:~/src/k8s$ cluster/kubectl.sh get pod
NAME      READY   STATUS   RESTARTS   AGE
evictme   0/1     Error    0          21s
```
Full pod spec after going into error state:
```console
ehashman@fedora:~/src/k8s$ cluster/kubectl.sh get pod -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"evictme","namespace":"default"},"spec":{"containers":[{"command":["sh","-c","fallocate -l 130G file; sleep 10000"],"image":"k8s.gcr.io/e2e-test-images/nginx:1.14-2","name":"busybox"}],"terminationGracePeriodSeconds":10}}
    creationTimestamp: "2021-10-01T19:10:20Z"
    name: evictme
    namespace: default
    resourceVersion: "519"
    uid: e9e27b78-d722-4c2c-96f3-7a22877d75bb
  spec:
    containers:
    - command:
      - sh
      - -c
      - fallocate -l 130G file; sleep 10000
      image: k8s.gcr.io/e2e-test-images/nginx:1.14-2
      imagePullPolicy: IfNotPresent
      name: busybox
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-dctrp
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: 127.0.0.1
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 10
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: kube-api-access-dctrp
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2021-10-01T19:10:59Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2021-10-01T19:11:13Z"
      message: 'containers with unready status: [busybox]'
      reason: ContainersNotReady
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2021-10-01T19:11:13Z"
      message: 'containers with unready status: [busybox]'
      reason: ContainersNotReady
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2021-10-01T19:10:59Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: docker://ea630f3b1b7e83763df0693d0e617b475433486462abd0ec01611f146c883516
      image: k8s.gcr.io/e2e-test-images/nginx:1.14-2
      imageID: docker-pullable://k8s.gcr.io/e2e-test-images/nginx@sha256:13616070e3f29de4417eee434a8ef472221c9e51b3d037b5a6b46cef08eb7443
      lastState: {}
      name: busybox
      ready: false
      restartCount: 0
      started: false
      state:
        terminated:
          containerID: docker://ea630f3b1b7e83763df0693d0e617b475433486462abd0ec01611f146c883516
          exitCode: 137
          finishedAt: "2021-10-01T19:11:13Z"
          reason: Error
          startedAt: "2021-10-01T19:11:00Z"
    hostIP: 127.0.0.1
    phase: Running
    podIP: 172.17.0.3
    podIPs:
    - ip: 172.17.0.3
    qosClass: BestEffort
    startTime: "2021-10-01T19:10:59Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```
**Anything else we need to know?**:

See also https://bugzilla.redhat.com/show_bug.cgi?id=1997478#c25

Eviction tests (https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-eviction) are currently broken (https://github.com/kubernetes/kubernetes/issues/102782); there's a PR up to fix them: https://github.com/kubernetes/kubernetes/pull/104304
**Environment**:

- Kubernetes version (use `kubectl version`): HEAD/1.23.x (kubeletVersion="v1.23.0-alpha.3.47+d551560a78292e")
- Cloud provider or hardware configuration: local-up-cluster
- OS (e.g. `cat /etc/os-release`), Kernel (e.g. `uname -a`):

  ```console
  ehashman@fedora:~/src/k8s$ cat /etc/os-release
  PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
  NAME="Debian GNU/Linux"
  VERSION_ID="11"
  VERSION="11 (bullseye)"
  VERSION_CODENAME=bullseye
  ID=debian
  HOME_URL="https://www.debian.org/"
  SUPPORT_URL="https://www.debian.org/support"
  BUG_REPORT_URL="https://bugs.debian.org/"
  ehashman@fedora:~/src/k8s$ uname -a
  Linux fedora 5.10.0-8-amd64 #1 SMP Debian 5.10.46-4 (2021-08-03) x86_64 GNU/Linux
  ```

- Install tools: local-up-cluster.sh
- Network plugin and version (if this is a network-related bug): N/A
- Others: also reproducible on OpenShift 4.9 (1.23-based)
/sig node
/cc @rphillips @smarterclayton
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 18 (18 by maintainers)
The containers ARE terminated; we just haven't retrieved their status. This is because of a decision not to refresh the container status cache in `syncTerminatingPod`. So `syncTerminatingPod` is not a bug.
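To picture that explanation, here is a deliberately over-simplified Go sketch of the pattern being described; none of these names are the kubelet's actual types or functions, they only stand in for "runtime state" vs. "cached status":

```go
// Illustrative sketch only (not kubelet source): the runtime has already
// terminated the container, but because the cached status is not refreshed
// between the "terminating" and "terminated" sync steps, the later step
// still sees a running container.
package main

import "fmt"

type containerState string

const (
	stateRunning    containerState = "running"
	stateTerminated containerState = "terminated"
)

// node holds the source of truth (runtime) and what the sync steps read (cache).
type node struct {
	runtime containerState
	cache   containerState
}

func (n *node) refreshCache() { n.cache = n.runtime }

// terminatingSketch: eviction kills the container in the runtime, but
// deliberately does not refresh the cache.
func (n *node) terminatingSketch() { n.runtime = stateTerminated }

// terminatedSketch: reads the (possibly stale) cache.
func (n *node) terminatedSketch() containerState { return n.cache }

func main() {
	n := &node{runtime: stateRunning, cache: stateRunning}
	n.terminatingSketch()
	fmt.Println("status seen by terminated sync:", n.terminatedSketch()) // "running" (stale)
	n.refreshCache()
	fmt.Println("after a cache refresh:", n.terminatedSketch()) // "terminated"
}
```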
From what I can tell, the issue is that `syncTerminatedPod` is being passed a pod that still has running containers (per the kubelet logs).

The `case status.IsTerminated()` branch checks `isTerminatedAt`, which gets set as part of `completeTerminating`. It looks like that will always run so long as `syncTerminatingPod` succeeds. Thus, I feel like `syncTerminatingPod` must have the bug: it returns successfully, but the containers of the pod aren't all actually terminated.
Roughly: kubelet `syncTerminatingPod` sends "Failed" to the status manager (which looks like it generates the correct apiserver patch for phase Running -> Failed), then somehow by the time kubelet `syncTerminatedPod` is called it sends "Running but with errors" to the status manager, and the status manager allows that transition (internally) and then sends it to the apiserver. So the pod goes Running -> Failed -> Running, which should never happen (we "lost" that the pod was stopped).
Bug 1: the pod status patch probably shouldn't allow terminal -> non-terminal phase transitions (didn't we fix that??).

Bug 2: the pod worker is supposed to be getting the most up-to-date status from the status manager before calling `syncTerminatedPod`, which should include "Failed", but that isn't happening? Thus the status manager then sends a "non-terminal state" (which should never happen). A sketch of the Bug 1 invariant follows.
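For reference, the invariant in Bug 1 is easy to state in code. This is a minimal sketch of the kind of guard being asked for, not the actual apiserver or status-manager validation code; only the `k8s.io/api/core/v1` phase constants are real:

```go
// Sketch of the Bug 1 invariant: once a pod has reported a terminal phase
// (Succeeded or Failed), a status update must not move it back to a
// non-terminal phase. Illustrative only.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func isTerminal(p corev1.PodPhase) bool {
	return p == corev1.PodSucceeded || p == corev1.PodFailed
}

// validatePhaseTransition rejects the kind of transition seen in this
// issue (Failed -> Running).
func validatePhaseTransition(oldPhase, newPhase corev1.PodPhase) error {
	if isTerminal(oldPhase) && !isTerminal(newPhase) {
		return fmt.Errorf("invalid phase transition %s -> %s: terminal phases are final", oldPhase, newPhase)
	}
	return nil
}

func main() {
	if err := validatePhaseTransition(corev1.PodFailed, corev1.PodRunning); err != nil {
		fmt.Println(err) // invalid phase transition Failed -> Running: terminal phases are final
	}
}
```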
Roughly: figure out why the status passed to `syncTerminatedPod` is older than what was sent by `syncTerminatingPod`, then figure out why the status manager allowed that transition back to Running, then figure out why the apiserver didn't reject it.