kubernetes: Pods go into error state after kubelet-initiated eviction

What happened:

When a pod is evicted by the kubelet under resource pressure, it transitions into an Error state instead of remaining Evicted.

What you expected to happen:

The pod terminates normally and remains in the Evicted (Failed) state.

How to reproduce it (as minimally and precisely as possible):

Note that the 130GB file size is calculated as 90% of total disk space minus current utilization, which is what is needed to cross the default eviction threshold.
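
For reference, here is a rough Go sketch of that arithmetic; the 90% figure corresponds to the default nodefs.available<10% hard eviction threshold, while the path and rounding below are assumptions, not part of the actual repro:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Stat the filesystem the kubelet uses for nodefs (the path is an assumption).
	var st syscall.Statfs_t
	if err := syscall.Statfs("/", &st); err != nil {
		panic(err)
	}
	total := st.Blocks * uint64(st.Bsize) // total capacity in bytes
	avail := st.Bavail * uint64(st.Bsize) // currently available bytes
	used := total - avail
	// Allocate enough to push usage past 90% of capacity, i.e. below the
	// default nodefs.available<10% hard eviction threshold.
	target := int64(total/10*9) - int64(used)
	fmt.Printf("fallocate -l %d file\n", target)
}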

ehashman@fedora:~/src/k8s$ cat ~/tmp/disk-pressure.yml 
kind: Pod
apiVersion: v1
metadata:
  name: evictme
spec:
  terminationGracePeriodSeconds: 10
  containers:
  - name: busybox
    image: k8s.gcr.io/e2e-test-images/nginx:1.14-2
    command: ["sh", "-c", "fallocate -l 130G file; sleep 100000"]

ehashman@fedora:~/src/k8s$ cluster/kubectl.sh get pod
No resources found in default namespace.

ehashman@fedora:~/src/k8s$ cluster/kubectl.sh apply -f ~/tmp/disk-pressure.yml 
pod/evictme created

ehashman@fedora:~/src/k8s$ cluster/kubectl.sh get pod
NAME      READY   STATUS    RESTARTS   AGE
evictme   1/1     Running   0          3s

ehashman@fedora:~/src/k8s$ cluster/kubectl.sh get pod
NAME      READY   STATUS    RESTARTS   AGE
evictme   1/1     Evicted   0          10s

ehashman@fedora:~/src/k8s$ cluster/kubectl.sh get pod
NAME      READY   STATUS   RESTARTS   AGE
evictme   0/1     Error    0          21s

Full pod spec and status after the pod goes into the error state:

ehashman@fedora:~/src/k8s$ cluster/kubectl.sh get pod -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"evictme","namespace":"default"},"spec":{"containers":[{"command":["sh","-c","fallocate -l 130G file; sleep 10000"],"image":"k8s.gcr.io/e2e-test-images/nginx:1.14-2","name":"busybox"}],"terminationGracePeriodSeconds":10}}
    creationTimestamp: "2021-10-01T19:10:20Z"
    name: evictme
    namespace: default
    resourceVersion: "519"
    uid: e9e27b78-d722-4c2c-96f3-7a22877d75bb
  spec:
    containers:
    - command:
      - sh
      - -c
      - fallocate -l 130G file; sleep 10000
      image: k8s.gcr.io/e2e-test-images/nginx:1.14-2
      imagePullPolicy: IfNotPresent
      name: busybox
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-dctrp
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: 127.0.0.1
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 10
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: kube-api-access-dctrp
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2021-10-01T19:10:59Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2021-10-01T19:11:13Z"
      message: 'containers with unready status: [busybox]'
      reason: ContainersNotReady
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2021-10-01T19:11:13Z"
      message: 'containers with unready status: [busybox]'
      reason: ContainersNotReady
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2021-10-01T19:10:59Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: docker://ea630f3b1b7e83763df0693d0e617b475433486462abd0ec01611f146c883516
      image: k8s.gcr.io/e2e-test-images/nginx:1.14-2
      imageID: docker-pullable://k8s.gcr.io/e2e-test-images/nginx@sha256:13616070e3f29de4417eee434a8ef472221c9e51b3d037b5a6b46cef08eb7443
      lastState: {}
      name: busybox
      ready: false
      restartCount: 0
      started: false
      state:
        terminated:
          containerID: docker://ea630f3b1b7e83763df0693d0e617b475433486462abd0ec01611f146c883516
          exitCode: 137
          finishedAt: "2021-10-01T19:11:13Z"
          reason: Error
          startedAt: "2021-10-01T19:11:00Z"
    hostIP: 127.0.0.1
    phase: Running
    podIP: 172.17.0.3
    podIPs:
    - ip: 172.17.0.3
    qosClass: BestEffort
    startTime: "2021-10-01T19:10:59Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Anything else we need to know?:

See also https://bugzilla.redhat.com/show_bug.cgi?id=1997478#c25

Eviction tests (https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-eviction) are currently broken (https://github.com/kubernetes/kubernetes/issues/102782); there’s a PR up to fix them: https://github.com/kubernetes/kubernetes/pull/104304

Environment:

  • Kubernetes version (use kubectl version): HEAD/1.23.x (kubeletVersion="v1.23.0-alpha.3.47+d551560a78292e")
  • Cloud provider or hardware configuration: local-up-cluster
  • OS (e.g: cat /etc/os-release), Kernel (e.g. uname -a):
ehashman@fedora:~/src/k8s$ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
ehashman@fedora:~/src/k8s$ uname -a
Linux fedora 5.10.0-8-amd64 #1 SMP Debian 5.10.46-4 (2021-08-03) x86_64 GNU/Linux
  • Install tools: local-up-cluster.sh
  • Network plugin and version (if this is a network-related bug): N/A
  • Others: Also reproducible on OpenShift 4.9 (1.23-based)

/sig node /cc @rphillips @smarterclayton

Most upvoted comments

The containers ARE terminated; we just haven’t retrieved their status. This is because of a decision not to refresh the container status cache in syncTerminatingPod. So syncTerminatingPod is not a bug.

From what I can tell, the issue is that syncTerminatedPod is being passed a pod that still has running containers. From kubelet logs:

I0930 14:15:06.651815  871656 kubelet.go:1822] "Pod termination stopped all running containers" pod="default/evictme" podUID=04cc53c0-801a-4c36-ae0e-07b0efb3dca2
I0930 14:15:06.651821  871656 kubelet.go:1824] "syncTerminatingPod exit" pod="default/evictme" podUID=04cc53c0-801a-4c36-ae0e-07b0efb3dca2
I0930 14:15:06.651827  871656 pod_workers.go:868] "Pod terminated all containers successfully" pod="default/evictme" podUID=04cc53c0-801a-4c36-ae0e-07b0efb3dca2
I0930 14:15:06.651836  871656 pod_workers.go:833] "Processing pod event done" pod="default/evictme" podUID=04cc53c0-801a-4c36-ae0e-07b0efb3dca2 updateType=1
I0930 14:15:06.651841  871656 pod_workers.go:742] "Processing pod event" pod="default/evictme" podUID=04cc53c0-801a-4c36-ae0e-07b0efb3dca2 updateType=2
...
I0930 14:15:07.244734  871656 kubelet_pods.go:931] "Pod is terminated, but some container status has not yet been reported" pod="default/evictme" running=1

The case status.IsTerminated() branch checks terminatedAt, which gets set as part of completeTerminating. It looks like that will always run as long as syncTerminatingPod succeeds.

Thus, I feel like syncTerminatingPod must have the bug: it returns successfully, but the containers of the pod aren’t all actually terminated.
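
To make the shape of that check concrete, here is a heavily simplified sketch of the bookkeeping described above. The names mirror the comment (podSyncStatus, terminatedAt, IsTerminated, completeTerminating), but this is an illustration of the described behavior, not the actual pod_workers.go code:

package main

import (
	"fmt"
	"time"
)

// podSyncStatus is a stripped-down stand-in for the pod worker's per-pod state.
type podSyncStatus struct {
	terminatedAt time.Time // stamped once the worker believes all containers stopped
}

// IsTerminated answers based only on the recorded timestamp; it does not
// re-check the container runtime.
func (s *podSyncStatus) IsTerminated() bool {
	return !s.terminatedAt.IsZero()
}

// completeTerminating runs whenever syncTerminatingPod returns without error,
// even if a stale status cache still reports running containers.
func (s *podSyncStatus) completeTerminating() {
	s.terminatedAt = time.Now()
}

func main() {
	s := &podSyncStatus{}
	s.completeTerminating()
	fmt.Println(s.IsTerminated()) // true, regardless of actual runtime state
}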

Roughly: the kubelet’s syncTerminatingPod sends “Failed” to the status manager (which looks like it generates the correct apiserver patch for the Running -> Failed phase transition), but somehow by the time syncTerminatedPod is called, the kubelet sends “Running but with errors” to the status manager, and the status manager allows that transition (internally) and then sends it to the apiserver. So the pod goes Running -> Failed -> Running, which should never happen (we “lost” the fact that the pod was stopped).

  • Bug 1: the pod status patch probably shouldn’t allow terminal -> non-terminal phase transitions (didn’t we fix that??); a minimal sketch of that check follows below.
  • Bug 2: the pod worker is supposed to be getting the most up-to-date status from the status manager before calling syncTerminatedPod, which should include “Failed”, but that isn’t happening; thus the status manager then sends a “non-terminal state” (which shouldn’t ever happen).
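
As an illustration of the invariant Bug 1 asks for, here is a minimal sketch using the k8s.io/api/core/v1 phase constants; the helper names are hypothetical and this is not the apiserver’s actual validation code:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// isTerminal reports whether a pod phase is final.
func isTerminal(p v1.PodPhase) bool {
	return p == v1.PodSucceeded || p == v1.PodFailed
}

// validatePhaseTransition rejects the transitions described above,
// e.g. Failed -> Running, which should never be accepted.
func validatePhaseTransition(oldPhase, newPhase v1.PodPhase) error {
	if isTerminal(oldPhase) && !isTerminal(newPhase) {
		return fmt.Errorf("invalid phase transition %s -> %s: terminal phases are final", oldPhase, newPhase)
	}
	return nil
}

func main() {
	// The sequence from this bug: the status manager let Failed go back to Running.
	fmt.Println(validatePhaseTransition(v1.PodFailed, v1.PodRunning))
}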

Roughly: figure out why the status passed to syncTerminatedPod is older than what was sent by syncTerminatingPod, then why the status manager allowed that transition back to Running, then why the apiserver didn’t reject it.