kubernetes: Inconsistent Pod status reporting

What happened?

I run a service mesh sidecar (in my case Istio) in Pods that are created and controlled by Jobs. The "main" container shuts down the Istio sidecar via an API call to 127.0.0.1 and exits with the actual application exit code. The issue is that when the "main" container finishes with an error, the Pod status often displays Completed when queried via kubectl get pod.

NAME                            READY   STATUS      RESTARTS   AGE
job-istio-proxy-test--1-zdlc6   0/2     Completed   0          47s
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:11Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:26Z"
    message: 'containers with unready status: [somejob istio-proxy]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:26Z"
    message: 'containers with unready status: [somejob istio-proxy]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:10Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://ccf54140202e07f2bd37151dee171fc744158ff81c4a58426bad0518f8dd6c6d
    image: docker.io/istio/proxyv2:1.12.2
    imageID: docker.io/istio/proxyv2@sha256:f26717efc7f6e0fe928760dd353ed004ea35444f5aa6d41341a003e7610cd26f
    lastState: {}
    name: istio-proxy
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://ccf54140202e07f2bd37151dee171fc744158ff81c4a58426bad0518f8dd6c6d
        exitCode: 0
        finishedAt: "2022-01-24T10:04:25Z"
        reason: Completed
        startedAt: "2022-01-24T10:04:13Z"
  - containerID: containerd://1f515cb65e4c3a8f206ae0cbbe19720fb4e734361ec6740156f53e1f5e002278
    image: docker.io/amouat/network-utils:latest
    imageID: docker.io/amouat/network-utils@sha256:c4da08f9dac831b8f83ffc63f4a7f327754e20aeac1e9ae68d7727ccc25b8172
    lastState: {}
    name: somejob
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://1f515cb65e4c3a8f206ae0cbbe19720fb4e734361ec6740156f53e1f5e002278
        exitCode: 1
        finishedAt: "2022-01-24T10:04:30Z"
        reason: Error
        startedAt: "2022-01-24T10:04:13Z"
  hostIP: 10.10.140.140
  initContainerStatuses:
  - containerID: containerd://a2c5c43f2730d7d16892b2197d438c87e1c25a9fd322e639e6a2b9702c881c0a
    image: docker.io/istio/proxyv2:1.12.2
    imageID: docker.io/istio/proxyv2@sha256:f26717efc7f6e0fe928760dd353ed004ea35444f5aa6d41341a003e7610cd26f
    lastState: {}
    name: istio-validation
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://a2c5c43f2730d7d16892b2197d438c87e1c25a9fd322e639e6a2b9702c881c0a
        exitCode: 0
        finishedAt: "2022-01-24T10:04:11Z"
        reason: Completed
        startedAt: "2022-01-24T10:04:11Z"
  phase: Failed
  podIP: 10.10.177.125
  podIPs:
  - ip: 10.10.177.125
  qosClass: Burstable
  startTime: "2022-01-24T10:04:10Z"

What did you expect to happen?

It should report status Error when one of the containers in the Pod fails. I believe the Pod STATUS column is calculated incorrectly: the printer loop overwrites the reason on each pass over pod.Status.ContainerStatuses, so the column ends up showing the terminated reason of whichever container is processed last instead of reflecting all containers (https://github.com/kubernetes/kubernetes/blob/5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e/pkg/printers/internalversion/printers.go#L812-L813). The workaround for this situation is to name the actual application container with early letters (e.g. abc) and the sidecar with late ones (e.g. xyz), so that the application container's reason is the one that wins.
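
To make the behaviour concrete, here is a minimal runnable sketch of that printer logic (simplified stand-in types and a hypothetical main, not the actual kubernetes/kubernetes code), fed with the two container statuses from the output above:

package main

import "fmt"

// Simplified stand-ins for the relevant parts of the pod status
// (illustration only, not the real corev1 types).
type Terminated struct {
    ExitCode int32
    Reason   string
}

type ContainerState struct {
    Terminated *Terminated
}

type ContainerStatus struct {
    Name  string
    State ContainerState
}

func main() {
    // Container statuses as reported above: the sidecar exited 0 ("Completed"),
    // the application container exited 1 ("Error").
    statuses := []ContainerStatus{
        {Name: "istio-proxy", State: ContainerState{Terminated: &Terminated{ExitCode: 0, Reason: "Completed"}}},
        {Name: "somejob", State: ContainerState{Terminated: &Terminated{ExitCode: 1, Reason: "Error"}}},
    }

    reason := "Failed" // the pod phase reported in the status above

    // Sketch of the printer loop: each match overwrites `reason`, so only the
    // reason of the container processed last survives -- "Completed" from
    // istio-proxy here, even though somejob terminated with exit code 1.
    for i := len(statuses) - 1; i >= 0; i-- {
        if t := statuses[i].State.Terminated; t != nil && t.Reason != "" {
            reason = t.Reason
        }
    }

    fmt.Println("STATUS column:", reason) // prints: STATUS column: Completed
}

Swapping the order of the two entries makes the sketch print Error instead, which is roughly what the renaming workaround relies on.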

How can we reproduce it (as minimally and precisely as possible)?

Test job

apiVersion: batch/v1
kind: Job
metadata:
  name: job-istio-proxy-test
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 600
  template:
    metadata:
      labels:
        sidecar.istio.io/inject: "true"
    spec:
      containers:
      - name: somejob
        image: amouat/network-utils:latest
        command:
        - /bin/bash
        - -c
        - |
          # Wait for sidecar to be ready
          until curl -fsSI -o /dev/null http://localhost:15021/healthz/ready; do echo "Waiting for Sidecar..."; sleep 2; done; echo "Sidecar available. Running the command..."
          # Simulate some useful job
          sleep 10
          # Simulate job failure
          false
          # Shutdown sidecar and return job exit code
          ret=$?; echo "Command completed. Terminating sidecar..."; curl -fsSI -o /dev/null -X POST http://localhost:15000/quitquitquit; sleep 5; exit $ret
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 10m
            memory: 256Mi
      restartPolicy: Never
      securityContext:
        runAsUser: 65000
        runAsGroup: 65000
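
Applying this Job in a cluster where the Istio sidecar injector is installed (for example with kubectl apply -f job-istio-proxy-test.yaml; the filename is just an example) and waiting for it to finish reproduces the output above: somejob terminates with exitCode 1 and the pod phase is Failed, yet kubectl get pod shows STATUS Completed.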

Anything else we need to know?

Istio version used - 1.12.2

Kubernetes version

1.22.5

Cloud provider

On premise

OS version

Ubuntu 20.04.3
5.4.0-86-generic #97-Ubuntu SMP Fri Sep 17 19:19:40 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Install tools

kubespray 1.18.0

Container runtime (CRI) and version (if applicable)

containerd 1.5.9

Related plugins (CNI, CSI, …) and versions (if applicable)

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 19 (9 by maintainers)

Most upvoted comments

@kkkkun Hi, any news? The PR is waiting for review: https://github.com/kubernetes/kubernetes/pull/107865

I think the pod status should be ‘Completed’ only when container.State.Terminated.ExitCode == 0 && len(container.State.Terminated.Reason) != 0. Otherwise, it should be Error. https://github.com/kkkkun/kubernetes/blob/f44a6791e8d072cfbba2b77528bf6ff5b4336ffc/pkg/printers/internalversion/printers.go#L814

But the Pod status should be Error if any of the containers exits with a non-zero exit code.

Yes, I agree. So I added hasError: if hasError is true, the pod status is reset to Error.
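
A minimal sketch of that proposal, reusing the simplified ContainerStatus stand-ins from the earlier sketch (printedStatus is a hypothetical helper name, not the actual code in the linked PR):

// Sketch of the proposed fix: any container that terminated with a non-zero
// exit code forces the printed status to "Error", regardless of the order in
// which the statuses are processed.
func printedStatus(phase string, statuses []ContainerStatus) string {
    reason := phase
    hasError := false
    for i := len(statuses) - 1; i >= 0; i-- {
        t := statuses[i].State.Terminated
        if t == nil {
            continue
        }
        if t.ExitCode == 0 && t.Reason != "" {
            reason = t.Reason // e.g. "Completed"
        } else {
            hasError = true
        }
    }
    if hasError {
        reason = "Error"
    }
    return reason
}

With the two container statuses from the earlier sketch, printedStatus("Failed", statuses) returns Error instead of Completed.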