kubernetes: Inconsistent POD status reporting
What happened?
I run a service mesh sidecar (Istio in my case) in Pods that are created and controlled by Jobs. The "main" container shuts down the Istio sidecar via an API call to 127.0.0.1 and then exits with the actual application exit code. The issue is that when the "main" container finishes with an error, the Pod STATUS often shows Completed in kubectl get pod.
NAME                            READY   STATUS      RESTARTS   AGE
job-istio-proxy-test--1-zdlc6   0/2     Completed   0          47s
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:11Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:26Z"
    message: 'containers with unready status: [somejob istio-proxy]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:26Z"
    message: 'containers with unready status: [somejob istio-proxy]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:10Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://ccf54140202e07f2bd37151dee171fc744158ff81c4a58426bad0518f8dd6c6d
    image: docker.io/istio/proxyv2:1.12.2
    imageID: docker.io/istio/proxyv2@sha256:f26717efc7f6e0fe928760dd353ed004ea35444f5aa6d41341a003e7610cd26f
    lastState: {}
    name: istio-proxy
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://ccf54140202e07f2bd37151dee171fc744158ff81c4a58426bad0518f8dd6c6d
        exitCode: 0
        finishedAt: "2022-01-24T10:04:25Z"
        reason: Completed
        startedAt: "2022-01-24T10:04:13Z"
  - containerID: containerd://1f515cb65e4c3a8f206ae0cbbe19720fb4e734361ec6740156f53e1f5e002278
    image: docker.io/amouat/network-utils:latest
    imageID: docker.io/amouat/network-utils@sha256:c4da08f9dac831b8f83ffc63f4a7f327754e20aeac1e9ae68d7727ccc25b8172
    lastState: {}
    name: somejob
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://1f515cb65e4c3a8f206ae0cbbe19720fb4e734361ec6740156f53e1f5e002278
        exitCode: 1
        finishedAt: "2022-01-24T10:04:30Z"
        reason: Error
        startedAt: "2022-01-24T10:04:13Z"
  hostIP: 10.10.140.140
  initContainerStatuses:
  - containerID: containerd://a2c5c43f2730d7d16892b2197d438c87e1c25a9fd322e639e6a2b9702c881c0a
    image: docker.io/istio/proxyv2:1.12.2
    imageID: docker.io/istio/proxyv2@sha256:f26717efc7f6e0fe928760dd353ed004ea35444f5aa6d41341a003e7610cd26f
    lastState: {}
    name: istio-validation
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://a2c5c43f2730d7d16892b2197d438c87e1c25a9fd322e639e6a2b9702c881c0a
        exitCode: 0
        finishedAt: "2022-01-24T10:04:11Z"
        reason: Completed
        startedAt: "2022-01-24T10:04:11Z"
  phase: Failed
  podIP: 10.10.177.125
  podIPs:
  - ip: 10.10.177.125
  qosClass: Burstable
  startTime: "2022-01-24T10:04:10Z"
What did you expect to happen?
It should show STATUS Error when one of the containers in the Pod fails. Note that status.phase in the output above is already Failed; it is only the STATUS column printed by kubectl get pod that is misleading.
I believe this is because the Pod STATUS column is calculated incorrectly: the printer loops over pod.Status.ContainerStatuses and keeps overwriting the reason, so the displayed value is simply whichever container's terminated reason happens to be assigned last, with no preference for Error over Completed:
https://github.com/kubernetes/kubernetes/blob/5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e/pkg/printers/internalversion/printers.go#L812-L813
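For context, here is a minimal, self-contained sketch of that selection behaviour. It uses stand-in types rather than the real corev1 structs and is simplified from the loop in printers.go, so treat it as illustrative only:

// Simplified sketch of how the STATUS column can end up as "Completed":
// the loop walks the container statuses from the end of the slice to the
// start and overwrites reason on every terminated container, so the reason
// that survives is just the last one assigned, not the "worst" one.
package main

import "fmt"

type terminated struct{ Reason string }

type containerState struct{ Terminated *terminated }

type containerStatus struct {
	Name  string
	State containerState
}

func displayReason(phase string, statuses []containerStatus) string {
	reason := phase
	for i := len(statuses) - 1; i >= 0; i-- {
		c := statuses[i]
		if c.State.Terminated != nil && c.State.Terminated.Reason != "" {
			reason = c.State.Terminated.Reason
		}
	}
	return reason
}

func main() {
	// Same ordering as in the pod status above: istio-proxy first, somejob second.
	statuses := []containerStatus{
		{Name: "istio-proxy", State: containerState{Terminated: &terminated{Reason: "Completed"}}},
		{Name: "somejob", State: containerState{Terminated: &terminated{Reason: "Error"}}},
	}
	fmt.Println(displayReason("Failed", statuses)) // prints "Completed"
}

With the statuses ordered as in the output above (istio-proxy first, somejob second), Completed is the reason that survives, which matches what kubectl prints.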
The workaround for this situation is to name the actual application container so that it starts with early letters of the alphabet (e.g. abc...) and the sidecar with late ones (e.g. xyz...), so that the application container's reason is the one that ends up being displayed.
How can we reproduce it (as minimally and precisely as possible)?
Test job
apiVersion: batch/v1
kind: Job
metadata:
  name: job-istio-proxy-test
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 600
  template:
    metadata:
      labels:
        sidecar.istio.io/inject: "true"
    spec:
      containers:
      - name: somejob
        image: amouat/network-utils:latest
        command:
        - /bin/bash
        - -c
        - |
          # Wait for the sidecar to be ready
          until curl -fsSI -o /dev/null http://localhost:15021/healthz/ready; do
            echo "Waiting for Sidecar..."
            sleep 2
          done
          echo "Sidecar available. Running the command..."
          # Simulate some useful job
          sleep 10
          # Simulate job failure
          false
          # Shut down the sidecar and return the job exit code
          ret=$?
          echo "Command completed. Terminating sidecar..."
          curl -fsSI -o /dev/null -X POST http://localhost:15000/quitquitquit
          sleep 5
          exit $ret
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 10m
            memory: 256Mi
      restartPolicy: Never
      securityContext:
        runAsUser: 65000
        runAsGroup: 65000
Anything else we need to know?
Istio version used - 1.12.2
Kubernetes version
1.22.5
Cloud provider
OS version
Ubuntu 20.04.3
5.4.0-86-generic #97-Ubuntu SMP Fri Sep 17 19:19:40 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 1
- Comments: 19 (9 by maintainers)
Yeah, I agree. So I added a hasError flag; if hasError is true, the Pod STATUS will be reset to Error.
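For illustration, a rough sketch of what such a hasError-based change could look like (hypothetical code with stand-in types, not the actual patch to printers.go):

// Hypothetical illustration of the proposed fix: remember whether any
// container terminated with a non-zero exit code while walking the
// statuses, and let that override whichever reason was assigned last.
package main

import "fmt"

type terminated struct {
	Reason   string
	ExitCode int
}

type containerStatus struct {
	Name       string
	Terminated *terminated
}

func displayReason(phase string, statuses []containerStatus) string {
	reason := phase
	hasError := false

	for i := len(statuses) - 1; i >= 0; i-- {
		c := statuses[i]
		if c.Terminated == nil {
			continue
		}
		if c.Terminated.Reason != "" {
			reason = c.Terminated.Reason
		}
		if c.Terminated.ExitCode != 0 {
			hasError = true // at least one container failed
		}
	}

	// If any container failed, report Error regardless of which container's
	// reason happened to be assigned last.
	if hasError {
		reason = "Error"
	}
	return reason
}

func main() {
	statuses := []containerStatus{
		{Name: "istio-proxy", Terminated: &terminated{Reason: "Completed", ExitCode: 0}},
		{Name: "somejob", Terminated: &terminated{Reason: "Error", ExitCode: 1}},
	}
	fmt.Println(displayReason("Failed", statuses)) // prints "Error"
}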