kubernetes: kubectl get pods --watch doesn't always report terminal status with exitCode during Deployment update
What happened:
When updating an existing Deployment (kubectl replace -f deployment.yaml), new pods are created and old pods are killed. We constantly run kubectl get pods --watch for internal monitoring; however, sometimes (roughly 30-50% of the cases) we don't receive the final update for the old pod that is being stopped. We expect something like:
"state": {
"terminated": {
"containerID": "docker://86714d667e1468048d982f2776020e4dc572985031eee75246f3c7f5a0280734",
"exitCode": 0,
"finishedAt": "2021-08-05T07:48:13Z",
"reason": "Completed",
"startedAt": "2021-08-05T07:45:56Z"
}
}
Instead, the last status update for such a pod is weird:
"state": {
"waiting": {
"reason": "ContainerCreating"
}
}
(Complete logs below.) We really need to know the exitCode of all terminated containers/pods during the upgrade.
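For illustration, here is a minimal sketch of a watcher along these lines (an assumption on the setup: the official kubernetes Python client, default namespace, and a local kubeconfig; this is not our exact monitoring code). It can only see the exit code if the final update for a pod carries state.terminated:

```python
from kubernetes import client, config, watch

# Sketch: print the exit code of every container that reaches state.terminated.
# If the last event for a pod only shows "waiting: ContainerCreating", the exit
# code is never observed, which is exactly the problem described above.
config.load_kube_config()
v1 = client.CoreV1Api()

for event in watch.Watch().stream(v1.list_namespaced_pod, namespace="default"):
    pod = event["object"]
    for cs in pod.status.container_statuses or []:
        if cs.state and cs.state.terminated:
            print(event["type"], pod.metadata.name, cs.name,
                  "exitCode =", cs.state.terminated.exit_code)
```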
What you expected to happen:
Receive the exitCode (included in the final pod status update notification) for all pods during a Deployment upgrade.
How to reproduce it (as minimally and precisely as possible):
It is hard to reproduce consistently, but it happens to us for at least one pod in almost every Deployment update.
Deployment spec: deployment.yml
I am attaching 3 logs (all done via kubectl get pods --output json --watch-only --output-watch-events | jq 'del(.object.metadata.managedFields)'):
- Initial creation of the Deployment (not really interesting).
- Logs from a Deployment upgrade that went OK (I changed resources.limits.memory to trigger the upgrade, using kubectl replace), without problems. All pods correctly reported the final terminated status with exitCode. This log is mainly for the complete picture; nothing much interesting there. All pods start terminating (a MODIFIED event with deletionTimestamp appears), then they shift from state.running to state.terminated with the attached exit code.
- Logs from the second (faulty) upgrade. <-- THIS is the important one. Again I just changed the memory using kubectl replace, and one pod doesn't report the exitCode at all. This one is interesting (a small checker for these watch logs is sketched right after this list):
  - 3 pods are running before the upgrade: example-deployment-5f846c8bc5-kzht5, example-deployment-5f846c8bc5-n49g2, example-deployment-5f846c8bc5-k8ppq
  - The first two of them start terminating (deletionTimestamp appears, then their containerStatuses change to state.terminated).
  - The last one (k8ppq) starts terminating as well (deletionTimestamp appears), but it never reports how it ended; its last status is the weird ContainerCreating (line 3375), with no state.terminated and no attached exit code.
- I am also attaching one additional log of events (done by kubectl get events --watch-only --output-watch-events --output json | jq 'del(.object.metadata.managedFields)'). But there is nothing useful there (at least nothing I am aware of).
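As promised above, a small sketch for scanning such watch logs (assuming the log file holds the concatenated JSON stream produced by the kubectl get pods command above; the file handling is an illustration, not part of the original report). It lists pods that received a deletionTimestamp but never reported state.terminated:

```python
import json
import sys

# Sketch: read a watch-events log (concatenated JSON documents) from stdin and
# report pods that started terminating (deletionTimestamp set) but never showed
# a state.terminated entry in their containerStatuses.
decoder = json.JSONDecoder()
text = sys.stdin.read()
deleting, terminated = set(), set()

pos = 0
while pos < len(text):
    while pos < len(text) and text[pos].isspace():
        pos += 1  # skip whitespace between concatenated JSON documents
    if pos >= len(text):
        break
    event, pos = decoder.raw_decode(text, pos)
    pod = event["object"]
    name = pod["metadata"]["name"]
    if pod["metadata"].get("deletionTimestamp"):
        deleting.add(name)
    for cs in pod.get("status", {}).get("containerStatuses") or []:
        if "terminated" in cs.get("state", {}):
            terminated.add(name)

print("terminating pods that never reported state.terminated:", deleting - terminated)
```

Usage would be something like python check_watch_log.py < watch-log.json (script name is hypothetical).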
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.13-eks-8df270", GitCommit:"8df2700a72a2598fa3a67c05126fa158fd839620", GitTreeState:"clean", BuildDate:"2021-07-31T01:36:57Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration:
AWS EKS, self-managed worker nodes on EC2 instances with https://cloud-images.ubuntu.com/docs/aws/eks/
- OS (e.g. cat /etc/os-release):
Ubuntu 20.04.2 LTS
- Kernel (e.g. uname -a):
5.4.0-1037-aws
- Install tools:
AWS CloudFormation templates - mostly from official AWS docs
- Network plugin and version (if this is a network-related bug):
Calico CNI
calico-node - docker.io/calico/cni:v3.19.1
tigera-operator - quay.io/tigera/operator:v1.17.4
- Others:
About this issue
- State: open
- Created 3 years ago
- Reactions: 1
- Comments: 23 (12 by maintainers)
@SergeyKanzhelev @bydga I have scripted reproducing the issue and confirmed it on 1.21 and 1.22. However, it appears fixed on 1.23 and 1.25.
To reproduce, I run a loop of 10 retries of create / replace / delete of a deployment and check whether the output (out.txt, see below) contains "The container could not be located when the pod was deleted" or "ContainerStatusUnknown". On one terminal I run the command that collects out.txt; on the other terminal I run python repro.py against a cluster with 3 nodes:
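Roughly, such a repro loop could look like the sketch below (the real repro.py is not reproduced here; the file names, sleep durations, and the out.txt check are assumptions for illustration only):

```python
import subprocess
import time

# Hypothetical repro loop: create / replace / delete a Deployment 10 times and
# check the collected watch output (out.txt) for the suspicious status strings.
# File names and sleep durations are assumptions, not the actual repro.py.
BAD_MARKERS = (
    "The container could not be located when the pod was deleted",
    "ContainerStatusUnknown",
)

for attempt in range(10):
    subprocess.run(["kubectl", "create", "-f", "deployment.yaml"], check=True)
    time.sleep(60)  # wait for the pods to become Ready
    subprocess.run(["kubectl", "replace", "-f", "deployment-changed.yaml"], check=True)
    time.sleep(60)  # wait for the rollout to replace the old pods
    subprocess.run(["kubectl", "delete", "-f", "deployment-changed.yaml"], check=True)
    time.sleep(60)  # wait for the remaining pods to terminate

    with open("out.txt") as f:
        if any(marker in f.read() for marker in BAD_MARKERS):
            print(f"reproduced on attempt {attempt + 1}")
            break
```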
@codearky here is the syslog from the "faulty" upgrade.
Quick summary:
3 pods were running:
hera-example-api-kube-bydga-daemon-695b4c6f9-4hlss
hera-example-api-kube-bydga-daemon-695b4c6f9-6vt29
hera-example-api-kube-bydga-daemon-695b4c6f9-t594g
they were replaced during the upgrade with 3 new ones:
hera-example-api-kube-bydga-daemon-85949bbbc4-v5955
hera-example-api-kube-bydga-daemon-85949bbbc4-rzwfl
hera-example-api-kube-bydga-daemon-85949bbbc4-l9b5k
the one which failed to report the exitCode is
hera-example-api-kube-bydga-daemon-695b4c6f9-6vt29
I’m attaching 2 syslogs, because we had two servers for this test. The faulty pod is in the first one: syslog1.txt syslog2.txt
Also attaching the output of the corresponding kubectl get pods -w: watchpods.txt
I was able to reproduce this, but it feels like these false container status reports originate in the kubelet (?). Will keep investigating and update. /assign