kubernetes: [Failing Test] gce-master-scale-correctness: [sig-cli] Kubectl client Simple pod should return command exit codes
Which jobs are failing: gce-master-scale-correctness
Which test(s) are failing: [sig-cli] Kubectl client Simple pod should return command exit codes
Since when has it been failing: 02/05
Testgrid link: https://k8s-testgrid.appspot.com/sig-release-master-informing#gce-master-scale-correctness
Reason for failure:
Feb 5 12:32:40.354: Unexpected error:
<exec.CodeExitError>: {
Err: {
s: "error running /workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://35.243.250.2 --kubeconfig=/workspace/.kube/config --namespace=kubectl-3790 run -i --image=docker.io/library/busybox:1.29 --restart=Never failure-4 --leave-stdin-open -- /bin/sh -c exit 42:\nCommand stdout:\n\nstderr:\nError from server (NotFound): pods \"failure-4\" not found\n\nerror:\nexit status 1",
},
Code: 1,
}
error running /workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://35.243.250.2 --kubeconfig=/workspace/.kube/config --namespace=kubectl-3790 run -i --image=docker.io/library/busybox:1.29 --restart=Never failure-4 --leave-stdin-open -- /bin/sh -c exit 42:
Command stdout:
stderr:
Error from server (NotFound): pods "failure-4" not found
error:
exit status 1
Anything else we need to know:
/sig cli
About this issue
- State: closed
- Created 4 years ago
- Comments: 24 (24 by maintainers)
Actually I went ahead and fixed a couple of other commands too, so it's no longer a one-liner. But it's still a short one 😃
OK - I think I know what is going on here. Here is what is happening (for the test case from two comments above):
waitForPod is broken. The problem is the preconditionFunc: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kubectl/pkg/cmd/run/run.go#L460, which is passed to the UntilWithSync function below.
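For reference, here is a minimal sketch of a precondition of that kind (illustrative, not a verbatim copy of run.go; the helper name is mine): it is evaluated against the informer's local store and fails with NotFound if the pod is not there yet.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// podExistsPrecondition is a sketch of waitForPod's precondition: it runs
// against the informer's local store (not the apiserver directly) and aborts
// the wait with NotFound if the pod is not present in that store yet.
func podExistsPrecondition(namespace, name string) watchtools.PreconditionFunc {
	return func(store cache.Store) (bool, error) {
		_, exists, err := store.Get(&metav1.ObjectMeta{Namespace: namespace, Name: name})
		if err != nil {
			return true, err
		}
		if !exists {
			// The store is populated from the apiserver watch cache; if that
			// cache lags behind the pod creation, this branch fires and the
			// caller sees `pods "<name>" not found`.
			return true, errors.NewNotFound(corev1.Resource("pods"), name)
		}
		return false, nil
	}
}
```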
Under the hood, UntilWithSync creates an informer and calls that precondition function against its state: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/watch/until.go#L143
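And roughly how such a call looks from the caller's side. This is a usage sketch, not the actual run.go call site: waitForPodSketch and the variable names are mine, and it assumes a recent client-go whose List/Watch methods take a context.

```go
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// waitForPodSketch waits for a single pod to satisfy `condition`.
// UntilWithSync builds an informer from the given ListerWatcher, syncs it,
// runs the precondition against the synced store, and only then starts
// evaluating `condition` against incoming watch events.
func waitForPodSketch(client kubernetes.Interface, ns, name string,
	precondition watchtools.PreconditionFunc, condition watchtools.ConditionFunc) (*corev1.Pod, error) {

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	sel := fields.OneTermEqualSelector("metadata.name", name).String()
	lw := &cache.ListWatch{
		ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
			options.FieldSelector = sel
			return client.CoreV1().Pods(ns).List(ctx, options)
		},
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			options.FieldSelector = sel
			return client.CoreV1().Pods(ns).Watch(ctx, options)
		},
	}

	ev, err := watchtools.UntilWithSync(ctx, lw, &corev1.Pod{}, precondition, condition)
	if err != nil {
		return nil, err
	}
	return ev.Object.(*corev1.Pod), nil
}
```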
But the informer is initialized from the apiserver watch cache, so it may be lagging. In large clusters we are observing lags on the order of a few hundred milliseconds, and that seems to be enough.
What is happening is that the pod has already been created by kubectl run, but because the informer's store is fed from the lagging watch cache, the preconditionFunc does not see the pod yet and immediately fails with the NotFound error from the log above, instead of waiting for the pod to show up.
I actually believe that just removing the preconditionFunc should solve the problem. I opened https://github.com/kubernetes/kubernetes/pull/90417 for that.
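If that is the direction, then in the sketch above the change would amount to passing nil for the precondition (UntilWithSync tolerates a nil precondition; the variables here are the ones from the earlier sketch):

```go
// Without a precondition there is no check against the possibly-lagging
// local store; we simply wait for watch events matching `condition`
// (or for ctx to expire).
ev, err := watchtools.UntilWithSync(ctx, lw, &corev1.Pod{}, nil, condition)
```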