kubernetes: [Failing Test] gce-master-scale-correctness: [sig-cli] Kubectl client Simple pod should return command exit codes

Which jobs are failing: gce-master-scale-correctness

Which test(s) are failing: [sig-cli] Kubectl client Simple pod should return command exit codes

Since when has it been failing: 02/05

Testgrid link: https://k8s-testgrid.appspot.com/sig-release-master-informing#gce-master-scale-correctness

Reason for failure:

Feb  5 12:32:40.354: Unexpected error:
    <exec.CodeExitError>: {
        Err: {
            s: "error running /workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://35.243.250.2 --kubeconfig=/workspace/.kube/config --namespace=kubectl-3790 run -i --image=docker.io/library/busybox:1.29 --restart=Never failure-4 --leave-stdin-open -- /bin/sh -c exit 42:\nCommand stdout:\n\nstderr:\nError from server (NotFound): pods \"failure-4\" not found\n\nerror:\nexit status 1",
        },
        Code: 1,
    }
    error running /workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://35.243.250.2 --kubeconfig=/workspace/.kube/config --namespace=kubectl-3790 run -i --image=docker.io/library/busybox:1.29 --restart=Never failure-4 --leave-stdin-open -- /bin/sh -c exit 42:
    Command stdout:
    
    stderr:
    Error from server (NotFound): pods "failure-4" not found
    
    error:
    exit status 1

Anything else we need to know:

/sig cli

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 24 (24 by maintainers)

Most upvoted comments

Actually I went ahead and fixed a couple of other commands too, so it's no longer a one-liner. But it's still a short one 😃

OK - I think I know what is going on here. Here is what is happening (for the test case from two comments above):

And waitForPod is broken. The problem is the preconditionFunc (https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kubectl/pkg/cmd/run/run.go#L460), which is passed to the UntilWithSync function.

Under the hood, UntilWithSync creates an informer and calls that function against its state: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/watch/until.go#L143

But the informer is initialized from the apiserver watch cache, so it may be lagging. In large clusters we are observing lags on the order of a few hundred milliseconds, and that seems to be enough.

What is happening is that:

  • the informer is initialized (but the pod creation is not yet reflected in it)
  • the preconditionFunc is called
  • the informer's state doesn't contain that pod (yet)
  • the whole thing fails with the error above

I believe that simply removing the preconditionFunc should solve the problem. I opened https://github.com/kubernetes/kubernetes/pull/90417 for that.