kubernetes: Large kubemark performance tests failing with timeout during ns deletion

We’ve been seeing these failures continuously in kubemark-5000 for quite some time now - https://k8s-testgrid.appspot.com/sig-scalability#kubemark-5000 Even in kubemark-500 we’re occasionally seeing flakes - https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/8806

On first look I’m seeing that pods on hollow-nodes are going Pending during test namespace deletion:

I1002 02:45:51.134] Oct  2 02:45:51.134: INFO: POD                     NODE               PHASE    GRACE  CONDITIONS
I1002 02:45:51.135] Oct  2 02:45:51.134: INFO: load-small-14408-4r45s  hollow-node-lmblz  Pending  1s     [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC ContainersNotReady containers with unready status: [load-small-14408]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC  }]
I1002 02:45:51.135] Oct  2 02:45:51.134: INFO: load-small-14408-q8tfd  hollow-node-lm9c4  Pending  1s     [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC ContainersNotReady containers with unready status: [load-small-14408]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC  }]

Digging into it now. Sorry for the delay, I’ve been busy with release scalability validation.

cc @kubernetes/sig-scalability-bugs @wojtek-t @gmarek

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 81 (81 by maintainers)

Most upvoted comments

This looks like the culprit log line:

remote_runtime.go:246] RemoveContainer "4d3a382426ddc7b9" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "4d3a382426ddc7b9": container not stopped

It looks like even though we are returning an error when starting the container, we have already added that container to the “RunningContainers” list in the fake client during CreateContainer. I spoke with @Random-Liu, and we think the best solution for now is to change ListContainers to remove those containers which have not been started when options.All is not set. I will post a PR.
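To illustrate the proposed fix, here is a minimal, self-contained sketch of a fake runtime client (the type and method names are illustrative, not the actual kubelet code): CreateContainer registers a container without marking it running, and ListContainers hides containers that were never started unless an all/options.All-style flag is set.

```go
package main

import "fmt"

// ContainerState is a simplified stand-in for the CRI container states.
type ContainerState int

const (
	ContainerStateCreated ContainerState = iota
	ContainerStateRunning
	ContainerStateExited
)

type FakeContainer struct {
	ID    string
	State ContainerState
}

// FakeRuntime is a hypothetical model of the fake client discussed above.
type FakeRuntime struct {
	Containers map[string]*FakeContainer
}

// CreateContainer registers the container but leaves it in the created
// state; it only becomes running via StartContainer.
func (f *FakeRuntime) CreateContainer(id string) {
	f.Containers[id] = &FakeContainer{ID: id, State: ContainerStateCreated}
}

// StartContainer marks a previously created container as running.
func (f *FakeRuntime) StartContainer(id string) {
	if c, ok := f.Containers[id]; ok {
		c.State = ContainerStateRunning
	}
}

// ListContainers skips containers that were created but never started
// unless all is true, so a container whose start failed is not reported
// as running (the behavior the proposed PR adds).
func (f *FakeRuntime) ListContainers(all bool) []*FakeContainer {
	var out []*FakeContainer
	for _, c := range f.Containers {
		if !all && c.State != ContainerStateRunning {
			continue
		}
		out = append(out, c)
	}
	return out
}

func main() {
	rt := &FakeRuntime{Containers: map[string]*FakeContainer{}}
	rt.CreateContainer("a") // created, but start never succeeds
	rt.CreateContainer("b")
	rt.StartContainer("b")

	fmt.Println(len(rt.ListContainers(false))) // started containers only
	fmt.Println(len(rt.ListContainers(true)))  // everything, including unstarted
}
```

With this filtering, a RemoveContainer issued for the never-started container "a" would no longer race against a container the fake client wrongly reports as running.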

Closing this as we finally have the kubemark-5000 job green - https://k8s-testgrid.appspot.com/sig-scalability-kubemark#kubemark-5000 (run 737) Thanks everyone for the effort!

/close

I think the best approach may be to ensure that we do not delete pods which are not yet running on hollow kubelets.

We are not deleting those pods - we’re just deleting the RCs, and the pods are being deleted in the background. In general, I’m not convinced that this is the best idea, as we shouldn’t artificially change the behavior of our tests just to make them pass for kubemark. We’re trying to simulate a real cluster using kubemark, and imposing a restriction like ‘deleting a pod is allowed only after it is running’ on the latter would cause behavioral differences between the two (even neglecting the failing e2e test).

My guess now is that the mocked kubelet allows creation of a container even if the pod’s container is removed, since we do not mock container networking.

In that case we should be mocking the same behaviour on hollow-kubelet too. Can this be done somehow?

@wojtek-t - WDYT about this?

It would be ideal if we could simulate these in sched_perf by adding large amounts of the CheckNodePredicate stuff.