kubernetes: Large kubemark performance tests failing with timeout during ns deletion
We’ve been seeing these failures continuously in kubemark-5000 for quite some time now - https://k8s-testgrid.appspot.com/sig-scalability#kubemark-5000. Even in kubemark-500 we’re occasionally seeing flakes - https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/8806
On first look, I’m seeing that pods on hollow-nodes are stuck Pending during test namespace deletion:
I1002 02:45:51.134] Oct 2 02:45:51.134: INFO: POD NODE PHASE GRACE CONDITIONS
I1002 02:45:51.135] Oct 2 02:45:51.134: INFO: load-small-14408-4r45s hollow-node-lmblz Pending 1s [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC ContainersNotReady containers with unready status: [load-small-14408]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC }]
I1002 02:45:51.135] Oct 2 02:45:51.134: INFO: load-small-14408-q8tfd hollow-node-lm9c4 Pending 1s [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC ContainersNotReady containers with unready status: [load-small-14408]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC }]
Digging into it now. Sorry for the delay, I’ve been busy with release scalability validation.
About this issue
- State: closed
- Created 7 years ago
- Comments: 81 (81 by maintainers)
Commits related to this issue
- Merge pull request #53793 from wojtek-t/separate_leader_election_in_scheduler Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions… — committed to kubernetes/kubernetes by deleted user 7 years ago
- Merge pull request #53720 from shyamjvs/test-kubemark Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions… — committed to kubernetes/kubernetes by deleted user 7 years ago
- Merge pull request #53720 from shyamjvs/test-kubemark Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions… — committed to sttts/apimachinery by k8s-publish-robot 7 years ago
- Merge pull request #53989 from shyamjvs/use-counter-in-scheduler Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions… — committed to kubernetes/kubernetes by deleted user 7 years ago
- Merge pull request #55026 from dashpole/network_mock_docker Automatic merge from submit-queue (batch tested with PRs 55893, 55906, 55026). If you want to cherry-pick this change to another branch, please follow the instructions… — committed to kubernetes/kubernetes by deleted user 7 years ago
- Merge pull request #56821 from dashpole/fake_client_running_containers Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions… — committed to kubernetes/kubernetes by deleted user 7 years ago
- Merge pull request #53720 from shyamjvs/test-kubemark Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions… — committed to akhilerm/apimachinery by k8s-publishing-bot 7 years ago
This looks like the culprit log line:
remote_runtime.go:246] RemoveContainer "4d3a382426ddc7b9" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "4d3a382426ddc7b9": container not stopped
It looks like even though we are returning an error when starting the container, we have already added that container to the “RunningContainers” list in the fake client during CreateContainer. I spoke with @Random-Liu, and we think the best solution for now is to change ListContainers so that, when options.All is not set, it filters out containers that were created but never started. I will post a PR.

Closing this as we finally have the kubemark-5000 job green - https://k8s-testgrid.appspot.com/sig-scalability-kubemark#kubemark-5000 (run 737). Thanks everyone for the effort!
/close
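To make the proposed ListContainers change concrete, here is a minimal sketch of the idea, assuming a hypothetical FakeRuntime type (illustrative only, not the actual fake client code in kubelet): containers are recorded on CreateContainer, but ListContainers only reports them once StartContainer has marked them as started, unless All is requested.

```go
package fakeruntime

import "sync"

// fakeContainer tracks whether a created container was ever started.
type fakeContainer struct {
	ID      string
	Started bool
}

// FakeRuntime is a hypothetical stand-in for the hollow-kubelet's fake
// container client, used only for this sketch.
type FakeRuntime struct {
	mu         sync.Mutex
	containers map[string]*fakeContainer
}

// ListOptions mirrors the idea of a "list all" flag: when All is false,
// only containers that were actually started are returned.
type ListOptions struct {
	All bool
}

// CreateContainer records the container but does not mark it as started.
func (f *FakeRuntime) CreateContainer(id string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.containers == nil {
		f.containers = map[string]*fakeContainer{}
	}
	f.containers[id] = &fakeContainer{ID: id}
}

// StartContainer marks a previously created container as started.
func (f *FakeRuntime) StartContainer(id string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if c, ok := f.containers[id]; ok {
		c.Started = true
	}
}

// ListContainers hides created-but-never-started containers unless All is
// set, so a failed start does not leave behind a phantom "running"
// container that RemoveContainer later refuses to delete ("container not stopped").
func (f *FakeRuntime) ListContainers(opts ListOptions) []string {
	f.mu.Lock()
	defer f.mu.Unlock()
	ids := make([]string, 0, len(f.containers))
	for _, c := range f.containers {
		if opts.All || c.Started {
			ids = append(ids, c.ID)
		}
	}
	return ids
}
```

The point of filtering at list time is that a container whose start returned an error never appears as running, so namespace teardown is not left waiting on a container that cannot be stopped.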
We are not deleting those pods - we’re just deleting the RCs; the pods are deleted in the background. In general, I’m not convinced this is the best idea: we shouldn’t artificially change the behavior of our tests just to make them pass for kubemark. We’re trying to simulate a real cluster with kubemark, and imposing a restriction like ‘deleting a pod is allowed only after it is running’ on kubemark would cause behavioral differences between the two (even setting aside the failing e2e test).
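For illustration, deleting an RC with background propagation looks roughly like the sketch below; the namespace name and kubeconfig handling are made up for the example, and the exact Delete signature differs across client-go versions. The RC object goes away immediately, while its pods are garbage-collected asynchronously afterwards, which is why pods from the load test can still be observed (including in Pending) during teardown.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig; error handling kept minimal.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Background propagation: the RC is deleted right away and the garbage
	// collector removes its pods afterwards, rather than in the foreground.
	policy := metav1.DeletePropagationBackground
	if err := client.CoreV1().ReplicationControllers("load-test-ns").Delete(
		context.TODO(),
		"load-small-14408",
		metav1.DeleteOptions{PropagationPolicy: &policy},
	); err != nil {
		log.Fatal(err)
	}
}
```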
In that case we should be mocking the same behaviour on hollow-kubelet too. Can this be done somehow?
@wojtek-t - WDYT about this?
It would be ideal if we could simulate these in sched_perf by adding large amounts of the CheckNodePredicate stuff.