kubernetes: Large kubemark performance tests failing with timeout during ns deletion
We’ve been seeing these failures continuously in kubemark-5000 for quite some time now - https://k8s-testgrid.appspot.com/sig-scalability#kubemark-5000. Even in kubemark-500 we’re occasionally seeing flakes - https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/8806
On first look, I’m seeing that pods on hollow-nodes are stuck Pending during test namespace deletion:
I1002 02:45:51.134] Oct 2 02:45:51.134: INFO: POD NODE PHASE GRACE CONDITIONS
I1002 02:45:51.135] Oct 2 02:45:51.134: INFO: load-small-14408-4r45s hollow-node-lmblz Pending 1s [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC ContainersNotReady containers with unready status: [load-small-14408]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC }]
I1002 02:45:51.135] Oct 2 02:45:51.134: INFO: load-small-14408-q8tfd hollow-node-lm9c4 Pending 1s [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC ContainersNotReady containers with unready status: [load-small-14408]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-10-01 20:26:46 +0000 UTC }]
Digging into it now. Sorry for the delay, I’ve been busy with release scalability validation.
About this issue
- State: closed
- Created 7 years ago
- Comments: 81 (81 by maintainers)
Commits related to this issue
- Merge pull request #53793 from wojtek-t/separate_leader_election_in_scheduler Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions… — committed to kubernetes/kubernetes by deleted user 7 years ago
- Merge pull request #53720 from shyamjvs/test-kubemark Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions… — committed to kubernetes/kubernetes by deleted user 7 years ago
- Merge pull request #53720 from shyamjvs/test-kubemark Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions… — committed to sttts/apimachinery by k8s-publish-robot 7 years ago
- Merge pull request #53989 from shyamjvs/use-counter-in-scheduler Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions… — committed to kubernetes/kubernetes by deleted user 7 years ago
- Merge pull request #55026 from dashpole/network_mock_docker Automatic merge from submit-queue (batch tested with PRs 55893, 55906, 55026). If you want to cherry-pick this change to another branch, please follow the instructions… — committed to kubernetes/kubernetes by deleted user 7 years ago
- Merge pull request #56821 from dashpole/fake_client_running_containers Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions… — committed to kubernetes/kubernetes by deleted user 7 years ago
- Merge pull request #53720 from shyamjvs/test-kubemark Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions… — committed to akhilerm/apimachinery by k8s-publishing-bot 7 years ago
This looks like the culprit log line:
remote_runtime.go:246] RemoveContainer "4d3a382426ddc7b9" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "4d3a382426ddc7b9": container not stopped
It looks like even though we are returning an error when starting the container, we have already added that container to the “RunningContainers” list in the fake client during CreateContainer. I spoke with @Random-Liu, and we think the best solution for now is to change ListContainers so that, when options.All is not set, it filters out containers that were created but never started. I will post a PR.

Closing this as we finally have the kubemark-5000 job green - https://k8s-testgrid.appspot.com/sig-scalability-kubemark#kubemark-5000 (run 737). Thanks everyone for the effort!
/close
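To make the proposed ListContainers change concrete, here is a minimal sketch of the idea, assuming a hypothetical FakeRuntime type (illustrative only, not the actual fake client code in kubelet): containers are recorded on CreateContainer, but ListContainers only reports them once StartContainer has marked them as started, unless All is requested.

```go
package fakeruntime

import "sync"

// fakeContainer tracks whether a created container was ever started.
type fakeContainer struct {
	ID      string
	Started bool
}

// FakeRuntime is a hypothetical stand-in for the hollow-kubelet's fake
// container client, used only for this sketch.
type FakeRuntime struct {
	mu         sync.Mutex
	containers map[string]*fakeContainer
}

// ListOptions mirrors the idea of a "list all" flag: when All is false,
// only containers that were actually started are returned.
type ListOptions struct {
	All bool
}

// CreateContainer records the container but does not mark it as started.
func (f *FakeRuntime) CreateContainer(id string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.containers == nil {
		f.containers = map[string]*fakeContainer{}
	}
	f.containers[id] = &fakeContainer{ID: id}
}

// StartContainer marks a previously created container as started.
func (f *FakeRuntime) StartContainer(id string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if c, ok := f.containers[id]; ok {
		c.Started = true
	}
}

// ListContainers hides created-but-never-started containers unless All is
// set, so a failed start does not leave behind a phantom "running"
// container that RemoveContainer later refuses to delete ("container not stopped").
func (f *FakeRuntime) ListContainers(opts ListOptions) []string {
	f.mu.Lock()
	defer f.mu.Unlock()
	ids := make([]string, 0, len(f.containers))
	for _, c := range f.containers {
		if opts.All || c.Started {
			ids = append(ids, c.ID)
		}
	}
	return ids
}
```

The point of filtering at list time is that a container whose start returned an error never appears as running, so namespace teardown is not left waiting on a container that cannot be stopped.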
We are not deleting those pods - we’re just deleting the RCs; the pods are deleted in the background. In general, I’m not convinced this is the best idea: we shouldn’t artificially change the behavior of our tests just to make them pass for kubemark. We’re trying to simulate a real cluster with kubemark, and imposing a restriction like ‘deleting a pod is allowed only after it is running’ on kubemark would cause behavioral differences between the two (even setting aside the failing e2e test).
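For illustration, deleting an RC with background propagation looks roughly like the sketch below; the namespace name and kubeconfig handling are made up for the example, and the exact Delete signature differs across client-go versions. The RC object goes away immediately, while its pods are garbage-collected asynchronously afterwards, which is why pods from the load test can still be observed (including in Pending) during teardown.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig; error handling kept minimal.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Background propagation: the RC is deleted right away and the garbage
	// collector removes its pods afterwards, rather than in the foreground.
	policy := metav1.DeletePropagationBackground
	if err := client.CoreV1().ReplicationControllers("load-test-ns").Delete(
		context.TODO(),
		"load-small-14408",
		metav1.DeleteOptions{PropagationPolicy: &policy},
	); err != nil {
		log.Fatal(err)
	}
}
```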
In that case we should be mocking the same behaviour on hollow-kubelet too. Can this be done somehow?
@wojtek-t - WDYT about this?
It would be ideal if we could simulate these in sched_perf by adding large amounts of the CheckNodePredicate stuff.