kubernetes: [flaky] SchedulingThroughput - SchedulingThroughput error: scheduler throughput: actual throughput 81.200000 lower than threshold 90.000000
Which jobs are flaking:
- ci-kubernetes-e2e-gci-gce-scalability
- ci-kubernetes-e2e-gci-gce-scalability-networkpolicies
- pull-kubernetes-e2e-gce-100-performance
Which test(s) are flaking:
testing/density/config.yaml
SchedulingThroughput error: scheduler throughput: actual throughput 81.800000 lower than threshold 90.000000
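For context, the 90 pods/s threshold in that error comes from the test's clusterloader2 config. Below is a minimal sketch of how the SchedulingThroughput measurement is typically wired up; the step names and the threshold parameter are assumptions here, and the authoritative definition lives in testing/density/config.yaml in kubernetes/perf-tests:
```yaml
# Hedged sketch only: layout follows the usual clusterloader2 measurement pattern;
# the real values are in testing/density/config.yaml (kubernetes/perf-tests).
steps:
- name: Starting measurements          # assumed step name
  measurements:
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: start                    # begin sampling scheduler throughput
# ... pod-creation phases run here ...
- name: Collecting measurements        # assumed step name
  measurements:
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: gather                   # compute throughput and compare to the threshold
      threshold: 90                    # assumed parameter; matches the 90 pods/s in the error above
```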
Testgrid link:
- https://testgrid.k8s.io/google-gce#gce-cos-master-scalability-100&width=5
- https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce-100-performance&width=5
Anything else we need to know:
Looking just at periodic jobs (not presubmit), the number of failures has been steadily rising since May:
- 3: https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-05-30&text=scheduler throughput
- 6: https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-06-06&text=scheduler throughput
- 11: https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-06-13&text=scheduler throughput
- 16: https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-06-20&text=scheduler throughput
- 19: https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-06-27&text=scheduler throughput
- 26: https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-07-04&text=scheduler throughput
- 34: https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-07-11&text=scheduler throughput
- 65: https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-07-18&text=scheduler throughput
/sig scalability
/sig scheduling
About this issue
- State: closed
- Created 4 years ago
- Comments: 35 (35 by maintainers)
With https://github.com/kubernetes/test-infra/pull/18464 and https://github.com/kubernetes/test-infra/pull/18463, this should now be fixed.
Speaking of that, I just noticed today that the job doesn’t request CPU for the test pod, so it can easily be starved. We noticed the test pod was getting assigned the default request of 250m in https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/93307/pull-kubernetes-e2e-gce-100-performance/1285712793332355072/ (see podinfo.json in the artifacts).
Shouldn’t the perf test be requesting a specific amount of CPU for its test pod (and maybe even limiting to that amount to make results over time comparable)?
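For illustration, a minimal sketch of what an explicit CPU request (and matching limit) could look like in the job's pod spec; the container name, image, and sizes below are hypothetical, not the job's actual configuration:
```yaml
spec:
  containers:
  - name: test                                        # hypothetical container name
    image: gcr.io/k8s-testimages/kubekins-e2e:latest  # hypothetical image reference
    resources:
      requests:
        cpu: "4"        # explicit request instead of the 250m default the pod gets today
        memory: 6Gi
      limits:
        cpu: "4"        # matching limit keeps the CPU available to the test constant across runs
        memory: 6Gi
```
Setting requests equal to limits also puts the pod in the Guaranteed QoS class, which makes it less likely to be starved or evicted under node pressure.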
Not that I’m aware of. If I’m reading http://perf-dash.k8s.io/#/?jobname=gce-100Nodes-master&metriccategoryname=APIServer&metricname=DensityResponsiveness_PrometheusSimple&Resource=pods&Scope=namespace&Subresource=binding&Verb=POST correctly, the latency of POST calls to pods/binding looks pretty stable over time.