kubernetes: Failing/Flaking Test: E2E: [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicationController Should scale from 5 pods to 3 pods and from 3 to 1 and verify decision stability

Test: https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gke-serial&show-stale-tests=

Example: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke-serial/7169

The HPA tests for gce-serial are flaking rather regularly; they have failed about 2/3 of the time over the last week and are often the cause of a failed test run. Can we de-flake these?

Over multiple test runs the problem seems to be the number of replicas jumping up to 4 unexpectedly:

Oct  3 11:39:13.153: INFO: ConsumeCPU URL: {https   35.232.126.216 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-qjvjw/services/rc-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=250&requestSizeMillicores=100 }
Oct  3 11:39:22.808: INFO: expecting there to be 3 replicas (are: 3)
Oct  3 11:39:32.778: INFO: expecting there to be 3 replicas (are: 4)
Oct  3 11:39:32.778: INFO: Unexpected error occurred: number of replicas changed unexpectedly

Is the limit not getting set correctly here? Is this an actual bug?

/sig autoscaling /priority important-soon /kind failing-test /kind flake

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 39 (31 by maintainers)

Most upvoted comments

The PR merged. Now let’s wait and see if it solves the problem.

On Friday I verified that the CPU usage generated by the resource consumer stays really close to the target value but oscillates slightly. To fix this, I will:

  • Increase CPU usage in the test (the resource consumer is implemented in a way that makes me think deviations from the target are of a fixed size, so a higher target would make the deviation a smaller percentage of the target).
  • Lower the generated load (the test currently targets the border between a recommendation of 3 instances and 4 instances; I will change the load to something between 2 and 3 instances).

I’m checking if this helps the test.
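
To illustrate why a load sitting on the border between 3 and 4 replicas is fragile, here is a minimal sketch of the documented HPA scaling rule, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), with the default 10% tolerance. The function name, the 100m per-pod target, and the load values are assumptions chosen for illustration, not numbers taken from the test; the real controller also accounts for pod readiness, missing metrics, and stabilization windows.

```go
package main

import (
	"fmt"
	"math"
)

// tolerance mirrors the HPA controller's default of 0.1: ratios within
// 10% of 1.0 leave the replica count unchanged.
const tolerance = 0.1

// desiredReplicas is a hypothetical helper applying the documented rule
// ceil(currentReplicas * currentMetric / targetMetric).
// totalUsageMilli is the CPU usage summed over all pods, in millicores;
// targetPerPodMilli is the assumed per-pod usage target, in millicores.
func desiredReplicas(currentReplicas int, totalUsageMilli, targetPerPodMilli float64) int {
	perPod := totalUsageMilli / float64(currentReplicas)
	ratio := perPod / targetPerPodMilli
	if math.Abs(ratio-1.0) <= tolerance {
		return currentReplicas // within tolerance: no scaling
	}
	return int(math.Ceil(float64(currentReplicas) * ratio))
}

func main() {
	// Assumed scenario: 3 replicas, 100m target per pod.
	for _, usage := range []float64{300, 335, 250, 285} {
		fmt.Printf("total usage %vm -> %d replicas\n",
			usage, desiredReplicas(3, usage, 100))
	}
}
```

Under these assumed numbers, a nominal 300m load keeps 3 replicas, but an overshoot to 335m pushes the ratio past the tolerance and yields 4. A nominal 250m load yields 3 replicas, and the same 35m overshoot (285m) still yields 3, which is the effect the lower load is meant to achieve.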