kubernetes: HPA counts pods that are not ready and doesn't take action

/kind bug

What happened: the HPA reports 0% CPU and takes no action, even though the Deployment has 0 available pods.

HPA:

NAME              REFERENCE                      TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
service-entrata   Deployment/service-entrata     0% / 25%          2         100       2          1d

Deployment:

NAME                        DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
service-entrata             2         2         2            0           4d

What you expected to happen:

The HPA should take action and start 2 new pods, since none of the existing ones are available.

How to reproduce it (as minimally and precisely as possible):

Create a deployment with a strict readiness probe (timeoutSeconds=1, periodSeconds=2) and drive CPU up until the pods become not ready.
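For reference, a minimal sketch of such a deployment and HPA (the image and probe endpoint are made up for illustration; on 1.7 the Deployment would use the extensions/v1beta1 API):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: service-entrata
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: service-entrata
    template:
      metadata:
        labels:
          app: service-entrata
      spec:
        containers:
        - name: app
          image: example.com/service-entrata:latest  # hypothetical image
          resources:
            requests:
              cpu: 100m          # CPU-based HPA needs a CPU request to compute a percentage
          readinessProbe:        # strict probe: goes unready as soon as the app is slow
            httpGet:
              path: /healthz     # hypothetical endpoint
              port: 8080
            timeoutSeconds: 1
            periodSeconds: 2
  ---
  apiVersion: autoscaling/v1
  kind: HorizontalPodAutoscaler
  metadata:
    name: service-entrata
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: service-entrata
    minReplicas: 2
    maxReplicas: 100
    targetCPUUtilizationPercentage: 25

Note that the CPU percentage target only works if the containers declare CPU requests, since utilization is computed against the request.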

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.7.3
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): alpine-node

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 2
  • Comments: 34 (14 by maintainers)

Most upvoted comments

Would it help people if I made the “ready-hpa” public?

I thought that was the whole point: I configured my deployment to “un-ready” its pods when they are being hammered and CPU goes up, and to wait 30s before the liveness check kills them, in the hope that the HPA would be fast enough to scale up.
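For concreteness, the liveness side of that setup might look like the sketch below (endpoint and port are hypothetical); the ~30s window comes from periodSeconds × failureThreshold:

  livenessProbe:
    httpGet:
      path: /live          # hypothetical endpoint
      port: 8080
    periodSeconds: 10
    failureThreshold: 3    # 3 failures x 10s = ~30s before the kubelet restarts the pod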

Ah, you have a fundamentally different definition of “unready” than we assume in the HPA. To deal with CPU spikes during pod initialization, we assume that unready pods will be back soon-ish, and that pods are probably unready because they haven’t finished starting up yet. This means that we don’t keep scaling up while new pods are starting, and we don’t accidentally over-scale because a pod consumes a lot of resources while initializing.

However, it also means (somewhat inadvertently) that when pods go from ready to unready, the HPA is much more conservative about scaling up new pods. I can definitely see how this behavior could be detrimental. I’m imagining a system like this:

  • a readiness probe checks that the maximum request processing time is less than some desired amount
  • one pod goes unready, more traffic is shifted to the rest
  • each existing pod goes unready in turn
  • HPA doesn’t add new pods, since it’s being fairly conservative
  • all pods become unready, and eventually ready again

Ideally, two factors mitigate this issue:

  1. if the other pods are experiencing sufficiently high CPU usage, the HPA should kick in anyway – it’s more conservative about scale-ups, but that doesn’t mean it won’t scale up at all.
  2. the HPA should kick in before this happens. This is easier to target once you have a good handle on CPU, or if you autoscale on the same metric that you use to determine readiness. For instance, with HPA v2, you could set your readiness probe to check “free space” in the workqueue, and then autoscale based on an expected queue length (e.g. go unready at a queue length of 10, have the HPA scale around a queue length of 5); see the sketch after this list.
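A rough sketch of that workqueue idea with the autoscaling/v2beta1 API (the metric name is hypothetical and assumes a custom-metrics adapter exposes a per-pod queue length):

  apiVersion: autoscaling/v2beta1
  kind: HorizontalPodAutoscaler
  metadata:
    name: service-entrata
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: service-entrata
    minReplicas: 2
    maxReplicas: 100
    metrics:
    - type: Pods
      pods:
        metricName: workqueue_length   # hypothetical custom metric
        targetAverageValue: "5"        # the readiness probe would go unready at ~10

The point is that the scaling target (5) sits well below the unready threshold (10), so the HPA acts before pods start dropping out of the endpoints.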

However, based on this bug, those two mitigating factors might not be enough. We may want to add some more explicit handling for pods that go unready because they’re “full”. @kubernetes/sig-autoscaling-feature-requests.

It’ll be in 1.12

Is this issue really solved? I have a very similar use case to the OP’s: we have very sudden traffic spikes, and on top of that we make use of persistent connections (SSE and WebSockets), which cause imbalances in resource consumption between the pods of the same deployment. As a consequence, we use readiness probes as a way to make pods temporarily unavailable, so that they stop receiving new requests for a little while and can cool down; otherwise they would just end up crashing from overload.

When we have a massive traffic spike, all our pods end up in the “unready” state before the HPA even realizes they are over the CPU threshold. The HPA then seems to ignore the unready pods, reports the overall CPU consumption of the deployment as <unknown>/threshold, and thus doesn’t scale up even though the deployment is completely overloaded.

EDIT: after more tests, the behaviour I observed was apparently linked to the --horizontal-pod-autoscaler-cpu-initialization-period parameter, which discards unready pods from the CPU calculations during the first 5 minutes of their lives. After that 5-minute period, the HPA works as expected.
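For anyone hitting the same thing: these are kube-controller-manager flags (introduced in 1.12), shown below with their default values; shortening the initialization period trades startup-spike protection for faster reaction to pods that go unready under load.

  kube-controller-manager \
    --horizontal-pod-autoscaler-cpu-initialization-period=5m0s \
    --horizontal-pod-autoscaler-initial-readiness-delay=30s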

This is still an issue; I think I have managed to cut the tests down to demonstrate it: https://gist.github.com/mgazza/f01d03464ac480a2de1fa9f6edeee1f6#file-horizontal_test-go