serving: pod count drops sharply when the panic window passes
In what area(s)?
/area autoscale
Ask your question here:
We are testing Knative autoscaling with a simple hello-world Go application, trying to scale out as many pods as possible to check the behavior under an extreme situation.
We set CC=10 and use vegeta to generate load at a rate of 3000/s (since it is hello-world, the rate here is requests per second, not concurrency).
The queue-proxy / user-container CPU limits are set to 100m and the memory limits to 1Gi.
tbc=0, to ensure the activator is not in the request path.
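For reference, an equivalent load driver using vegeta's Go library (the service URL below is a placeholder for the helloworld route) would look roughly like this:

package main

import (
    "fmt"
    "time"

    vegeta "github.com/tsenart/vegeta/v12/lib"
)

func main() {
    // 3000 requests per second, matching the rate used in this test.
    rate := vegeta.Rate{Freq: 3000, Per: time.Second}
    duration := 2 * time.Minute
    targeter := vegeta.NewStaticTargeter(vegeta.Target{
        Method: "GET",
        // Placeholder: the route of the helloworld Knative service.
        URL: "http://helloworld-go.default.example.com",
    })
    attacker := vegeta.NewAttacker()

    var metrics vegeta.Metrics
    for res := range attacker.Attack(targeter, rate, duration, "helloworld-load") {
        metrics.Add(res)
    }
    metrics.Close()

    fmt.Printf("success ratio: %.2f, p99 latency: %s\n", metrics.Success, metrics.Latencies.P99)
}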
See the picture below:

- The yellow line is the stable request concurrency of the target application, fetched from the autoscaler.
- The blue line is the panic request concurrency of the target application, fetched from the autoscaler.
- The purple line is the desired pod count reported by the autoscaler.
- The green line is the running pod count from kubectl get pods | grep -c Running.
The issue I would like to raise here is: why does the pod count drop sharply and then increase again? Can we avoid fluctuation like that?
Let us look at the picture in more detail.
First, when the workload increases, the panic concurrency becomes very high, and up to 204 pods are added and running during the panic window.
Then, with 200+ pods running, the incoming requests are consumed by the application very quickly, which results in very low concurrency, so the panic concurrency decreases. After a while, even the stable request concurrency decreases.
After 60 seconds, the panic window expires and the desired pod count is computed from the stable request concurrency. But because even the stable concurrency is no longer high, the desired pod count drops quickly: https://github.com/knative/serving/blob/fde07b7db56b18a117ea36a5a0e978df427b1cff/pkg/autoscaler/scaling/autoscaler.go#L214-L222
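A simplified paraphrase of that logic (illustrative names and made-up concurrency values, not the verbatim linked code): while panicking, the autoscaler holds the pod count at the peak reached inside the panic window; once the window expires, only the stable metric drives the decision.

package main

import (
    "fmt"
    "math"
)

// desiredPods is an illustrative paraphrase of the linked decision, not the
// verbatim autoscaler.go code.
func desiredPods(stableConc, panicConc, target float64, panicking bool, maxPanicPods int32) int32 {
    desiredStable := int32(math.Ceil(stableConc / target))
    desiredPanic := int32(math.Ceil(panicConc / target))
    if panicking {
        // While panicking, never scale below the peak reached inside the panic window.
        if desiredPanic > maxPanicPods {
            maxPanicPods = desiredPanic
        }
        return maxPanicPods
    }
    // Once the panic window expires, the stable metric alone drives the decision,
    // so the desired count can fall steeply within a few evaluation ticks.
    return desiredStable
}

func main() {
    // Concurrency values here are made up to reproduce the counts discussed below.
    fmt.Println(desiredPods(2040, 2040, 10, true, 0)) // panicking: 204
    fmt.Println(desiredPods(510, 30, 10, false, 0))   // just un-panicked: 51
    fmt.Println(desiredPods(8, 2, 10, false, 0))      // a bit later: 1
}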
In the picture above, we see that the desired pod count is 204 at 23:22:02 with panic mode = 1.
Then the desired pod count is 51 at 23:22:07 (5 seconds later) with panic mode = 0.
Furthermore, the desired pod count is 1 at 23:22:19 (17 seconds later).

The desired pod setting then takes a little time to take effect. With far fewer pods, the panic concurrency spikes again by the time the running pod count has dropped to 38.
As a result, a new panic window cycle starts and repeats over and over. That is why I see the fluctuation.
Question:
Is it possible to enhance the autoscaler algorithm to avoid this unstable scenario?
For example, once the panic window is over, if the desired pod count is much lower than the current running pod count (e.g. less than 50% of it), add a compensation so the scale-down is not so aggressive.
The simplest possible form would be something like:
desiredPodCount := desiredStablePodCount
// Compensation: don't scale down below 50% (rounded up) of the currently ready pods.
if desiredPodCount < (originalReadyPodsCount+1)/2 {
    desiredPodCount = (originalReadyPodsCount + 1) / 2
}
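The same idea can also be expressed with a configurable maximum scale-down rate instead of a hard-coded 50% (the helper below is illustrative, not an existing Knative API):

package main

import (
    "fmt"
    "math"
)

// capScaleDown is an illustrative helper, not an existing Knative function: it
// floors the desired count at readyPods/maxScaleDownRate so that one decision
// can never remove more than the configured fraction of running pods.
func capScaleDown(desired, readyPods int32, maxScaleDownRate float64) int32 {
    floor := int32(math.Ceil(float64(readyPods) / maxScaleDownRate))
    if desired < floor {
        return floor
    }
    return desired
}

func main() {
    // With 204 ready pods and a max scale-down rate of 2 (at most halve per
    // decision), a raw desired count of 1 is capped at 102 this round.
    fmt.Println(capScaleDown(1, 204, 2)) // 102
}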
Any comments?
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 17 (17 by maintainers)
@cdlliuy pushed a new version to the same branch that applies the delay to the desiredPodCount rather than the stable metric (i.e. it applies to the output of the algorithm instead of the input). I think that will work better here because it will account for panic pods properly.
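A rough sketch of "applying the delay to the output" (a hypothetical helper, not Knative's actual implementation): remember the recent desired pod counts and report their maximum, so scale-up passes through immediately while scale-down waits for the peak to age out of the window.

package main

import (
    "fmt"
    "time"
)

// scaleDownDelayer is a hypothetical helper, not Knative's implementation.
type scaleDownDelayer struct {
    window  time.Duration
    samples []delaySample
}

type delaySample struct {
    t     time.Time
    value int32
}

// Record adds the latest desired pod count and returns the delayed value:
// the maximum desired count observed within the delay window.
func (d *scaleDownDelayer) Record(now time.Time, desired int32) int32 {
    d.samples = append(d.samples, delaySample{t: now, value: desired})
    cutoff := now.Add(-d.window)
    // Drop samples that have aged out of the delay window.
    for len(d.samples) > 0 && d.samples[0].t.Before(cutoff) {
        d.samples = d.samples[1:]
    }
    max := desired
    for _, s := range d.samples {
        if s.value > max {
            max = s.value
        }
    }
    return max
}

func main() {
    d := &scaleDownDelayer{window: 30 * time.Second}
    start := time.Now()
    fmt.Println(d.Record(start, 204))                   // 204
    fmt.Println(d.Record(start.Add(5*time.Second), 51)) // still 204: scale-down is delayed
    fmt.Println(d.Record(start.Add(45*time.Second), 1)) // 1: the peak has aged out
}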