keda: milli values scale incorrectly
I’m using the kafka ScaledObject. The HPA sees milli values and seems to interpret them incorrectly, scaling far too high.
In my case lagThreshold is set to “1”, and the HPA shows the target as 1438m/1 (avg).
What’s strange is that it scales replicas up to the maximum possible number, as though the value were “1438/1”, which is definitely incorrect!
Changing the target from “1” to “1000” seems to fix the issue, but this is definitely a bug.
```yaml
apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  name: myapp
  labels:
    deploymentName: myapp # must be in the same namespace as the ScaledObject
spec:
  scaleTargetRef:
    deploymentName: myapp # must be in the same namespace as the ScaledObject
  pollingInterval: 10  # Optional. Default: 30 seconds
  cooldownPeriod: 10   # 60 # Optional. Default: 300 seconds
  minReplicaCount: 1   # Optional. Default: 0
  maxReplicaCount: 20  # Optional. Default: 100
  triggers:
    - type: kafka
      metadata:
        brokerList: kafka-kafka-headless.kafka:9092
        consumerGroup: myconsumergroup
        topic: mytopic
        lagThreshold: "1"
```
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 20 (6 by maintainers)
Commits related to this issue
- Fix 'Cloudatch' typo and update casing to match AWS for v1.5 docs (#186) Signed-off-by: Tom Kerkhove <kerkhove.tom@gmail.com> — committed to preflightsiren/keda by tomkerkhove 4 years ago
So the issue of the “m” is really a non-issue. The “m” is simply the way the Kubernetes HPA represents a custom metric value with an implicit decimal: 1.5 = 1500m, that’s it. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#appendix-quantities
Why did they choose this? Working with whole numbers is easier; you just divide by 1000. There may also be an internationalization aspect: some parts of the world still use a comma instead of a period as the decimal separator (probably less common now), so Kubernetes takes a different approach and specifies numbers in a globally understood way, as whole numbers with suffixes.
Also this “m” is specific to Kubernetes and has nothing to do with Keda or any custom metrics application.
So when debugging your scaling, if you see an “m” in the target value, make sure you divide it by 1000 so that you are looking at the same value the HPA is using behind the scenes.
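For illustration, here is a minimal Go sketch of that conversion using the k8s.io/apimachinery/pkg/api/resource quantity type (this is just to show the “m” notation; it is not KEDA code):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// "1438m" is a Kubernetes quantity: the "m" suffix means thousandths,
	// so it represents 1.438, not 1438.
	q := resource.MustParse("1438m")

	fmt.Println(q.MilliValue())                // 1438 (value in milli-units)
	fmt.Println(float64(q.MilliValue()) / 1e3) // 1.438 (the value the HPA compares against the target)
}
```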
Using the replica count in the calculation is critical for the HPA to know how many pods to adjust up or down. If Keda queried the lag and got 43802 undelivered messages with 498 replicas, it would compute 43802 / 498 = 87.956, or 87956m. That is less than the target of 150: 88 / 150 = 59%, which means that about 41% of your pods at that time are surplus and could be terminated.
Conversely, if there were fewer replicas, say 200, the math would be 43802 / 200 = 219.01, or 219010m. 219 / 150 = 146%, so it needs about 46% more pods to handle the current number of messages, and it would scale up. That’s how the math works in both directions.
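To make the arithmetic concrete, here is a short Go sketch of the average-value calculation described above (it only reproduces the formula with the numbers from this example; it is not the HPA’s or KEDA’s actual code):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the HPA average-value formula:
//   average = totalMetric / currentReplicas
//   desired = ceil(currentReplicas * average / target)
// which simplifies to ceil(totalMetric / target).
func desiredReplicas(totalMetric float64, currentReplicas int, target float64) int {
	average := totalMetric / float64(currentReplicas) // e.g. 87.956, shown by the HPA as 87956m
	return int(math.Ceil(float64(currentReplicas) * average / target))
}

func main() {
	// 43802 undelivered messages, 498 replicas, target 150:
	// the average (~88) is ~59% of the target, so the HPA scales down.
	fmt.Println(desiredReplicas(43802, 498, 150)) // 293

	// Same lag with only 200 replicas: the average (~219) is ~146% of the
	// target, so the HPA scales up. Both cases converge on the same count.
	fmt.Println(desiredReplicas(43802, 200, 150)) // 293
}
```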
Hey guys, I think I can contribute more data to this issue. I’m facing the same problem with the Kafka scaler right now. My numbers are a bit different, but it’s the same issue. A few points:
kafka_scaler should print the “Group X has a lag of Y…” message at info level according to the code here, but I only saw those messages after changing the log level to debug.
I have a topic with 20 partitions, but in the keda-operator logs I can only see one message, for partition 0:
And these are the HPA metrics, captured at the same time:
I’m not sure why the metric is shown with an “m”, i.e. in millis…? But the lag for each partition is around that number (800–1000 messages of lag) and my minimum replica count is 10. So my guess was that the HPA formula gets the following: currentReplicas = 10, currentMetricValue = 881, desiredMetricValue = 7500, so desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)] = ceil[10 * (881 / 7500)] = 2, meaning we keep the minimum number of pods. The desiredMetricValue of 7500 is a workaround I used because I couldn’t figure out the calculation, but now it looks like a bug in the current metric value (reported by the Kafka scaler), since it only sees 1 partition instead of 20… Let me know if I can help or provide more details.
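For what it’s worth, here is a rough Go sketch of how the per-partition lag should add up (made-up offsets, purely illustrative, not the actual kafka_scaler code). It shows why counting only partition 0 of a 20-partition topic drastically undercounts the metric the HPA ends up seeing:

```go
package main

import "fmt"

// totalLag sums (latest offset - committed offset) over every partition,
// which is what the reported consumer-group lag should reflect.
func totalLag(latest, committed map[int32]int64) int64 {
	var lag int64
	for p, newest := range latest {
		lag += newest - committed[p]
	}
	return lag
}

func main() {
	// Hypothetical offsets: 20 partitions, roughly 900 messages of lag each.
	latest := map[int32]int64{}
	committed := map[int32]int64{}
	for p := int32(0); p < 20; p++ {
		latest[p] = 10000
		committed[p] = 9100
	}

	fmt.Println(totalLag(latest, committed)) // 18000 across all partitions
	fmt.Println(latest[0] - committed[0])    // 900 if only partition 0 is counted
}
```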