kubernetes: kube 1.13.5 - Nodes backed w/ CRI-O unable to properly autoscale pods due to metrics-server/kubelet issue
What happened:
I built a cluster with a mix of docker and cri-o nodes in order to A/B test the runtimes.
After observing the metrics-server logs and running some tests with the HPA, I determined that if the ‘seed’ pod landed on a cri-o node the HPA would be ineffective, so I dug in a bit.
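For reference, the test setup is more or less the standard php-apache HPA walkthrough from the docs; the exact image and flags below are the upstream example rather than copied from my manifests:

kubectl get nodes -o wide        # CONTAINER-RUNTIME column shows docker:// vs cri-o:// per node
kubectl run php-apache --image=k8s.gcr.io/hpa-example --requests=cpu=200m --expose --port=80
kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
kubectl get pods -l run=php-apache -o wide   # shows which node the ‘seed’ pod landed on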
Logs from the metrics-server contained large blocks of similar lines, all referencing cri-o nodes (here I’m pulling out the php-apache references, because that is the autoscaling sample pod I’m using):
unable to fully scrape metrics from source kubelet_summary:ip-172-27-190-56.us-west-2.compute.internal: [unable to get CPU for container "php-apache" in pod default/php-apache-5c7c7d8b44-5b594 on node "ip-172-27-190-56.us-west-2.compute.internal", discarding data: missing cpu usage metric,
and once I set up the HPA I see this as well:
E0417 14:54:27.571050 1 reststorage.go:144] unable to fetch pod metrics for pod default/php-apache-5c7c7d8b44-5b594: no metrics known for pod
I0417 14:54:27.571164 1 wrap.go:42] GET /apis/metrics.k8s.io/v1beta1/namespaces/default/pods?labelSelector=run%3Dphp-apache: (2.312ms) 200 [[hyperkube/v1.13.5 (linux/amd64) kubernetes/2166946/system:serviceaccount:kube-system:horizontal-pod-autoscaler] 25.128.9.0:43710]
I then kicked the metrics-server logging into high gear and saw the following output related to the pod:
"pods": [
{
"podRef": {
"name": "php-apache-5c7c7d8b44-5b594",
"namespace": "default",
"uid": "a1dd863a-611f-11e9-89c4-0af8e487e484"
},
"startTime": "2019-04-17T14:46:50Z",
"containers": [
{
"name": "php-apache",
"startTime": "2019-04-17T14:46:51Z",
"cpu": {
"time": "2019-04-17T14:54:17Z",
"usageCoreNanoSeconds": 197045014066
},
"memory": {
"time": "2019-04-17T14:54:17Z",
"workingSetBytes": 16564224
},
"rootfs": {
"time": "2019-04-17T14:54:17Z",
"availableBytes": 231747842048,
"capacityBytes": 250566086656,
"usedBytes": 36866,
"inodesFree": 65349953,
"inodes": 65536000,
"inodesUsed": 11
},
"logs": {
"time": "2019-04-17T14:54:16Z",
"availableBytes": 199483015168,
"capacityBytes": 210304475136,
"usedBytes": 262144,
"inodesFree": 13106664,
"inodes": 13107200,
"inodesUsed": 2
},
"userDefinedMetrics": null
}
],
"cpu": {
"time": "2019-04-17T14:54:12Z",
"usageNanoCores": 974840417,
"usageCoreNanoSeconds": 191864605193
},
"memory": {
"time": "2019-04-17T14:54:12Z",
"usageBytes": 22196224,
"workingSetBytes": 21954560,
"rssBytes": 7880704,
"pageFaults": 0,
"majorPageFaults": 0
},
"volume": [
{
"time": "2019-04-17T14:47:24Z",
"availableBytes": 84500426752,
"capacityBytes": 84500439040,
"usedBytes": 12288,
"inodesFree": 20629981,
"inodes": 20629990,
"inodesUsed": 9,
"name": "default-token-zskwg"
}
],
"ephemeral-storage": {
"time": "2019-04-17T14:54:17Z",
"availableBytes": 199483015168,
"capacityBytes": 210304475136,
"usedBytes": 299010,
"inodesFree": 13106664,
"inodes": 13107200,
"inodesUsed": 11
}
},
...
As you can see, usageNanoCores is present at pods[].cpu.usageNanoCores but absent from pods[].containers[].cpu. On docker nodes this is not the case.
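For anyone following along, that block is the kubelet stats summary that metrics-server scrapes; the same data can be pulled straight through the apiserver proxy (node name below is a placeholder):

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary"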
The HPA contains the following events:
Warning FailedGetResourceMetric 1s horizontal-pod-autoscaler unable to get metrics for resource cpu: no metrics returned from resource metrics API
Warning FailedComputeMetricsReplicas 1s horizontal-pod-autoscaler failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
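The same gap is visible from the resource metrics API side; these mirror the GET that shows up in the metrics-server log above:

kubectl top pods -l run=php-apache
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods?labelSelector=run%3Dphp-apache"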
I’ve found some related issues:
https://github.com/kubernetes/kubernetes/issues/71712 (I made the change described there, but it didn’t have any effect. I imagine that if it were actually the problem, I wouldn’t be receiving any stats at all?)
https://github.com/kubernetes/kubernetes/issues/75934
https://github.com/kubernetes/kubernetes/issues/72803
https://github.com/kubernetes-incubator/metrics-server/issues/172
and also these pull requests, which I think are meant to address #75934 above:
https://github.com/kubernetes/kubernetes/pull/73659
https://github.com/kubernetes/kubernetes/pull/74933
Both of these have been merged, but only into the 1.14 releases…
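(If it helps anyone checking release contents: with a kubernetes/kubernetes checkout, git can confirm which release tags contain a given merge; the SHA below is just a placeholder for the PR’s merge commit.)

git tag --contains <merge-commit-sha> | grep v1.13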
I’m about to spin up a 1.14 cluster for testing, but assuming it’s all good there, what’s the prospect of getting these changes backported to 1.13.x and possibly 1.12.x?
What you expected to happen:
I expect that the kubelet/cri-o/metrics-server/hpa controller would all work together as expected.
How to reproduce it (as minimally and precisely as possible):
Create a 1.13.5 cluster with metrics-server and nodes backed by cri-o, apply an HPA, and watch it fail to work.
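When it fails, the HPA typically just reports unknown targets, e.g.:

kubectl get hpa php-apache        # TARGETS column shows <unknown>/50%
kubectl describe hpa php-apache   # shows the FailedGetResourceMetric events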
Anything else we need to know?:
I just built another test cluster running 1.14.1 and cri-o 1.14, and I’m still experiencing the same events on the HPA; I’m digging in to see whether the metrics endpoint on the kubelet is still exhibiting the same behavior.
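To isolate the per-container CPU block while digging, something like this jq filter works (node and pod names are placeholders):

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" \
  | jq '.pods[] | select(.podRef.name | startswith("php-apache")) | .containers[].cpu'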
Environment:
- Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: aws
- OS (e.g. cat /etc/os-release): Container Linux by CoreOS 2023.4.0 (Rhyolite)
- Kernel (e.g. uname -a): N/A
- Install tools: custom bootstrap w/ bootkube
- Others:
/sig node
/sig instrumentation
Update: I put up a new 1.14.1 cluster just to confirm that I wasn’t crazy, and I wasn’t; clearly something I did the first time was in error.
So yeah, 1.14.1 + metrics-server + cri-o seems just fine now…