kube 1.13.5: nodes backed with CRI-O unable to properly autoscale pods due to metrics-server/kubelet issue

What happened:

I built a cluster with a mix of docker and cri-o nodes in order to A/B test the runtimes.

After observing the metrics-server logs and running some tests with the HPA, I determined that whenever the 'seed' pod landed on a cri-o node the HPA would be ineffective. I dug in a bit.
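For context, the test setup is essentially the stock php-apache HPA walkthrough from the docs, so it's easy to check which runtime the pod landed on. Roughly (from memory, so adjust image names and flags as needed):

    # create the sample deployment + service and attach an HPA
    kubectl run php-apache --image=k8s.gcr.io/hpa-example --requests=cpu=200m --expose --port=80
    kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10

    # the NODE column tells you whether the pod is on a docker or cri-o node
    kubectl get pods -l run=php-apache -o wide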

Logs from the metrics-server contained large blocks of similar lines, all referencing cri-o nodes (in this case I'm pulling out the php-apache references because that is the autoscaling sample pod I'm using):

unable to fully scrape metrics from source kubelet_summary:ip-172-27-190-56.us-west-2.compute.internal: [unable to get CPU for container "php-apache" in pod default/php-apache-5c7c7d8b44-5b594 on node "ip-172-27-190-56.us-west-2.compute.internal", discarding data: missing cpu usage metric,

and once I set up the HPA I saw this as well:

E0417 14:54:27.571050       1 reststorage.go:144] unable to fetch pod metrics for pod default/php-apache-5c7c7d8b44-5b594: no metrics known for pod
I0417 14:54:27.571164       1 wrap.go:42] GET /apis/metrics.k8s.io/v1beta1/namespaces/default/pods?labelSelector=run%3Dphp-apache: (2.312ms) 200 [[hyperkube/v1.13.5 (linux/amd64) kubernetes/2166946/system:serviceaccount:kube-system:horizontal-pod-autoscaler] 25.128.9.0:43710]
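That GET is easy to replay by hand if you want to see exactly what the HPA controller is getting back from the aggregated metrics API:

    kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods?labelSelector=run%3Dphp-apache"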

I then kicked the metrics-server logging into high gear and saw the following output related to the pod:

"pods": [
   {
    "podRef": {
     "name": "php-apache-5c7c7d8b44-5b594",
     "namespace": "default",
     "uid": "a1dd863a-611f-11e9-89c4-0af8e487e484"
    },
    "startTime": "2019-04-17T14:46:50Z",
    "containers": [
     {
      "name": "php-apache",
      "startTime": "2019-04-17T14:46:51Z",
      "cpu": {
       "time": "2019-04-17T14:54:17Z",
       "usageCoreNanoSeconds": 197045014066
      },
      "memory": {
       "time": "2019-04-17T14:54:17Z",
       "workingSetBytes": 16564224
      },
      "rootfs": {
       "time": "2019-04-17T14:54:17Z",
       "availableBytes": 231747842048,
       "capacityBytes": 250566086656,
       "usedBytes": 36866,
       "inodesFree": 65349953,
       "inodes": 65536000,
       "inodesUsed": 11
      },
      "logs": {
       "time": "2019-04-17T14:54:16Z",
       "availableBytes": 199483015168,
       "capacityBytes": 210304475136,
       "usedBytes": 262144,
       "inodesFree": 13106664,
       "inodes": 13107200,
       "inodesUsed": 2
      },
      "userDefinedMetrics": null
     }
    ],
    "cpu": {
     "time": "2019-04-17T14:54:12Z",
     "usageNanoCores": 974840417,
     "usageCoreNanoSeconds": 191864605193
    },
    "memory": {
     "time": "2019-04-17T14:54:12Z",
     "usageBytes": 22196224,
     "workingSetBytes": 21954560,
     "rssBytes": 7880704,
     "pageFaults": 0,
     "majorPageFaults": 0
    },
    "volume": [
     {
      "time": "2019-04-17T14:47:24Z",
      "availableBytes": 84500426752,
      "capacityBytes": 84500439040,
      "usedBytes": 12288,
      "inodesFree": 20629981,
      "inodes": 20629990,
      "inodesUsed": 9,
      "name": "default-token-zskwg"
     }
    ],
    "ephemeral-storage": {
     "time": "2019-04-17T14:54:17Z",
     "availableBytes": 199483015168,
     "capacityBytes": 210304475136,
     "usedBytes": 299010,
     "inodesFree": 13106664,
     "inodes": 13107200,
     "inodesUsed": 11
    }
   },
...

As you can see, usageNanoCores is present at pods[].cpu.usageNanoCores but absent from pods[].containers[].cpu. On docker nodes this is not the case: the per-container cpu block carries usageNanoCores as well.
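For anyone who wants to compare runtimes on their own cluster, the dump above is the kubelet Summary API output that metrics-server scrapes; it can be pulled straight through the apiserver proxy (node name taken from the logs above):

    # fetch the summary metrics-server sees and compare
    # pods[].containers[].cpu between docker and cri-o nodes
    kubectl get --raw "/api/v1/nodes/ip-172-27-190-56.us-west-2.compute.internal/proxy/stats/summary"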

The HPA contains the following events:

  Warning  FailedGetResourceMetric       1s    horizontal-pod-autoscaler  unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  1s    horizontal-pod-autoscaler  failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
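Those events are visible by describing the HPA directly:

    kubectl describe hpa php-apache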

I’ve found some reference issues:

https://github.com/kubernetes/kubernetes/issues/71712 (I made this change, but it didn't have any effect. I imagine if it were actually the problem I wouldn't receive any stats at all?)

https://github.com/kubernetes/kubernetes/issues/75934

https://github.com/kubernetes/kubernetes/issues/72803

https://github.com/kubernetes-incubator/metrics-server/issues/172

and also these pull requests, which I think are meant to address #75934 above:

https://github.com/kubernetes/kubernetes/pull/73659

https://github.com/kubernetes/kubernetes/pull/74933

Both have been merged, but only into the 1.14 releases…

I'm about to spin up a 1.14 cluster for tests. But assuming it's all good there, what's the prospect of getting these changes backported into 1.13.x and possibly 1.12.x?

What you expected to happen:

I expected the kubelet, cri-o, metrics-server, and the HPA controller to all work together: pods on cri-o nodes should autoscale just like pods on docker nodes.

How to reproduce it (as minimally and precisely as possible):

Create a 1.13.5 cluster with metrics-server deployed and nodes backed by cri-o. Apply an HPA and watch it fail to scale.
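A concrete sketch, using the stock sample as above (same caveat: from memory):

    # sample app + HPA
    kubectl run php-apache --image=k8s.gcr.io/hpa-example --requests=cpu=200m --expose --port=80
    kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10

    # drive some load from a second pod...
    kubectl run -i --tty load-generator --image=busybox /bin/sh
    # ...then inside that shell:
    while true; do wget -q -O- http://php-apache; done

    # if the php-apache pod is on a cri-o node, TARGETS should stay
    # <unknown> instead of reporting CPU and scaling up
    kubectl get hpa php-apache --watch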

Anything else we need to know?:

I just built another test cluster running 1.14.1 and cri-o 1.14, and I'm still experiencing the same events on the HPA. I'm digging in to see whether the metrics endpoint on the kubelet still exhibits the same behavior.

Environment:

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: aws
  • OS (e.g: cat /etc/os-release): Container Linux by CoreOS 2023.4.0 (Rhyolite)
  • Kernel (e.g. uname -a): N/A
  • Install tools: custom bootstrap w/ bootkube
  • Others:

/sig node
/sig instrumentation

Most upvoted comments

So I put up a new 1.14.1 cluster just to confirm that I wasn't crazy, and I wasn't. Clearly something I did the first time was in error.

So yeah, 1.14.1 + metrics-server + cri-o seems just fine now…