metrics-server: Metrics not available in some namespaces (while /proxy/stats/summary shows all metrics)
What happened: Metrics not available for pods in some namespaces, though raw metrics are available.
We noticed that we have PodMetrics for all pods in certain namespaces and none at all for pods in others. E.g.
$ k top pod -n monitoring
NAME                                                      CPU(cores)   MEMORY(bytes)
alertmanager-promop-kube-prometheus-sta-alertmanager-0    1m           30Mi
promop-grafana-57d98d4d97-bjstt                           2m           278Mi
promop-kube-prometheus-sta-operator-74b7b4c58f-7n6qc      1m           39Mi
promop-kube-state-metrics-855f4f596f-l8hkc                2m           26Mi
promop-prometheus-node-exporter-8km7m                     1m           11Mi
promop-prometheus-node-exporter-clkk5                     11m          32Mi
promop-prometheus-node-exporter-tmzrd                     6m           25Mi
$ k top pod -n dummy-app-1
error: Metrics not available for pod dummy-app-1/adservice-6c498b7f49-mr4wx, age: 12h10m58.089684631s
However, when we query the raw stats for that namespace through the Kubernetes API, we see that there are metrics:
kubectl get --raw /api/v1/nodes/worker1/proxy/stats/summary | jq '.pods[] | select(.podRef.namespace == "dummy-app-1") | {name: .podRef.name, namespace: .podRef.namespace, containers: .containers }'
...
{
  "name": "currencyservice-5b648f7477-sgqfs",
  "namespace": "dummy-app-1",
  "containers": [
    {
      "name": "server",
      "startTime": "2022-03-03T21:54:22Z",
      "cpu": {
        "time": "2022-03-04T10:05:42Z",
        "usageNanoCores": 2694265,
        "usageCoreNanoSeconds": 139218220640
      },
      "memory": {
        "time": "2022-03-04T10:05:42Z",
        "availableBytes": 106565632,
        "usageBytes": 30085120,
        "workingSetBytes": 27652096,
        "rssBytes": 25960448,
        "pageFaults": 11943030,
        "majorPageFaults": 0
      },
      "rootfs": {
        "time": "2022-03-04T10:05:42Z",
        "availableBytes": 3105189888,
        "capacityBytes": 16776077312,
        "usedBytes": 24576,
        "inodesFree": 874723,
        "inodes": 1048576,
        "inodesUsed": 7
      },
      "logs": {
        "time": "2022-03-04T10:05:42Z",
        "availableBytes": 3105189888,
        "capacityBytes": 16776077312,
        "usedBytes": 5111808,
        "inodesFree": 874723,
        "inodes": 1048576,
        "inodesUsed": 173853
      }
    },
    {
      "name": "linkerd-proxy",
      "startTime": "2022-03-03T21:54:22Z",
      "cpu": {
        "time": "2022-03-04T10:05:38Z",
        "usageNanoCores": 641442,
        "usageCoreNanoSeconds": 35109315232
      },
      "memory": {
        "time": "2022-03-04T10:05:38Z",
        "availableBytes": 262320128,
        "usageBytes": 6385664,
        "workingSetBytes": 6115328,
        "rssBytes": 4456448,
        "pageFaults": 6204,
        "majorPageFaults": 0
      },
      "rootfs": {
        "time": "2022-03-04T10:05:38Z",
        "availableBytes": 3105189888,
        "capacityBytes": 16776077312,
        "usedBytes": 45056,
        "inodesFree": 874723,
        "inodes": 1048576,
        "inodesUsed": 14
      },
      "logs": {
        "time": "2022-03-04T10:05:38Z",
        "availableBytes": 3105189888,
        "capacityBytes": 16776077312,
        "usedBytes": 4096,
        "inodesFree": 874723,
        "inodes": 1048576,
        "inodesUsed": 173853
      }
    },
    {
      "name": "linkerd-init",
      "startTime": "2022-03-03T21:54:21Z",
      "cpu": {
        "time": "2022-03-04T10:05:39Z",
        "usageNanoCores": 0,
        "usageCoreNanoSeconds": 0
      },
      "memory": {
        "time": "2022-03-04T10:05:39Z",
        "workingSetBytes": 0
      },
      "rootfs": {
        "time": "2022-03-04T10:05:39Z",
        "availableBytes": 3105189888,
        "capacityBytes": 16776077312,
        "usedBytes": 8192,
        "inodesFree": 874723,
        "inodes": 1048576,
        "inodesUsed": 3
      },
      "logs": {
        "time": "2022-03-04T10:05:39Z",
        "availableBytes": 3105189888,
        "capacityBytes": 16776077312,
        "usedBytes": 8192,
        "inodesFree": 874723,
        "inodes": 1048576,
        "inodesUsed": 173853
      }
    }
  ]
}
...
We get metrics for all pods in that namespace through raw stats.
The metrics server does not give us any relevant logs to explain this behavior; it is as if the metrics server is not aware of these pods.
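One way to double-check that the gap is in metrics-server itself (and not in kubectl top) is to query the Metrics API directly. This is a suggested check, not something from the original report; the namespace names are the ones used in the examples above:

kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/dummy-app-1/pods | jq '.items[].metadata.name'
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/monitoring/pods | jq '.items[].metadata.name'

If the first call returns an empty items list while the second lists the monitoring pods, the missing data is on the metrics-server side rather than in kubectl.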
To try and resolve this I have:
- reinstalled metrics-server and its RBAC rules (moving from a raw deployment to a Helm install, with the values supplied below)
- restarted the crio and kubelet services on all nodes (see the commands below)
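For reference, the restart step amounts to something like this on each node (a sketch, assuming systemd-managed crio and kubelet units; unit names may differ per setup):

# run on every node
sudo systemctl restart crio
sudo systemctl restart kubelet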
What you expected to happen: All metrics to be available, or an error message in the metrics-server logs explaining why they are not.
Environment:
- Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): bare-metal cluster set up with kubeadm (with the CRI-O container runtime)
- Container Network Setup (flannel, calico, etc.): flannel
- Kubernetes version (use kubectl version): 1.22.3
- Metrics Server manifest:
spoiler for Metrics Server manifest:
Installed through Helm with the following values file:
defaultArgs:
- --cert-dir=/tmp
- --kubelet-preferred-address-types=InternalIP
- --kubelet-use-node-status-port
- --metric-resolution=15s
- --kubelet-insecure-tls
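For reference, values like these would typically be applied with something along these lines (a sketch; the repo alias, release name and values file name are assumptions chosen to match the helm list output below):

helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm upgrade --install metrics-server metrics-server/metrics-server \
  --namespace kube-system \
  --values values.yaml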
$ helm list
NAME             NAMESPACE     REVISION   UPDATED                                     STATUS     CHART                  APP VERSION
metrics-server   kube-system   4          2022-03-04 10:25:47.046764861 +0100 CET     deployed   metrics-server-3.8.2   0.6.1
- Kubelet config:
spoiler for Kubelet config:
- Metrics server logs:
spoiler for Metrics Server logs:
I0304 10:00:50.888860 1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0304 10:00:51.457667 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0304 10:00:51.457734 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0304 10:00:51.457682 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0304 10:00:51.457760 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0304 10:00:51.457783 1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0304 10:00:51.457691 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0304 10:00:51.457824 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0304 10:00:51.457917 1 secure_serving.go:266] Serving securely on :4443
I0304 10:00:51.457942 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
W0304 10:00:51.457982 1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I0304 10:00:51.558835 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0304 10:00:51.558863 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0304 10:00:51.558909 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
- Status of Metrics API:
spoiler for Status of Metrics API:
kubectl describe apiservice v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/instance=metrics-server
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=metrics-server
              app.kubernetes.io/version=0.6.1
              helm.sh/chart=metrics-server-3.8.2
Annotations:  meta.helm.sh/release-name: metrics-server
              meta.helm.sh/release-namespace: kube-system
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2022-03-04T09:03:33Z
  Managed Fields:
    API Version:  apiregistration.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:meta.helm.sh/release-name:
          f:meta.helm.sh/release-namespace:
        f:labels:
          .:
          f:app.kubernetes.io/instance:
          f:app.kubernetes.io/managed-by:
          f:app.kubernetes.io/name:
          f:app.kubernetes.io/version:
          f:helm.sh/chart:
      f:spec:
        f:group:
        f:groupPriorityMinimum:
        f:insecureSkipTLSVerify:
        f:service:
          .:
          f:name:
          f:namespace:
          f:port:
        f:version:
        f:versionPriority:
    Manager:      helm
    Operation:    Update
    Time:         2022-03-04T09:03:33Z
    API Version:  apiregistration.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          .:
          k:{"type":"Available"}:
            .:
            f:lastTransitionTime:
            f:message:
            f:reason:
            f:status:
            f:type:
    Manager:         kube-apiserver
    Operation:       Update
    Subresource:     status
    Time:            2022-03-04T09:03:33Z
  Resource Version:  14167942
  UID:               2255504c-91ab-4240-881f-ab8f8a1d9eb5
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:        metrics-server
    Namespace:   kube-system
    Port:        443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2022-03-04T09:36:24Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:  <none>
/kind bug
About this issue
- State: closed
- Created 2 years ago
- Comments: 20 (12 by maintainers)
@MisterTimn This is expected; the problem is caused by the Kubelet reporting zeroed metrics for init containers that are no longer running. Metrics Server discards the whole pod, assuming that the pod is broken/crashlooping etc. However, it is expected that the Kubelet/container runtime will clean up init containers and no longer report them.
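As a rough illustration of that behaviour, pods affected this way can be spotted in the raw summary by selecting pods where any container reports zero CPU or working-set usage. The node name worker1 and the jq expression below are examples built on the query shown earlier, not part of the original report:

kubectl get --raw /api/v1/nodes/worker1/proxy/stats/summary \
  | jq '.pods[] | select([.containers[]? | (.cpu.usageNanoCores // 0) == 0 or (.memory.workingSetBytes // 0) == 0] | any) | .podRef.namespace + "/" + .podRef.name'

In the output above, the linkerd-init entry with usageNanoCores: 0 and workingSetBytes: 0 is exactly such a container.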
I expect that the empty linkerd-init container metrics are responsible for Metrics Server discarding the whole pod. To avoid reporting invalid metrics for a pod (container restarting, metrics not yet collected), MS will ignore pods where any container reports zero values for usage. See:
So the problem is not related to the namespace itself, but to the linkerd mesh being enabled only in certain namespaces. If you want to confirm it in the logs you need to increase log verbosity, to at least -v=2 or more; the container-skip logs require a high verbosity setting because they can also occur in normal situations when a container starts.
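With the Helm-based deployment shown above, one way to bump verbosity without a full chart upgrade is a JSON patch on the deployment (a sketch; the deployment name and namespace are taken from the helm list output, and --v=2 mirrors the suggestion above):

kubectl -n kube-system patch deployment metrics-server --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--v=2"}]'

Alternatively, add --v=2 to defaultArgs in the values file and run helm upgrade again.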
As for a fix, the issue here is that I would not expect the Kubelet to report metrics for an init container after the container has stopped. However, this is hard to address solely on the MS side, as it is a CRI/Kubelet issue.
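To see whether the runtime is still tracking the stopped init containers on a given node, something like the following could be run there (a sketch; crictl must be pointed at the CRI-O socket, and linkerd-init is the container name from the stats output above):

sudo crictl ps -a --name linkerd-init --state exited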