prometheus-adapter: Prometheus adapter pod unable to get node metrics
What happened?
Deployed Prometheus Adapter (v0.8.4) via Helm chart on EKS (v1.18.16-eks-7737de) with 2 replicas.
One replica returns a result for kubectl top nodes:
$ kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-0-4-227.****** 4198m 53% 1289Mi 8%
ip-10-0-4-79.****** 1788m 22% 934Mi 6%
ip-10-0-5-164.****** 4379m 55% 903Mi 6%
ip-10-0-5-85.****** 1666m 21% 926Mi 6%
ip-10-0-6-142.****** 3768m 47% 842Mi 5%
ip-10-0-6-209.****** 1654m 20% 908Mi 6%
but the other replica throws the following error:
$ kubectl top nodes
error: metrics not available yet
The logs from that replica:
$ kubectl -n monitoring-adapter logs prometheus-adapter-57d96ff446-97wbw -f
.
.
.
I0519 14:08:18.929794 1 handler.go:143] prometheus-metrics-adapter: GET "/apis/metrics.k8s.io/v1beta1/nodes" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0519 14:08:18.931997 1 api.go:74] GET http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%28%28node_memory_MemTotal_bytes%7Bjob%3D%22node-exporter%22%7D+-+node_memory_MemAvailable_bytes%7Bjob%3D%22node-exporter%22%7D%29+%2A+on+%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29&time=1621433298.929 200 OK
I0519 14:08:18.932353 1 api.go:74] GET http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%281+-+irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D%29+%2A+on%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29&time=1621433298.929 200 OK
I0519 14:08:18.932759 1 provider.go:282] missing memory for node "ip-10-0-4-227.******", skipping
I0519 14:08:18.932775 1 provider.go:282] missing memory for node "ip-10-0-4-79.******", skipping
I0519 14:08:18.932780 1 provider.go:282] missing memory for node "ip-10-0-5-164.******", skipping
I0519 14:08:18.932785 1 provider.go:282] missing memory for node "ip-10-0-5-85.******", skipping
I0519 14:08:18.932790 1 provider.go:282] missing memory for node "ip-10-0-6-142.******", skipping
I0519 14:08:18.932796 1 provider.go:282] missing memory for node "ip-10-0-6-209.******", skipping
I0519 14:08:18.932905 1 httplog.go:89] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1/nodes" latency="3.582715ms" userAgent="kubectl/v1.18.0 (darwin/amd64) kubernetes/9e99141" srcIP="10.0.6.74:39976" resp=200
.
.
.
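For readability, the two URL-encoded queries the adapter issued above decode to the following PromQL (shown here as a small YAML snippet purely for this write-up; the key names memory_node_query and cpu_node_query are just labels, and line breaks were added without changing the queries):

# URL-decoded form of the adapter's two node queries from the log excerpt above.
memory_node_query: |
  sum((node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemAvailable_bytes{job="node-exporter"})
    * on (namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{}) by (node)
cpu_node_query: |
  sum(1 - irate(node_cpu_seconds_total{mode="idle"}[5m])
    * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{}) by (node)

Both queries group by the node label, so the adapter expects the node name to appear in that label of the results.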
Manually running the same query (the api.go request at 14:08:18.931997 from the logs) against the Prometheus server from inside both replicas returns the same result:
$ kubectl -n monitoring-adapter exec -it prometheus-adapter-57d96ff446-97wbw sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
/ $ wget -qO- http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%28%28node_memory_MemTotal_bytes%7Bjob%3D%22node-exporter%22%7D+-+node_memory_MemAvailable_bytes%7Bjob%3D%22node-exporter%22%7D%29+%2A+on+%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"node":"ip-10-0-4-227.******"},"value":[1621431478.914,"1353449472"]},{"metric":{"node":"ip-10-0-4-79.******"},"value":[1621431478.914,"1070182400"]},{"metric":{"node":"ip-10-0-5-164.******"},"value":[1621431478.914,"1006329856"]},{"metric":{"node":"ip-10-0-5-85.******"},"value":[1621431478.914,"938311680"]},{"metric":{"node":"ip-10-0-6-142.******"},"value":[1621431478.914,"877047808"]},{"metric":{"node":"ip-10-0-6-209.******"},"value":[1621431478.914,"956456960"]}]}}/ $
Did you expect to see something different?
Both replicas should be able to return the node metrics for kubectl top nodes when the node query is working fine.
How to reproduce it (as minimally and precisely as possible): Not really sure. I deleted the pod and the issue went away, but it still happens every now and then (usually with new pods?).
Environment
- Kubernetes version information:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.16-eks-7737de", GitCommit:"7737de131e58a68dda49cdd0ad821b4cb3665ae8", GitTreeState:"clean", BuildDate:"2021-03-10T21:33:25Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster kind:
AWS EKS
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 25 (6 by maintainers)
This issue is still present in v0.10.0.
edit: Got this working. Running v0.10.0 on EKS 1.23, using kube-prometheus-stack. I needed to add a relabeling config (https://github.com/prometheus-community/helm-charts/blob/0b928f341240c76d8513534035a825686ed28a4b/charts/kube-prometheus-stack/values.yaml#L471) to the ServiceMonitor for node-exporter. After that I used this form of the query (https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/deploy/manifests/config-map.yaml).
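A minimal sketch of that kind of relabeling, assuming the prometheus-node-exporter subchart's values layout inside kube-prometheus-stack (the exact key path and target label should be checked against the linked values.yaml for your chart version):

# kube-prometheus-stack values (sketch): relabel the node-exporter ServiceMonitor
# so scraped series carry the Kubernetes node name in a label the adapter's
# node queries can group and match on.
prometheus-node-exporter:
  prometheus:
    monitor:
      relabelings:
        - action: replace
          regex: (.*)
          replacement: $1
          sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node   # assumed target label; must match the adapter's node resource override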
For the cpu query, the labelMatchers should match node and not instance. As for memory, we have some relabeling in place in kube-prometheus for node-exporter: https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/node-exporter-serviceMonitor.yaml. I'll try to reproduce with your query, but with the one from kube-prometheus I wasn't able to so far.
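To make the label-matcher point concrete, here is a hypothetical, abridged excerpt of the adapter's resource rules; the <<.LabelMatchers>> and <<.GroupBy>> placeholders are the adapter's templating, and the query mirrors the one seen in the logs above:

# Sketch (not the deployed config): resources.overrides maps Prometheus labels to
# Kubernetes resources, so the node name must come back in the "node" label here.
resourceRules:
  cpu:
    nodeQuery: |
      sum(1 - irate(node_cpu_seconds_total{mode="idle"}[5m])
        * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}) by (<<.GroupBy>>)
    resources:
      overrides:
        node: {resource: node}            # match on "node", not "instance"
        namespace: {resource: namespace}
        pod: {resource: pod}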
I was facing the same problem with Amazon EKS version 1.21-eks.2, with both prometheus-server and prometheus-adapter installed using the community charts, following the example in the README. Following the workaround proposed by @junaid-ali, I was able to make it work by changing the association of the nodes resource to the label instance (instead of the original node) in my values file. After that, I'm now able to query resource metrics with kubectl top nodes/pods.
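A hypothetical sketch of what such a values override could look like for the community prometheus-adapter chart (the rules.resource layout and the illustrative queries are assumptions, not the commenter's actual file):

# prometheus-adapter Helm values (sketch): associate the Kubernetes node resource
# with the "instance" label instead of "node". This only works if the instance
# label actually carries the node name (e.g. via the relabeling discussed above).
rules:
  resource:
    cpu:
      nodeQuery: sum(1 - irate(node_cpu_seconds_total{mode="idle",<<.LabelMatchers>>}[5m])) by (<<.GroupBy>>)
      resources:
        overrides:
          instance: {resource: node}
          namespace: {resource: namespace}
          pod: {resource: pod}
    memory:
      nodeQuery: sum(node_memory_MemTotal_bytes{<<.LabelMatchers>>} - node_memory_MemAvailable_bytes{<<.LabelMatchers>>}) by (<<.GroupBy>>)
      resources:
        overrides:
          instance: {resource: node}
          namespace: {resource: namespace}
          pod: {resource: pod}
    window: 5m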
I undid the node overrides and put it back to how the README has it; that seems to have resolved the issue for me.
@nicraMarcin I don’t declare any additional rules
@dgrisonnet it’s only happening for nodes. Also, it always returns error: metrics not available yet for nodes (and not as an intermittent issue); on re-creating the Prometheus Adapter pod, the issue goes away.