prometheus-adapter: Prometheus adapter pod unable to get node metrics

What happened? I deployed Prometheus Adapter (v0.8.4) via the Helm chart on EKS (v1.18.16-eks-7737de) with 2 replicas.

One replica returns results for kubectl top nodes:

$ kubectl top nodes
NAME                   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-4-227.******   4198m        53%    1289Mi          8%
ip-10-0-4-79.******    1788m        22%    934Mi           6%
ip-10-0-5-164.******   4379m        55%    903Mi           6%
ip-10-0-5-85.******    1666m        21%    926Mi           6%
ip-10-0-6-142.******   3768m        47%    842Mi           5%
ip-10-0-6-209.******   1654m        20%    908Mi           6%

but when the request is served by the other replica, kubectl top nodes returns an error:

$ kubectl top nodes
error: metrics not available yet

The logs in that replica:

$ kubectl -n monitoring-adapter logs prometheus-adapter-57d96ff446-97wbw  -f
.
.
.
I0519 14:08:18.929794       1 handler.go:143] prometheus-metrics-adapter: GET "/apis/metrics.k8s.io/v1beta1/nodes" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0519 14:08:18.931997       1 api.go:74] GET http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%28%28node_memory_MemTotal_bytes%7Bjob%3D%22node-exporter%22%7D+-+node_memory_MemAvailable_bytes%7Bjob%3D%22node-exporter%22%7D%29+%2A+on+%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29&time=1621433298.929 200 OK
I0519 14:08:18.932353       1 api.go:74] GET http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%281+-+irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D%29+%2A+on%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29&time=1621433298.929 200 OK
I0519 14:08:18.932759       1 provider.go:282] missing memory for node "ip-10-0-4-227.******", skipping
I0519 14:08:18.932775       1 provider.go:282] missing memory for node "ip-10-0-4-79.******", skipping
I0519 14:08:18.932780       1 provider.go:282] missing memory for node "ip-10-0-5-164.******", skipping
I0519 14:08:18.932785       1 provider.go:282] missing memory for node "ip-10-0-5-85.******", skipping
I0519 14:08:18.932790       1 provider.go:282] missing memory for node "ip-10-0-6-142.******", skipping
I0519 14:08:18.932796       1 provider.go:282] missing memory for node "ip-10-0-6-209.******", skipping
I0519 14:08:18.932905       1 httplog.go:89] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1/nodes" latency="3.582715ms" userAgent="kubectl/v1.18.0 (darwin/amd64) kubernetes/9e99141" srcIP="10.0.6.74:39976" resp=200
.
.
.

Manually running the same query (api.go @ 14:08:18.931997 from the logs) against the Prometheus server from inside both replicas returns the same result:

$ kubectl -n monitoring-adapter exec -it prometheus-adapter-57d96ff446-97wbw sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
/ $ wget -qO- http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%28%28node_memory_MemTotal_bytes%7Bjob%3D%22node-exporter%22%7D+-+node_memory_MemAvailable_bytes%7Bjob%3D%22node-exporter%22%7D%29+%2A+on+%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"node":"ip-10-0-4-227.******"},"value":[1621431478.914,"1353449472"]},{"metric":{"node":"ip-10-0-4-79.******"},"value":[1621431478.914,"1070182400"]},{"metric":{"node":"ip-10-0-5-164.******"},"value":[1621431478.914,"1006329856"]},{"metric":{"node":"ip-10-0-5-85.******"},"value":[1621431478.914,"938311680"]},{"metric":{"node":"ip-10-0-6-142.******"},"value":[1621431478.914,"877047808"]},{"metric":{"node":"ip-10-0-6-209.******"},"value":[1621431478.914,"956456960"]}]}}/ $

Did you expect to see something different? Both replicas should be able to return node metrics for kubectl top nodes, since the underlying node queries work fine.

How to reproduce it (as minimally and precisely as possible): Not really sure. Deleting the pod makes the issue go away, but it still happens every now and then (usually with new pods?).

Environment

  • Kubernetes version information:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.16-eks-7737de", GitCommit:"7737de131e58a68dda49cdd0ad821b4cb3665ae8", GitTreeState:"clean", BuildDate:"2021-03-10T21:33:25Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes cluster kind:

AWS EKS

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 25 (6 by maintainers)

Most upvoted comments

This issue is still present in v0.10.0.

Edit: Got this working. Running v0.10.0 on EKS 1.23 with kube-prometheus-stack, I needed to add a relabeling config (https://github.com/prometheus-community/helm-charts/blob/0b928f341240c76d8513534035a825686ed28a4b/charts/kube-prometheus-stack/values.yaml#L471) to the ServiceMonitor for node-exporter:

  prometheus-node-exporter:
    prometheus:
      monitor:
        relabelings:
          - sourceLabels: [__meta_kubernetes_pod_node_name]
            separator: ;
            regex: ^(.*)$
            targetLabel: node
            replacement: $1
            action: replace

After that I used this form of the query (https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/deploy/manifests/config-map.yaml)
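
A quick way to confirm the relabeling took effect is to check that node-exporter series now carry the node label the adapter groups by. This is only a sketch; the Prometheus service and adapter pod names are reused from the logs earlier in this issue and may differ in your setup:

$ kubectl -n monitoring-adapter exec -it prometheus-adapter-57d96ff446-97wbw -- sh
/ $ wget -qO- http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=node_memory_MemTotal_bytes | grep -o '"node":"[^"]*"'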

For the cpu query, the labelMatchers should match node and not instance. As for memory, we have some relabeling in place in kube-prometheus for node-exporter: https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/node-exporter-serviceMonitor.yaml

I’ll try to reproduce with your query, but with the one from kube-prometheus I haven’t been able to so far.
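
For illustration, here is a values.yaml sketch of that mapping, reconstructed from the queries visible in the adapter logs above (an assumption, not the maintainer's exact config): the CPU nodeQuery joins node_cpu_seconds_total onto node_namespace_pod:kube_pod_info: to obtain the node label, and the overrides map that label to the node resource.

rules:
  resource:
    cpu:
      # Join idle-CPU rate onto kube_pod_info to pick up the "node" label (sketch).
      nodeQuery: sum(1 - irate(node_cpu_seconds_total{mode="idle"}[5m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}) by (<<.GroupBy>>)
      resources:
        overrides:
          # Map the "node" metric label (not "instance") to the node resource.
          node:
            resource: node
          namespace:
            resource: namespace
          pod:
            resource: pod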

I was facing the same problem with Amazon EKS version 1.21-eks.2, with both prometheus-server and prometheus-adapter installed using the community charts, following the example in the README. The versions are as below:

$ helm ls -n prometheus-system 
NAME                    NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                       APP VERSION
prometheus              prometheus-system       1               2021-10-12 01:27:02.127990141 +0000 UTC deployed        prometheus-14.9.2           2.26.0     
prometheus-adapter      prometheus-system       6               2021-10-14 19:20:03.800011211 +0000 UTC deployed        prometheus-adapter-2.17.0   v0.9.0     

Following the workaround proposed by @junaid-ali, I was able to make it work by changing the association of the node resource to the label instance (instead of the original node). My values file currently looks like this:

prometheus:
  path: ""
  port: 80
  url: http://prometheus-server.prometheus-system.svc
rules:
  resource:
    cpu:
      containerLabel: container
      containerQuery: sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,
        container!=""}[3m])) by (<<.GroupBy>>)
      nodeQuery: sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>, id='/'}[3m]))
        by (<<.GroupBy>>)
      resources:
        overrides:
          instance:
            resource: node
          namespace:
            resource: namespace
          pod:
            resource: pod
    memory:
      containerLabel: container
      containerQuery: sum(container_memory_working_set_bytes{<<.LabelMatchers>>, container!=""})
        by (<<.GroupBy>>)
      nodeQuery: sum(container_memory_working_set_bytes{<<.LabelMatchers>>,id='/'})
        by (<<.GroupBy>>)
      resources:
        overrides:
          instance:
            resource: node
          namespace:
            resource: namespace
          pod:
            resource: pod
    window: 3m
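
To apply values like these with the community chart, something along these lines should work (a sketch; the release name and namespace are taken from the helm ls output above):

$ helm upgrade prometheus-adapter prometheus-community/prometheus-adapter \
    -n prometheus-system -f values.yaml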

After that, I’m now able to query resource metrics with kubectl top nodes/pods:

$ kubectl top nodes
NAME                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-10-1-10-60.ec2.internal    47m          2%     657Mi           9%        
ip-10-1-14-239.ec2.internal   76m          3%     1121Mi          16%       

$ kubectl top pods -A
NAMESPACE           NAME                                             CPU(cores)   MEMORY(bytes)   
kube-system         aws-node-2h4k6                                   3m           47Mi            
kube-system         aws-node-fdspx                                   3m           48Mi            
kube-system         coredns-66cb55d4f4-7g7x4                         0m           10Mi            
kube-system         coredns-66cb55d4f4-7wzsc                         1m           9Mi             
kube-system         kube-proxy-fd9ps                                 0m           13Mi            
kube-system         kube-proxy-fsbwq                                 0m           13Mi            
prometheus-system   prometheus-adapter-8bcbbfb8b-gv8m8               10m          39Mi            
prometheus-system   prometheus-alertmanager-787f86875f-x9skk         0m           12Mi            
prometheus-system   prometheus-kube-state-metrics-58c5cd6ddb-666td   0m           11Mi            
prometheus-system   prometheus-node-exporter-4rh98                   0m           7Mi             
prometheus-system   prometheus-node-exporter-5xv2s                   0m           7Mi             
prometheus-system   prometheus-pushgateway-6bd6fcd9b8-m4nmg          0m           7Mi             
prometheus-system   prometheus-server-648c978678-9dbbx               13m          370Mi  

I undid the node overrides and put the config back to how the README has it; that seems to have resolved the issue for me.

@nicraMarcin I don’t declare any additional rules

@dgrisonnet it’s only happening for nodes. Also, it always returns error: metrics not available yet for nodes (and not an intermittent issue); on re-creating the Prometheus Adapter pod, the issue goes away.