prometheus: After GKE 1.6 upgrade kubernetes nodes metrics endpoint returns 401

What did you do?

After upgrading a GKE cluster (both master and nodes) to 1.6.0, the job_name: 'kubernetes-nodes' job as specified in the Kubernetes configuration example results in all the node /metrics endpoints returning

server returned HTTP status 401 Unauthorized

What did you expect to see?

The node /metrics endpoints to be scraped as before upgrading to 1.6.0 (previous version was 1.5.6).

What did you see instead? Under which circumstances?

All the endpoints for the kubernetes-nodes job show as down with a server returned HTTP status 401 Unauthorized error.

Environment

Google Container Engine version 1.6.0
  • System information:
Linux 4.4.21+ x86_64
  • Prometheus version:
prometheus, version 1.5.2 (branch: master, revision: bd1182d29f462c39544f94cc822830e1c64cf55b)
  build user:       root@1a01c5f68840
  build date:       20170210-16:23:28
  go version:       go1.7.5
  • Prometheus configuration file:
- job_name: 'kubernetes-nodes'

  scheme: https

  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  kubernetes_sd_configs:
  - role: node

  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 31 (19 by maintainers)

Most upvoted comments

For my Prometheus server running inside GKE I now use the following relabeling:

relabel_configs:
- action: labelmap
  regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
  replacement: kubernetes.default.svc.cluster.local:443
- target_label: __scheme__
  replacement: https
- source_labels: [__meta_kubernetes_node_name]
  regex: (.+)
  target_label: __metrics_path__
  replacement: /api/v1/nodes/${1}/proxy/metrics
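
For completeness, here is a sketch of what the full kubernetes-nodes scrape job looks like with this relabeling folded in (only an illustration assembled from the snippets above; the CA and token paths are the in-cluster service-account defaults from the original config):

- job_name: 'kubernetes-nodes'

  # Talk to the API server and let it proxy the node /metrics requests,
  # instead of scraping the kubelets directly.
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  kubernetes_sd_configs:
  - role: node

  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  # Rewrite every node target to point at the API server...
  - target_label: __address__
    replacement: kubernetes.default.svc.cluster.local:443
  - target_label: __scheme__
    replacement: https
  # ...and fetch each node's metrics through the nodes/proxy subresource.
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics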

And the following ClusterRole bound to the service account used by Prometheus:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
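
The ClusterRole on its own does nothing until it is bound to the service account Prometheus runs under. A minimal ClusterRoleBinding sketch, assuming the ServiceAccount is called prometheus and lives in the monitoring namespace (both names are placeholders; use whatever your deployment actually uses):

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus        # placeholder: the service account Prometheus uses
  namespace: monitoring   # placeholder: the namespace Prometheus runs in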

Because the GKE cluster still has an ABAC fallback in case RBAC fails, I’m not 100% sure yet that this covers all required permissions.

It turned out that for some reason the kubelet is no longer accessible over https on port 10250. Changing the scrape address to use http and port 10255 provides an acceptable workaround for now:

      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__address__]
        action: replace
        target_label: __address__
        regex: ([^:;]+):(\d+)
        replacement: ${1}:10255
      - source_labels: [__scheme__]
        action: replace
        target_label: __scheme__
        regex: https
        replacement: http

You can access node metrics by hitting the kubernetes master, e.g.:

 https://<master-ip>/api/v1/nodes/gke-cluster-1-default-pool-b1eaf580-79km/proxy/metrics

Or you can use TLS client auth.
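
For the TLS client auth route, the scrape job's tls_config can present a client certificate and key; a minimal sketch, assuming the cluster's client cert and key have been mounted into the Prometheus pod at the paths shown (the paths are placeholders):

  scheme: https
  tls_config:
    # Authenticate to the kubelet with a client certificate instead of a bearer token.
    ca_file: /etc/prometheus/secrets/ca.crt        # placeholder path
    cert_file: /etc/prometheus/secrets/client.crt  # placeholder path
    key_file: /etc/prometheus/secrets/client.key   # placeholder path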

Hi, how did you change the port from 10250 to 10255? For me it’s not working on 10255, but when I’m curling ip:10250 it gives me output.

You can’t hit the nodes directly. You need to use the config here:

https://github.com/prometheus/prometheus/issues/2606#issuecomment-294869099