kubernetes: After upgrading to 1.7.0, Kubelet no longer reports cAdvisor stats

Is this a BUG REPORT or FEATURE REQUEST?: Bug report.

/kind bug

What happened:

I upgraded a cluster from 1.6.6 to 1.7.0. Kubelet no longer reports cAdvisor metrics such as container_cpu_usage_seconds_total on its metrics endpoint (https://node:10250/metrics/). Kubelet’s own metrics are still there. cAdvisor itself (http://node:4194/) does show container metrics.

What you expected to happen:

Nothing in the release notes suggests this interface has changed, so I expected the metrics would still be there.

How to reproduce it (as minimally and precisely as possible):

I don’t know, but I can reproduce it reliably on this cluster; rebooting or reinstalling nodes doesn’t make a difference.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0+coreos.0", GitCommit:"8c1bf133b4129042ef8f7d1ffac1be14ee83ed10", GitTreeState:"clean", BuildDate:"2017-06-30T17:46:00Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: GCE
  • OS (e.g. from /etc/os-release): CoreOS 1409.5.0
  • Kernel (e.g. uname -a): Linux staging-worker-710d.c.torchkube.internal 4.11.6-coreos-r1 #1 SMP Thu Jun 22 22:04:38 UTC 2017 x86_64 Intel® Xeon® CPU @ 2.20GHz GenuineIntel GNU/Linux
  • Install tools: Custom scripts.
  • Others:

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 8
  • Comments: 43 (29 by maintainers)

Most upvoted comments

@dashpole The problem is that in 1.6 and earlier, port 10255 returned cAdvisor container metrics. The fact that it no longer does is an incompatible change that has broken Prometheus, which scrapes from this port: https://github.com/prometheus/prometheus/blob/release-1.7/discovery/kubernetes/node.go#L156

If this was intentionally changed, shouldn’t there have been an entry in the release notes?

Does this also mean it’s now impossible to scrape container metrics over TLS (which worked before using port 10250)? That seems like a significant regression in functionality.
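
For reference, a node-role scrape job of the kind that relied on this looks roughly like the sketch below (illustrative only; the job name and in-cluster credential paths are assumptions, and the port in the discovered __address__ depends on the cluster):

      - job_name: 'kubernetes-nodes'

        # Scrape the kubelet over TLS with the in-cluster service account
        # credentials (standard in-cluster paths, assumed here).
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
          - role: node

        # With role: node, __address__ is the kubelet address discovered from the
        # API server; up to 1.6 that endpoint's /metrics also exposed the
        # container_* series, so nothing more was needed.
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)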

I will be working on a fix and will hopefully send a PR tomorrow.

If you installed Prometheus with Helm, add a kubernetes-cadvisors job to the Prometheus config to restore the missing container_* metrics:

      - job_name: 'kubernetes-cadvisors'

        # Default to scraping over https. If required, just disable this or change to
        # `http`.
        scheme: https

        # This TLS & bearer token file config is used to connect to the actual scrape
        # endpoints for cluster components. This is separate to discovery auth
        # configuration because discovery & scraping are two separate concerns in
        # Prometheus. The discovery auth config is automatic if Prometheus runs inside
        # the cluster. Otherwise, more config options have to be provided within the
        # <kubernetes_sd_config>.
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          # If your node certificates are self-signed or use a different CA to the
          # master CA, then disable certificate verification below. Note that
          # certificate verification is an integral part of a secure infrastructure,
          # so this should only be disabled in a controlled environment.
          #
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
          - role: node

        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}:4194/proxy/metrics

Sorry to hijack this issue, but there’s clearly a problem with the cAdvisor endpoint in 1.7.1: it randomly reports either systemd cgroups or Docker containers for metrics such as container_memory_usage_bytes.

@grobie Do you expect to change it back so that :10255/metrics includes cAdvisor metrics? Or will the fix be something different? I ask because this broke prometheus-operator’s ability to scrape cAdvisor metrics, and I’m wondering if I should propose a change to prometheus-operator to look for metrics on the cAdvisor port, or just hold out for cAdvisor metrics to come back on port 10255.

cAdvisor runs inside the kubelet and is still accessible at <node-ip>:10250/stats/

But this outputs JSON, which Prometheus doesn’t understand. There is no way to collect the metrics in Prometheus format any more, at least in kubeadm’s default configuration. (Edit: unless there’s a way to make /stats/ output the metrics in Prometheus format. But I couldn’t find any documentation suggesting that is the case.)

I think that is unrelated to the issue reported here.

Well, the two changes are unrelated, yes. But the combination of both together is quite unfortunate for Prometheus users as both existing sources of Prometheus-format cAdvisor metrics have been disabled at the same time.

Even though cAdvisor is externally accessible, the kubelet won’t show these container metrics in its API, right?

Right. The only way to collect the metrics in Prometheus format is via the cAdvisor HTTP server.
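
For example, a job that scrapes cAdvisor directly on each node could look something like the sketch below (untested; it assumes cAdvisor is still listening on :4194 and that the discovered kubelet address uses port 10250):

      - job_name: 'cadvisor-direct'

        # cAdvisor's standalone port serves plain HTTP, so no TLS config here.
        kubernetes_sd_configs:
          - role: node

        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          # Rewrite the discovered kubelet address <node>:10250 to the cAdvisor
          # port on the same node; adjust both ports to match your cluster.
          - source_labels: [__address__]
            regex: '(.+):10250'
            replacement: '${1}:4194'
            target_label: __address__

This avoids going through the API server proxy, at the cost of scraping over plain HTTP.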

Please don’t hijack issues, it just creates confusion. Once this change is released (presumably in 1.7.3), or if you build from the release branch before then, please confirm whether your issue persists. If it does, it’s a new issue; please file it separately. If it doesn’t, it was probably related, but has already been dealt with.

cc @grobie Ok, so I have tracked the issue down to https://github.com/google/cadvisor/pull/1460. Specifically, changing prometheus.MustRegister( to r := prometheus.NewRegistry(); r.MustRegister( caused the metrics to no longer be displayed on the kubelet’s port 10250/metrics, and only on port 4194/metrics. Based on the original issue, I don’t think this behavior was intended, although I could be wrong.

@unixwitch I finally realized you are using the wrong port. 10255 is the kubelet’s port for its own Prometheus metrics; as you can see, it gives a metric for runtime operation latency. Port 4194 is the cAdvisor port, which has the container metrics. See if that works.