kubernetes: After upgrading to 1.7.0, Kubelet no longer reports cAdvisor stats
Is this a BUG REPORT or FEATURE REQUEST?: Bug report.
/kind bug
What happened:
I upgraded a cluster from 1.6.6 to 1.7.0. Kubelet no longer reports cAdvisor metrics such as container_cpu_usage_seconds_total on its metrics endpoint (https://node:10250/metrics/). Kubelet’s own metrics are still there. cAdvisor itself (http://node:4194/) does show container metrics.
What you expected to happen:
Nothing in the release notes suggests this interface has changed, so I expected the metrics would still be there.
How to reproduce it (as minimally and precisely as possible):
I don’t know, but I can reproduce it reliably on this cluster; rebooting or reinstalling nodes doesn’t make a difference.
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0+coreos.0", GitCommit:"8c1bf133b4129042ef8f7d1ffac1be14ee83ed10", GitTreeState:"clean", BuildDate:"2017-06-30T17:46:00Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: GCE
- OS (e.g. from /etc/os-release): CoreOS 1409.5.0
- Kernel (e.g. `uname -a`): Linux staging-worker-710d.c.torchkube.internal 4.11.6-coreos-r1 #1 SMP Thu Jun 22 22:04:38 UTC 2017 x86_64 Intel® Xeon® CPU @ 2.20GHz GenuineIntel GNU/Linux
- Install tools: Custom scripts.
- Others:
About this issue
- State: closed
- Created 7 years ago
- Reactions: 8
- Comments: 43 (29 by maintainers)
Commits related to this issue
- Merge pull request #49079 from smarterclayton/restore_metrics Automatic merge from submit-queue Restore cAdvisor prometheus metrics to the main port But under a new path - `/metrics/cadvisor`. This... — committed to kubernetes/kubernetes by deleted user 7 years ago
- documentation: update Kubernetes example for 1.7 Kubernetes 1.7+ no longer exposes cAdvisor metrics on the Kubelet metrics endpoint. Update the example configuration to scrape cAdvisor in addition t... — committed to unixwitch/prometheus by unixwitch 7 years ago
- documentation: update Kubernetes example for 1.7 (#2918) Kubernetes 1.7+ no longer exposes cAdvisor metrics on the Kubelet metrics endpoint. Update the example configuration to scrape cAdvisor in ... — committed to prometheus/prometheus by unixwitch 7 years ago
@dashpole The problem is that in 1.6 and earlier, port 10255 returned cAdvisor container metrics. The fact that it no longer does is an incompatible change that has broken Prometheus, which scrapes this port: https://github.com/prometheus/prometheus/blob/release-1.7/discovery/kubernetes/node.go#L156
If this was intentionally changed, shouldn’t there have been an entry in the release notes?
Does this also mean it’s now impossible to scrape container metrics over TLS (which worked before using port 10250)? That seems like a significant regression in functionality.
I will be working on a fix, will send a PR tomorrow hopefully.
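Regarding the TLS question above, a minimal Go sketch of scraping the Kubelet's secure port 10250 with a service-account bearer token. The node name, the skipped certificate verification, and the `/metrics/cadvisor` path (the one restored by the PR in the commit list above) are assumptions for illustration, not something this thread prescribes.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

func main() {
	// Standard in-cluster service-account token path (assumption: run from a pod).
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}

	client := &http.Client{
		Transport: &http.Transport{
			// A real setup should load the cluster CA instead of skipping verification.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	// "node" is a placeholder; /metrics/cadvisor is the path added on port 10250
	// by the restore PR referenced above.
	req, err := http.NewRequest("GET", "https://node:10250/metrics/cadvisor", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(string(token)))

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```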
Add a `kubernetes-cadvisors` job to your Prometheus config to fix the missing `container_*` metrics, if you install Prometheus with Helm.

Sorry to hijack this issue, but there's clearly a problem with the cAdvisor endpoint in 1.7.1. It randomly reports either systemd cgroups or Docker containers, e.g. for `container_memory_usage_bytes`.

@grobie Do you expect to change it back so that `:10255/metrics` includes cAdvisor metrics? Or will the fix be something different? I ask because this broke prometheus-operator's ability to scrape cAdvisor metrics, and I'm wondering if I should propose a change to prometheus-operator to look for metrics on the cAdvisor port, or just hold out for cAdvisor metrics to come back on port 10255.

But this outputs JSON, which Prometheus doesn't understand. There is no way to collect the metrics in Prometheus format any more, at least in kubeadm's default configuration. (Edit: unless there's a way to make /stats/ output the metrics in Prometheus format. But I couldn't find any documentation suggesting that is the case.)
Well, the two changes are unrelated, yes. But the combination is quite unfortunate for Prometheus users, as both existing sources of Prometheus-format cAdvisor metrics have been disabled at the same time.
Right. The only way to collect the metrics in Prometheus format is via the cAdvisor HTTP server.
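To make that concrete, a minimal Go sketch of pulling cAdvisor's own Prometheus-format endpoint (port 4194, as mentioned above) and checking that `container_*` metric families are present. The node address is a placeholder, and using `prometheus/common/expfmt` is just one convenient way to parse the text exposition format.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// cAdvisor's standalone HTTP server on the node (address is a placeholder).
	resp, err := http.Get("http://node:4194/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the Prometheus text format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// Count how many container_* families cAdvisor exposes.
	count := 0
	for name := range families {
		if strings.HasPrefix(name, "container_") {
			count++
		}
	}
	fmt.Printf("found %d container_* metric families\n", count)
}
```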
Please don't hijack issues; it just creates confusion. Once this change is released (presumably with 1.7.3), or if you build from the release branch before then, please confirm whether your issue persists. If it does, it's a new issue, so please file it separately. If it doesn't, it was probably related, but is already dealt with.
cc @grobie Ok, so I have tracked the issue down to https://github.com/google/cadvisor/pull/1460. Specifically, changing `prometheus.MustRegister(` to `r := prometheus.NewRegistry(); r.MustRegister(` caused the metrics to no longer be displayed on the kubelet's port `10250/metrics`, and only on port `4194/metrics`. Based on the original issue, I don't think this behavior was intended, although I could be wrong.

@unixwitch I finally realized you are using the wrong port. 10255 is the kubelet's port for Prometheus metrics. As you can see, it gives a metric for runtime operation latency. Port 4194 is the cAdvisor port, which has container metrics. See if that works.
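To illustrate the registry change described above, here is a minimal Go sketch (not the actual Kubelet or cAdvisor code) showing why collectors moved onto a freshly created registry disappear from the handler that serves the default registry, and how serving the new registry under its own path makes them reachable again. The metric names, paths, and listen port are illustrative.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Registered with the global default registry: served by promhttp.Handler() at /metrics.
	onDefault := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "demo_on_default_registry_total",
		Help: "Counter registered via prometheus.MustRegister.",
	})
	prometheus.MustRegister(onDefault)

	// Registered with a freshly created registry: NOT served at /metrics,
	// only by a handler built explicitly for this registry.
	r := prometheus.NewRegistry()
	onOwn := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "demo_on_own_registry_total",
		Help: "Counter registered via r.MustRegister on a new registry.",
	})
	r.MustRegister(onOwn)

	http.Handle("/metrics", promhttp.Handler())
	http.Handle("/metrics/cadvisor", promhttp.HandlerFor(r, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Serving the dedicated registry under a separate path appears to be the approach the restore commit listed above takes (`/metrics/cadvisor` on the main port), rather than merging the cAdvisor collectors back into the default registry.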