prometheus: High CPU usage (32 vCPUs), apparently due to target discovery in K8s

What did you do?

We’re monitoring a Kubernetes cluster of about 400 nodes and 4,500 pods with a single Prometheus instance with 32 vCPUs (almost fully utilized, while memory hovers between 40-50Gi). The setup uses Prometheus Operator and most of the targets come from ServiceMonitor definitions (this shouldn’t be too relevant for the issue, though). There are about 130 target pools, a few of which each resolve to a few hundred pods to scrape (a handful reach a couple of thousand pods). Judging by the CPU profiling graph, most of the CPU is spent updating those target pools. pprof.prometheus.samples.cpu.005.pb.gz
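For reference, a CPU profile like the attached one can be grabbed from Prometheus’ built-in pprof endpoint; the host/port below are placeholders for this setup and the 30-second duration is just an example:

$ curl -s -o pprof.prometheus.samples.cpu.pb.gz 'http://<prometheus-host>:9090/debug/pprof/profile?seconds=30'
$ go tool pprof -top pprof.prometheus.samples.cpu.pb.gz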

[CPU profile graph: prof-prod1c-20201001]

EDIT: We’re experiencing the same in another cluster with far fewer total pods (~1,500) but many more target pools (~450).

What did you expect to see?

I’m not exactly sure what overall CPU usage to expect for such a load, but definitely not >60% of 32 vCPUs for target discovery alone.

If this usage is expected (and assuming it is indeed coming from target discovery), I would expect to be able to set a custom interval for target updates to tune this behavior, or to have some other way to reduce the CPU footprint.

What did you see instead? Under which circumstances?

32 vCPUs almost fully utilized, >60% of which seems to be related to target discovery.

I see about 80 such pools taking between 4 and 8 seconds to sync, many of them more than 5 seconds. If my understanding is correct, the sync runs every 5 seconds (https://github.com/prometheus/prometheus/blob/bd53b5ff37ec414d0f22315f2b4050e5c0a44652/scrape/manager.go#L158).
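As a side note, a rough way to see which pools are slow to sync is Prometheus’ own prometheus_target_sync_length_seconds summary; the quantile value and the 5-second threshold below are only an example of how I read these numbers, assuming the metric carries a scrape_job label in this version:

# 99th percentile sync duration per scrape pool
max by (scrape_job) (prometheus_target_sync_length_seconds{quantile="0.99"})

# number of pools whose p99 sync time exceeds the 5s sync interval
count(max by (scrape_job) (prometheus_target_sync_length_seconds{quantile="0.99"}) > 5)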

Environment

  • System information:

Linux 4.15.0-1093-azure x86_64

  • Prometheus version:

prometheus, version 2.20.1 (branch: HEAD, revision: 983ebb4a513302315a8117932ab832815f85e3d2) build user: root@7cbd4d1c15e0 build date: 20200805-17:26:58 go version: go1.14.6

  • Prometheus configuration file:
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    cluster: prod1c
    prometheus: monitoring/prometheus-operator-prometheus
    prometheus_replica: prometheus-prometheus-operator-prometheus-0
alerting:
  alert_relabel_configs:
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop
  alertmanagers:
  - kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - monitoring
    scheme: http
    path_prefix: /
    timeout: 10s
    api_version: v1
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: prometheus-operator-alertmanager
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: web
      replacement: $1
      action: keep
rule_files:
- /etc/prometheus/rules/prometheus-prometheus-operator-prometheus-rulefiles-0/*.yaml
- /etc/prometheus/rules/prometheus-prometheus-operator-prometheus-rulefiles-1/*.yaml
scrape_configs:
- job_name: asraas-prod/ambassador-asraas-prod/0
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - asraas-prod
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_service]
    separator: ;
    regex: ambassador-admin
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: ambassador-admin
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: ambassador-admin
    action: replace
- job_name: asraas-prod/bofa-eng-usa-400-krypton/0
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - asraas-prod
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app]
    separator: ;
    regex: krypton
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_release]
    separator: ;
    regex: bofa-eng-usa-400
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: kr-svc-http
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: kr-svc-http
    action: replace
- job_name: asraas-prod/bofa-eng-usa-400-krypton/1
  honor_timestamps: true
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - asraas-prod
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app]
    separator: ;
    regex: krypton
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_release]
    separator: ;
    regex: bofa-eng-usa-400
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: kr-fluentd-metrics-port
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: kr-fluentd-metrics-port
    action: replace
[...]

PS: I can provide the full configuration if that’s helpful, though it’s quite long.


Most upvoted comments

@m-yosefpor we finally moved all discovery logic out of Prometheus into a separate daemon and switched to simple file/HTTP discovery; this also allowed us to implement custom sharding logic.
Maybe it makes sense to open-source this tool.
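For context, a minimal sketch of what the Prometheus side of such a setup can look like, assuming the external daemon writes target files to a path Prometheus can read (the job name, path and intervals are illustrative, not the actual tool’s output):

scrape_configs:
- job_name: external-k8s-targets        # illustrative name
  scrape_interval: 30s
  file_sd_configs:
  - files:
    - /etc/prometheus/file_sd/*.json    # written by the external discovery daemon
    refresh_interval: 30s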

This problem hits any setup with k8s discovery and service monitors 😃 I think many engineers don’t even realize that they are heating the air by using Prometheus to monitor k8s.

@brian-brazil Hey there! What if we added a label filter to the k8s discovery plugin as a quick win? In our case discovery produces ~120 labels per target, but we use only 5 of them. Yes, there is relabel config, but the main performance cost is allocating memory for all of those labels (30,000 targets × 120 labels ≈ 3.6 million labels), sorting them, hashing them for deduplication, and only then sending them to relabeling. All of this heavy work happens on every scrape pool sync. What do you think?

It seems we are also hitting this issue, but the odd thing is that only one of our Prometheus servers is affected (we have configured 2 replicas in Prometheus Operator, no sharding).

However, the CPU usage is mostly in scrape.run for us rather than scrape.reload, so I’m not sure it’s the same problem.

prometheus-0: profile.pb.gz [pprof screenshot]

prometheus-1: profile(1).pb.gz [pprof screenshot]

You can see the difference in pprof between the two instances, as well as the difference in their CPU usage:

[CPU usage graph comparing the two instances]

More info:

$ oc get servicemonitor,podmonitor -A | wc -l
461
$ oc get po -A | wc -l
3646
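A rough way to get comparable numbers out of Prometheus itself is its own service-discovery metric; the label names below are an assumption based on the default prometheus_sd_discovered_targets metric and may differ per version:

# total discovered targets across all scrape configs
sum(prometheus_sd_discovered_targets{name="scrape"})

# breakdown per scrape config
sum by (config) (prometheus_sd_discovered_targets{name="scrape"})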

Is that the one you need? profile001.svg.zip If not, can you provide the exact pprof command to get your required profile?

We are currently facing the same issue on our Prometheus instances: we have ~30 kube SD jobs (pod role) that are not constrained by namespaces, there are a lot of pods running on the platform (~5K), and the CPU usage of those instances is abnormally high. Looking at the discovery page of the UI, we can see that most of the jobs keep fewer than 100 targets out of ~28K discovered each. To mitigate this we added selectors to some of the kube SD jobs (a sketch follows below), which helped reduce the CPU usage by half. This is not perfect, however: when a selector matches nothing, the discovery page still shows 0/28K, which is odd (I would have expected something like 0/0). Our impression is that the fewer targets you end up discovering from the kube calls, the less CPU is consumed overall, meaning the consumption might not be tied to the kube calls themselves but to what is done afterwards (relabeling and such?).
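For reference, the selectors mentioned above look roughly like this inside a kube SD job; the label value is purely illustrative, and the effect is that the Kubernetes API only returns matching objects instead of every pod in scope:

kubernetes_sd_configs:
- role: pod
  selectors:
  - role: pod
    label: "app=my-app"    # illustrative label selector; field selectors are also supported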

NB: we also noticed that the rate of prometheus_target_sync_length_seconds_sum dropped after adding the selectors:

sum(rate(prometheus_target_sync_length_seconds_sum{<filters>}[2m]))

[graph]

This is a sum, but if you look at the details per scrape job, the jobs with the new selector drop to ~0 and the other jobs see their rate drop as well:

rate(prometheus_target_sync_length_seconds_sum{<filters>}[2m])

[graph]