prometheus-operator: High CPU/Memory when monitoring 1000+ ServiceMonitor/Endpoints/Service objects
What happened? I configured many (1000+) external services as targets of an operator-managed Prometheus using Kubernetes resources (Service/Endpoints/ServiceMonitor) and noticed very high CPU and memory usage in Prometheus (10 cores and 25GB RAM) as well as instability (liveness probe failures and restarts).
I then switched the target provider from ServiceMonitor objects to static files via additionalScrapeConfigs and manually copied the target files into the config_out mount directory:
additionalScrapeConfigs:
- job_name: 'foo'
  file_sd_configs:
  - files:
    - /etc/prometheus/config_out/targets/foo/*.json
CPU load and memory usage then dropped dramatically (to less than 0.5 cores and 3GB RAM).
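For reference, the files matched by that glob are ordinary file_sd target lists; a minimal example, with placeholder addresses and labels:

[
  {
    "targets": ["192.0.2.10:9100", "192.0.2.11:9100"],
    "labels": {
      "env": "production"
    }
  }
]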
Did you expect to see something different? I was not expecting such high CPU and memory usage when using Kubernetes ServiceMonitor objects to monitor the very same targets.
How to reproduce it (as minimally and precisely as possible): Create many ServiceMonitor objects (together with the matching Service/Endpoints resources) to monitor external services, observe CPU and RAM consumption, and compare with the same targets defined as static files. A sketch of one such external target is shown below.
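A minimal sketch of one external target, assuming the usual selector-less Service plus Endpoints pattern; names, namespace, IP, and port are placeholders:

apiVersion: v1
kind: Service
metadata:
  name: external-foo
  namespace: monitoring
  labels:
    app: external-foo
spec:
  clusterIP: None
  ports:
  - name: metrics
    port: 9100
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-foo
  namespace: monitoring
subsets:
- addresses:
  - ip: 192.0.2.10
  ports:
  - name: metrics
    port: 9100
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-foo
  namespace: monitoring
spec:
  endpoints:
  - port: metrics
  selector:
    matchLabels:
      app: external-foo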
Environment
- Prometheus Operator version: 6.7.3. I have verified this with Prometheus 2.10.0 and 2.14.0.
- Kubernetes version information:
  Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.6", GitCommit:"abdda3f9fefa29172298a2e42f5102e777a8ec25", GitTreeState:"clean", BuildDate:"2019-05-08T13:46:28Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster kind: Pharos
Anything else we need to know?:
When using Kubernetes ServiceMonitor objects, /etc/prometheus/config_out/prometheus.env.yaml was very large (1000+ lines) compared to 186 lines when using additionalScrapeConfigs.
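This matches the fact that the operator renders a separate scrape job, each with its own Kubernetes service discovery and relabeling, for every ServiceMonitor endpoint. Roughly of this shape, where the job name and relabelings are illustrative rather than the exact generated output:

scrape_configs:
- job_name: monitoring/external-foo/0
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_service_label_app]
    regex: external-foo
  # further generated relabelings omitted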
Also, not sure if this is helpful or not, but on the Prometheus targets page:
- All targets were under a common group when using static files.
- Each target was under its own group when using ServiceMonitor objects. I tried using the jobLabel spec attribute but I didn't manage to group them all into one group (roughly what I tried is sketched below).
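As an illustrative sketch of the jobLabel attempt (label key and value are placeholders), the ServiceMonitor points jobLabel at a label carried by every Service, e.g. monitoring-group: external-services:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-foo
  namespace: monitoring
spec:
  jobLabel: monitoring-group   # take the job name from this label on the matched Service
  endpoints:
  - port: metrics
  selector:
    matchLabels:
      app: external-foo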
I have also managed to retrieve pprof output when using ServiceMonitor objects.
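The profile was collected via Prometheus' standard pprof endpoint, roughly like this (host is a placeholder, 9090 assumes the default web port):

go tool pprof http://prometheus.example.com:9090/debug/pprof/profile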
(pprof) top30
Showing nodes accounting for 7.07mins, 82.28% of 8.59mins total
Dropped 930 nodes (cum <= 0.04mins)
Showing top 30 nodes out of 105
flat flat% sum% cum cum%
1.93mins 22.46% 22.46% 1.93mins 22.46% runtime.memclrNoHeapPointers
0.78mins 9.02% 31.49% 1.60mins 18.59% runtime.scanobject
0.61mins 7.07% 38.56% 0.83mins 9.61% runtime.findObject
0.45mins 5.20% 43.76% 0.45mins 5.20% runtime.futex
0.39mins 4.53% 48.29% 0.39mins 4.53% cmpbody
0.34mins 3.96% 52.26% 0.34mins 3.96% runtime.(*mspan).base
0.28mins 3.21% 55.47% 0.28mins 3.21% runtime.procyield
0.24mins 2.81% 58.28% 0.24mins 2.81% runtime.markBits.isMarked
0.20mins 2.28% 60.56% 0.21mins 2.47% runtime.heapBitsSetType
0.19mins 2.22% 62.78% 0.40mins 4.66% runtime.mapiternext
0.14mins 1.67% 64.45% 1.32mins 15.40% runtime.sweepone
0.14mins 1.66% 66.11% 0.58mins 6.81% sort.doPivot
0.14mins 1.65% 67.76% 6.17mins 71.86% github.com/prometheus/prometheus/scrape.targetsFromGroup
0.12mins 1.45% 69.22% 0.38mins 4.38% sort.insertionSort
0.11mins 1.22% 70.44% 0.12mins 1.45% runtime.spanOf
0.10mins 1.18% 71.62% 0.29mins 3.33% runtime.(*mTreap).insert
0.09mins 1.07% 72.69% 0.61mins 7.08% runtime.lock
0.09mins 1.06% 73.75% 0.09mins 1.06% runtime.heapBits.bits
0.08mins 0.91% 74.66% 0.08mins 0.91% runtime.aeshashbody
0.07mins 0.86% 75.52% 0.52mins 6.06% runtime.wbBufFlush1
0.07mins 0.85% 76.37% 0.07mins 0.85% runtime.(*mSpanList).remove
0.07mins 0.81% 77.17% 0.21mins 2.46% runtime.(*mheap).coalesce
0.07mins 0.79% 77.96% 0.50mins 5.88% runtime.gcWriteBarrier
0.07mins 0.78% 78.74% 0.07mins 0.78% runtime.nextFreeFast
0.07mins 0.76% 79.50% 4.96mins 57.74% runtime.mallocgc
0.05mins 0.59% 80.09% 0.08mins 0.95% runtime.mapaccess2_faststr
0.05mins 0.58% 80.68% 0.05mins 0.58% runtime.(*gcSweepBuf).pop
0.05mins 0.54% 81.21% 0.05mins 0.54% runtime.mapaccess1_fast64
0.05mins 0.54% 81.75% 0.05mins 0.54% memeqbody
0.05mins 0.53% 82.28% 0.14mins 1.59% runtime.(*mTreap).removeSpan
A lot of CPU time is spent building scrape targets (github.com/prometheus/prometheus/scrape.targetsFromGroup) and on lock contention (runtime.futex), together with allocation and garbage-collection work: clearing memory (runtime.memclrNoHeapPointers), scanning heap objects (runtime.scanobject), resolving pointers to heap objects (runtime.findObject), and sorting/comparing labels (cmpbody, sort.doPivot, sort.insertionSort).
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 3
- Comments: 15 (5 by maintainers)
Same issue here. When I added a ServiceMonitor per StatefulSet (156 targets), CPU utilization increased dramatically (from 0.8 CPU to 4 CPUs). I switched to a single ServiceMonitor for all the StatefulSet services and the load dropped back.
BTW, the prometheus_target_sync_length_seconds_sum metric dropped too, which I don't understand.
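For anyone comparing the two setups, a query along these lines (the 5m window is arbitrary) shows the per-scrape-pool sync time that this metric measures:

sum by (scrape_job) (rate(prometheus_target_sync_length_seconds_sum[5m]))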
@simonpasquier Looks like it’s a bug, so the issue should be reopened.
Could anyone explain why this happens?