prometheus: scrapePool.Sync performance regression with native histograms
What did you do?
We recently upgraded from v2.39.1 to v2.44.0 and enabled native histograms (--enable-feature=native-histograms) on a fairly sizeable instance (~50M time series), and noticed severe scrape performance regressions that forced us to roll back.
What did you expect to see?
No response
What did you see instead? Under which circumstances?
Note on graphs below: we restarted prometheus-services-0 at 08:31 UTC. It finished replaying the WAL and came online at 08:51 UTC.
- Sample ingestion on the upgraded node decreased by about 45% and we were missing a lot of metrics.
- The Targets page showed multiple minutes between scrapes of most pods (explaining the decrease in sample ingestion), despite our scrape interval being correctly set to 15 seconds.
- The scrapes themselves were completing successfully: there were no reported HTTP errors or timeouts.
- prometheus_target_sync_length_seconds increased in all quantiles, e.g. the 0.9 quantile went from around 18 seconds to spikes of 280+ seconds.
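For anyone wanting to check for the same symptom, the latency figures above come from the summary metric Prometheus exposes about itself; it can be pulled via the HTTP API along these lines (the localhost address is illustrative):

```shell
# 0.9 quantile of scrape pool sync duration, per scrape pool
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_target_sync_length_seconds{quantile="0.9"}'
```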
Here are two pprof 30s CPU profiles comparing the two versions running at the same time (let me know if there’s an easier way to share these):
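For anyone wanting to reproduce, Prometheus serves pprof on its web port, so 30s CPU profiles can be captured and diffed along these lines (instance addresses are illustrative):

```shell
# 30s CPU profile from each version (addresses illustrative)
curl -s -o v2.39.1.pprof 'http://prometheus-services-1:9090/debug/pprof/profile?seconds=30'
curl -s -o v2.44.0.pprof 'http://prometheus-services-0:9090/debug/pprof/profile?seconds=30'

# view the regression as a diff in the pprof web UI
go tool pprof -http=:8080 -diff_base=v2.39.1.pprof v2.44.0.pprof
```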
The one experiment we’ve tried is running v2.44.0 with native histograms disabled, which reduced the target sync latency. Here is prometheus-services-1:
- 12:22 - upgraded to 2.44.0 with native histograms enabled
- 12:41 - WAL replay finishes, target sync latency increases
- 13:38 - restarted with native histograms disabled
- 14:04 - WAL replay finishes
Also worth noting that reported CPU usage with native histograms enabled actually drops:
Other environment details to note:
- We’re not actually exporting any native histograms from services yet, but we do export lots of classic histograms
- The majority of our exporters return only text/plain data even when OpenMetrics/protobuf Accept headers are sent
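As a quick way to verify that second point, the content type an exporter actually negotiates can be checked with curl, sending the same kind of Accept header Prometheus sends (the exporter address is illustrative):

```shell
# request protobuf/OpenMetrics and see which format the exporter answers with
curl -s -o /dev/null -D - \
  -H 'Accept: application/vnd.google.protobuf;proto=io.prometheus.client.MetricFamily;encoding=delimited;q=1,application/openmetrics-text;version=1.0.0;q=0.75,text/plain;version=0.0.4;q=0.5' \
  http://exporter:9100/metrics | grep -i '^content-type'
```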
Let me know if there’s any more info I can provide
System information
quay.io/prometheus/busybox image
Prometheus version
prometheus --version
prometheus, version 2.44.0 (branch: HEAD, revision: 1ac5131f698ebc60f13fe2727f89b115a41f6558)
build user: root@739e8181c5db
build date: 20230514-06:18:11
go version: go1.20.4
platform: linux/amd64
tags: netgo,builtinassets,stringlabels
Prometheus configuration file
global:
  scrape_interval: 15s
  scrape_timeout: 15s
  evaluation_interval: 30s
  external_labels:
    cell: hub
    prometheus_replica: prometheus-services-0
    prometheus_shard: services
alerting:
  alert_relabel_configs:
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop
  alertmanagers:
  - follow_redirects: true
    enable_http2: true
    scheme: http
    timeout: 10s
    api_version: v2
    static_configs:
    - targets:
      - alertmanager
rule_files:
- /etc/prometheus/alerts/runbooks/*.rules
- /etc/prometheus/alerts/*.rules
- /etc/prometheus/rules/*.rules
- /etc/prometheus/alerts/runbooks/*.yaml
- /etc/prometheus/alerts/*.yaml
- /etc/prometheus/rules/*.yaml
scrape_configs:
- job_name: prometheus-disk-monitor
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 15s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  enable_http2: true
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: ^go_.+
    replacement: $1
    action: drop
  - source_labels: [__name__]
    separator: ;
    regex: ^http_.+
    replacement: $1
    action: drop
  - source_labels: [__name__]
    separator: ;
    regex: ^process_.+
    replacement: $1
    action: drop
  static_configs:
  - targets:
    - localhost:9100
- job_name: prometheus
  honor_labels: true
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: http
  authorization:
    type: Bearer
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  follow_redirects: true
  enable_http2: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app]
    separator: ;
    regex: prometheus
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_label_prometheus_shard]
    separator: ;
    regex: services
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_container_port_name]
    separator: ;
    regex: ^prometheus|http|rules$
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_name, __meta_kubernetes_pod_container_port_number]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: instance
    replacement: $1:$2
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_node_name]
    separator: ;
    regex: (.*)
    target_label: node
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_container_port_name]
    separator: ;
    regex: (.*)
    target_label: port
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_controller_name]
    separator: /
    regex: (.*)
    target_label: job
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_label_monzo_system]
    separator: ;
    regex: (.*)
    target_label: monzo_system
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_label_monzo_component]
    separator: ;
    regex: (.*)
    target_label: monzo_component
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_label_phase]
    separator: ;
    regex: (.*)
    target_label: phase
    replacement: $1
    action: replace
  kubernetes_sd_configs:
  - role: pod
    kubeconfig_file: ""
    follow_redirects: true
    enable_http2: true
- job_name: kubernetes-pods
  honor_labels: true
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 15s
  metrics_path: /metrics
  scheme: http
  authorization:
    type: Bearer
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  follow_redirects: true
  enable_http2: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape, __meta_kubernetes_pod_label_io_gmon_routing_name]
    separator: ;
    regex: ^true;.+$
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_node_name]
    separator: ;
    regex: (.*)
    target_label: node
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_label_code_owner]
    separator: ;
    regex: (.*)
    target_label: code_owner
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_label_monzo_system]
    separator: ;
    regex: (.*)
    target_label: monzo_system
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_label_monzo_component]
    separator: ;
    regex: (.*)
    target_label: monzo_component
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_label_phase]
    separator: ;
    regex: (.*)
    target_label: phase
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_label_app]
    separator: ;
    regex: (.*)
    target_label: app
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_label_component]
    separator: ;
    regex: (.*)
    target_label: component
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_label_app]
    separator: /
    regex: (.*)
    target_label: job
    replacement: $1
    action: replace
  metric_relabel_configs:
  - source_labels: [le]
    separator: ;
    regex: ^([0-9]+)$
    target_label: le
    replacement: ${1}.0
    action: replace
  kubernetes_sd_configs:
  - role: pod
    kubeconfig_file: ""
    follow_redirects: true
    enable_http2: true
storage:
  exemplars:
    max_exemplars: 100000
Alertmanager version
No response
Alertmanager configuration file
No response
Logs
No relevant logs found
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 15 (7 by maintainers)
BTW the reason I was talking about selectors is the “stringlabels” change in 2.43/44 made SD a bit worse on dropped targets (possibly a lot worse depending on your labels and annotations). I created a separate issue to document that point: #12482