prometheus: Prometheus goes into an OOM loop and logs "unknown series references"
Bug Report
What did you do?
Running Prometheus as a Docker container on a standalone box.
What did you expect to see?
Prometheus running without any issues.
What did you see instead? Under which circumstances?
Prometheus enters an endless crash loop and logs the error below each time it restarts.
level=warn ts=2019-05-13T09:02:15.270Z caller=head.go:454 component=tsdb msg="unknown series references" count=1085
Environment
Prometheus version:
2.9.2
Alertmanager version: N/A
Prometheus configuration file:
global:
  scrape_interval: 1m
  scrape_timeout: 10s
  evaluation_interval: 1m
scrape_configs:
- job_name: nodeexporter
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 9100
    filters: []
  relabel_configs:
  - source_labels: [__meta_ec2_tag_monitoring]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_ec2_tag_Name]
    separator: ;
    regex: (.*)
    target_label: asg_group
    replacement: $1
    action: replace
- job_name: elasticexporter
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /_prometheus/metrics
  scheme: http
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 9200
    filters: []
  relabel_configs:
  - source_labels: [__meta_ec2_tag_elasticsearch]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
- job_name: prod-blue-traefik-external
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - <endpoint>
- job_name: prod-blue-traefik-internal
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - <endpoint>
- job_name: prod-blue-traefik-concierge
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - <endpoint>
- job_name: stage-blue-traefik-external
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - <endpoint>
- job_name: stage-blue-traefik-internal
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - <endpoint>
- job_name: stage-blue-traefik-concierge
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - <endpoint>
- job_name: <env>-green-traefik-external
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - <endpoint>
- job_name: <env>-green-traefik-internal
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - <endpoint>
- job_name: <env>-green-traefik-concierge
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - <endpoint>
- job_name: <env>-blue-traefik-external
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - <endpoint>
- job_name: <env>-blue-traefik-internal
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - <endpoint>
- job_name: <env>-blue-traefik-concierge
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - <endpoint>
- job_name: prod-blue-kube-state-metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  static_configs:
  - targets:
    - <endpoint>
  tls_config:
    insecure_skip_verify: true
- job_name: stage-blue-kube-state-metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  static_configs:
  - targets:
    - <endpoint>
  tls_config:
    insecure_skip_verify: true
- job_name: <env>-green-state-metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  static_configs:
  - targets:
    - <endpoint>
  tls_config:
    insecure_skip_verify: true
- job_name: <env>-blue-state-metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  static_configs:
  - targets:
    - <endpoint>
  tls_config:
    insecure_skip_verify: true
- job_name: prod-blue-kube-cadvisor
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 4194
    filters: []
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Name]
    separator: ;
    regex: prod-blue.*
    replacement: $1
    action: keep
- job_name: <env>-green-kube-cadvisor
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 4194
    filters: []
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Name]
    separator: ;
    regex: <env>-green.*
    replacement: $1
    action: keep
- job_name: <env>-blue-kube-cadvisor
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 4194
    filters: []
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Name]
    separator: ;
    regex: <env>-blue.*
    replacement: $1
    action: keep
- job_name: stage-blue-kube-cadvisor
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 4194
    filters: []
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Name]
    separator: ;
    regex: stage-blue.*
    replacement: $1
    action: keep
- job_name: prod-blue-etcd-metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 2379
    filters: []
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Name]
    separator: ;
    regex: prod-blue-etcd.*
    replacement: $1
    action: keep
- job_name: <env>-green-etcd-metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 2379
    filters: []
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Name]
    separator: ;
    regex: <env>-green-etcd.*
    replacement: $1
    action: keep
- job_name: <env>-blue-etcd-metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 2379
    filters: []
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Name]
    separator: ;
    regex: <env>-blue-etcd.*
    replacement: $1
    action: keep
- job_name: stage-blue-etcd-metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 2379
    filters: []
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Name]
    separator: ;
    regex: stage-blue-etcd.*
    replacement: $1
    action: keep
- job_name: prod-blue-kube-router-metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /kube-router/metrics
  scheme: http
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 63330
    filters: []
  relabel_configs:
  - source_labels: [__meta_ec2_tag_environment]
    separator: ;
    regex: prod
    replacement: $1
    action: keep
  - source_labels: [__meta_ec2_tag_role]
    separator: ;
    regex: blue
    replacement: $1
    action: keep
- job_name: <env>-green-kube-router-metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /kube-router/metrics
  scheme: http
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 63330
    filters: []
  relabel_configs:
  - source_labels: [__meta_ec2_tag_environment]
    separator: ;
    regex: cde
    replacement: $1
    action: keep
  - source_labels: [__meta_ec2_tag_role]
    separator: ;
    regex: green
    replacement: $1
    action: keep
- job_name: <env>-blue-kube-router-metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /kube-router/metrics
  scheme: http
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 63330
    filters: []
  relabel_configs:
  - source_labels: [__meta_ec2_tag_environment]
    separator: ;
    regex: <env>
    replacement: $1
    action: keep
  - source_labels: [__meta_ec2_tag_role]
    separator: ;
    regex: blue
    replacement: $1
    action: keep
- job_name: stage-blue-kube-router-metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /kube-router/metrics
  scheme: http
  ec2_sd_configs:
  - endpoint: ""
    region: ap-south-1
    refresh_interval: 1m
    port: 63330
    filters: []
  relabel_configs:
  - source_labels: [__meta_ec2_tag_environment]
    separator: ;
    regex: stage
    replacement: $1
    action: keep
  - source_labels: [__meta_ec2_tag_role]
    separator: ;
    regex: blue
    replacement: $1
    action: keep
- job_name: push-gateway
  honor_timestamps: true
  scrape_interval: 10m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9091
- job_name: prometheus-self
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9090
System information:
Linux 4.14.11-coreos x86_64
Prometheus version:
2.9.2 (also seen with 2.7.0)
Logs:
level=warn ts=2019-05-13T08:44:32.392Z caller=head.go:454 component=tsdb msg="unknown series references" count=997
level=info ts=2019-05-13T08:44:50.284Z caller=main.go:655 msg="TSDB started"
level=info ts=2019-05-13T08:44:50.284Z caller=main.go:724 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yaml
level=info ts=2019-05-13T08:44:50.297Z caller=main.go:751 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yaml
level=info ts=2019-05-13T08:44:50.297Z caller=main.go:609 msg="Server is ready to receive web requests."
fatal error: runtime: out of memory
The initial startup also takes more than 15 minutes.
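To track whether the crash loop is getting better or worse across restarts, the log excerpt above can be summarized with a small script. This is only a rough sketch: it assumes the logfmt-style lines shown in this report, and the function names `parse_logfmt` and `summarize_startup_log` are made up for illustration, not part of Prometheus.

```python
import re

# Matches key=value pairs where the value is either a quoted string or a bare token.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_logfmt(line):
    """Parse one logfmt-style line into a dict (hypothetical helper)."""
    fields = {}
    for key, value in PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # drop the surrounding quotes
        fields[key] = value
    return fields


def summarize_startup_log(lines):
    """Collect the 'unknown series references' counts and detect a Go runtime OOM."""
    summary = {"unknown_series_refs": [], "tsdb_started": False, "oom": False}
    for raw in lines:
        line = raw.strip()
        # The Go runtime's fatal error is not logfmt, so match it as plain text.
        if line.startswith("fatal error:") and "out of memory" in line:
            summary["oom"] = True
            continue
        fields = parse_logfmt(line)
        if fields.get("msg") == "unknown series references":
            summary["unknown_series_refs"].append(int(fields["count"]))
        elif fields.get("msg") == "TSDB started":
            summary["tsdb_started"] = True
    return summary


# Lines copied from the Logs section of this report.
SAMPLE = [
    'level=warn ts=2019-05-13T08:44:32.392Z caller=head.go:454 component=tsdb msg="unknown series references" count=997',
    'level=info ts=2019-05-13T08:44:50.284Z caller=main.go:655 msg="TSDB started"',
    'level=info ts=2019-05-13T08:44:50.297Z caller=main.go:609 msg="Server is ready to receive web requests."',
    "fatal error: runtime: out of memory",
]

if __name__ == "__main__":
    print(summarize_startup_log(SAMPLE))
```

Feeding it the output of `docker logs` for the container should show, per restart, whether TSDB finished starting before the OOM, as it does in the excerpt above.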
About this issue
- State: closed
- Created 5 years ago
- Comments: 15 (2 by maintainers)
This has not been fixed. Generally, if Prometheus OOMs once, it is very likely to OOM again shortly after starting up, because it replays all of the previous state back into memory. Without losing data, that is hard to fix unless you increase the memory available to Prometheus after an OOM.
Prometheus v2.12.0 includes some improvements in this area, such as general memory improvements inside TSDB and logging that shows WAL replay progress (WAL replay is generally what causes the long startup times). There has also been some discussion of ways to improve WAL startup time, but no work has been merged yet. One example is: https://github.com/prometheus/prometheus/pull/6059.
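Since WAL replay is what drives both the startup time and the memory spike, checking how much WAL data is on disk before restarting can give a rough idea of what to expect. The sketch below sums the files under the `wal` subdirectory of the TSDB data directory. The `/prometheus` path is an assumption about where the container's data volume is mounted, `wal_size_bytes` is a made-up helper name, and on-disk size is only a loose proxy for replay memory, not a prediction.

```python
import os


def wal_size_bytes(data_dir):
    """Return the total size in bytes of all files under <data_dir>/wal.

    Includes segment files and any checkpoint subdirectories.
    Returns 0 if the wal directory does not exist.
    """
    wal_dir = os.path.join(data_dir, "wal")
    total = 0
    for root, _dirs, files in os.walk(wal_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


if __name__ == "__main__":
    # Assumed data directory; adjust to the path passed via --storage.tsdb.path.
    data_dir = "/prometheus"
    size_mib = wal_size_bytes(data_dir) / (1024 * 1024)
    print(f"WAL under {data_dir}: {size_mib:.1f} MiB")
```

If the number keeps growing between crashes, the head block is probably never being compacted before the next OOM, which fits the loop described above.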