prometheus: Prometheus 2.0 fails to start up after a couple of restarts
What did you do?
We run 2 Prometheus instances on Google Container Engine using preemptibles. This means the instances are relocated at least every 24 hours (the max lifetime of a preemptible).
With version 1.7.1 this caused issues because the graceful shutdown sometimes took too long and didn’t fully finish. After that, startup took longer than the initialDelaySeconds of 1200s, causing Kubernetes to restart Prometheus over and over and making Prometheus unavailable. Deleting the instances and their data was the only simple way to get things up and running again.
With the much faster storage system in Prometheus 2 I had high hopes that this would no longer happen, but we seem to experience another cause of failure. After some restarts Prometheus does start and shows the following 4 log lines, but it fails to respond to the liveness check using the /status endpoint.
time="2017-09-19T07:31:22Z" level=info msg="Starting prometheus (version=2.0.0-beta.3, branch=HEAD, revision=066783b3991dd64729325fc4f880dfffb484a2c2)" source="main.go:210"
time="2017-09-19T07:31:22Z" level=info msg="Build context (go=go1.8.3, user=root@0cbc320660dc, date=20170912-10:17:45)" source="main.go:211"
time="2017-09-19T07:31:22Z" level=info msg="Host details (Linux 4.4.52+ #1 SMP Thu Jul 13 11:47:20 PDT 2017 x86_64 prometheus-1 (none))" source="main.go:212"
time="2017-09-19T07:31:22Z" level=info msg="Starting tsdb" source="main.go:224"
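For reference, the liveness check described above could look roughly like the following container spec fragment. This is only a sketch: the report gives the /status path and the initialDelaySeconds of 1200s, while the container name, image tag, port 9090 (the Prometheus default listen port), timeout, and failure threshold are assumptions.

```yaml
# Hypothetical container spec fragment; only `path` and
# `initialDelaySeconds` come from the report, the rest is assumed.
containers:
- name: prometheus                        # assumed name
  image: prom/prometheus:v2.0.0-beta.3    # assumed image reference
  ports:
  - containerPort: 9090                   # Prometheus default listen port
  livenessProbe:
    httpGet:
      path: /status                       # endpoint named in the report
      port: 9090
    initialDelaySeconds: 1200             # value quoted in the report
    timeoutSeconds: 5                     # assumed
    failureThreshold: 3                   # assumed
```

With a probe like this, a Prometheus that logs "Starting tsdb" but never begins serving HTTP would be killed and restarted by the kubelet once the initial delay has passed and the probe has failed the configured number of times.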
What did you expect to see?
Prometheus to cope well with restarts and to shut down gracefully within the maximum of 30 seconds that a preemptible shutdown allows.
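The 30-second limit corresponds to the pod's termination grace period. A minimal sketch, assuming a pod template with a hypothetical container name; 30 is also the Kubernetes default for this field, so it is written out here only to make the constraint visible:

```yaml
# Hypothetical pod template fragment; the container name is assumed.
spec:
  # Time between SIGTERM and SIGKILL. Matches the 30-second
  # preemptible shutdown window mentioned in the report.
  terminationGracePeriodSeconds: 30
  containers:
  - name: prometheus
```

Raising this value would likely not help on a preemptible node, since the report states that the node itself only gets 30 seconds to shut down.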
What did you see instead? Under which circumstances?
The restarts / relocations work fine most of the time, but every now and then they lead to failure. This usually seems to happen to both instances at pretty much the same time (or at least on the same day).
Environment
- System information:
Linux 4.4.64+ x86_64
- Prometheus version:
prometheus, version 2.0.0-beta.3 (branch: HEAD, revision: 066783b3991dd64729325fc4f880dfffb484a2c2)
build user: root@0cbc320660dc
build date: 20170912-10:17:45
go version: go1.8.3
- Prometheus configuration file:
global:
  scrape_interval: 10s
  scrape_timeout: 10s
  evaluation_interval: 10s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager-0.alertmanager-headless:9093
      - alertmanager-1.alertmanager-headless:9093
      - alertmanager-2.alertmanager-headless:9093
    scheme: http
    timeout: 10s
rule_files:
- /prometheus-rules/alert.yaml
- /prometheus-rules/aggregation.yaml
scrape_configs:
- job_name: kubernetes-apiservers
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: default;kubernetes;https
    replacement: $1
    action: keep
- job_name: kubernetes-nodes
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
    action: replace
- job_name: kubernetes-cadvisor
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}:4194/proxy/metrics
    action: replace
- job_name: kubernetes-service-endpoints
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:;]+);(\d+)
    target_label: __address__
    replacement: ${1}:${2}
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:;]+):(\d+);(\d+)
    target_label: __address__
    replacement: ${1}:${3}
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace
- job_name: kubernetes-services
  params:
    module:
    - http_2xx
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /probe
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: service
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__address__]
    separator: ;
    regex: (.*?):(:80|:443)
    replacement: $1
    action: keep
  - source_labels: [__address__]
    separator: ;
    regex: (.*?):80
    target_label: __param_target
    replacement: http://${1}
    action: replace
  - source_labels: [__address__]
    separator: ;
    regex: (.*?):443
    target_label: __param_target
    replacement: https://${1}
    action: replace
  - source_labels: [__param_target, __meta_kubernetes_service_annotation_prometheus_io_probe_path]
    separator: ;
    regex: (.*?);(.*?)
    target_label: __param_target
    replacement: ${1}${2}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: blackbox-exporter
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace
- job_name: kubernetes-pods
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: pod
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:;]+);(\d+)
    target_label: __address__
    replacement: ${1}:${2}
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:;]+):(\d+);(\d+)
    target_label: __address__
    replacement: ${1}:${3}
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_pod_name
    replacement: $1
    action: replace
- job_name: kubernetes-nginx-sidecar
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: pod
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape_nginx_sidecar]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__address__]
    separator: ;
    regex: (.*):(\d+)
    target_label: __address__
    replacement: ${1}:9101
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_scrape_nginx_sidecar_port]
    separator: ;
    regex: ([^:;]+):(\d+);(\d+)
    target_label: __address__
    replacement: ${1}:${3}
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_pod_name
    replacement: $1
    action: replace
- job_name: gce-vms
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  gce_sd_configs:
  - project: travix-production
    zone: europe-west1-c
    refresh_interval: 1m
    port: 9101
    tag_separator: ','
  relabel_configs:
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,prometheus,.*
    replacement: $1
    action: keep
  - source_labels: [__meta_gce_network]
    separator: ;
    regex: .*/production
    replacement: $1
    action: keep
  - source_labels: [__meta_gce_instance_name]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,app-([^,]+),.*
    target_label: app
    replacement: ${1}
    action: replace
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,team-([^,]+),.*
    target_label: team
    replacement: ${1}
    action: replace
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,version-([^,]+),.*
    target_label: version
    replacement: ${1}
    action: replace
- job_name: gce-vms-win-metrics
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  gce_sd_configs:
  - project: travix-production
    zone: europe-west1-c
    refresh_interval: 1m
    port: 9182
    tag_separator: ','
  relabel_configs:
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,prometheus,.*
    replacement: $1
    action: keep
  - source_labels: [__meta_gce_network]
    separator: ;
    regex: .*/production
    replacement: $1
    action: keep
  - source_labels: [__meta_gce_instance_name]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,app-([^,]+),.*
    target_label: app
    replacement: ${1}
    action: replace
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,team-([^,]+),.*
    target_label: team
    replacement: ${1}
    action: replace
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,version-([^,]+),.*
    target_label: version
    replacement: ${1}
    action: replace
- Logs:
Last logs before it starts to fail
Hundreds of log lines similar to
time="2017-09-19T03:36:08Z" level=warning msg="append failed" err="no token found" source="scrape.go:670" target="{__address__="10.122.158.29:5000", __metrics_path__="/metrics", __scheme__="http", app="flightapi", instance="10.122.158.29:5000", job="kubernetes-pods", kubernetes_namespace="production", kubernetes_pod_name="flightapi-4094240061-9gpgb", pod_template_hash="4094240061", team="loki", tier="platform", track="stable", version="1.1.772"}
Logs when restarting after failure
time="2017-09-19T07:31:22Z" level=info msg="Starting prometheus (version=2.0.0-beta.3, branch=HEAD, revision=066783b3991dd64729325fc4f880dfffb484a2c2)" source="main.go:210"
time="2017-09-19T07:31:22Z" level=info msg="Build context (go=go1.8.3, user=root@0cbc320660dc, date=20170912-10:17:45)" source="main.go:211"
time="2017-09-19T07:31:22Z" level=info msg="Host details (Linux 4.4.52+ #1 SMP Thu Jul 13 11:47:20 PDT 2017 x86_64 prometheus-1 (none))" source="main.go:212"
time="2017-09-19T07:31:22Z" level=info msg="Starting tsdb" source="main.go:224"
About this issue
- State: closed
- Created 7 years ago
- Reactions: 3
- Comments: 39 (23 by maintainers)
The 2.3.2 release has a fix for that: https://github.com/prometheus/prometheus/pull/4370