prometheus: Prometheus 2.0 fails to start up after a couple of restarts

What did you do?

We run 2 Prometheus instances on Google Container Engine using preemptibles. This means the instances are relocated at least every 24 hours (the max lifetime of a preemptible).

With version 1.7.1 this caused issues because the graceful shutdown sometimes took too long and did not fully finish. After that, startup took longer than the initialDelaySeconds of 1200s, so Kubernetes restarted Prometheus over and over, leaving it unavailable. Deleting the instances and their data was the only simple way to get things up and running again.

With the much faster storage system in Prometheus 2 I had high hopes that this would no longer happen, but we seem to be hitting a different cause of failure. After some restarts Prometheus does start and prints the following 4 log lines, but it then fails to respond to the liveness check on the /status endpoint.

time="2017-09-19T07:31:22Z" level=info msg="Starting prometheus (version=2.0.0-beta.3, branch=HEAD, revision=066783b3991dd64729325fc4f880dfffb484a2c2)" source="main.go:210"
time="2017-09-19T07:31:22Z" level=info msg="Build context (go=go1.8.3, user=root@0cbc320660dc, date=20170912-10:17:45)" source="main.go:211"
time="2017-09-19T07:31:22Z" level=info msg="Host details (Linux 4.4.52+ #1 SMP Thu Jul 13 11:47:20 PDT 2017 x86_64 prometheus-1 (none))" source="main.go:212"
time="2017-09-19T07:31:22Z" level=info msg="Starting tsdb" source="main.go:224"
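
To reproduce the liveness behaviour outside Kubernetes, a small polling loop can stand in for the probe. This is only a sketch: the host and port in the usage example (`localhost:9090`) and the 10 second interval are assumptions, not our actual probe settings; only the `/status` path and the 1200s delay come from the description above. Like the kubelet's HTTP probe, it treats any status from 200 up to but not including 400 as success.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe(url, timeout=1.0):
    """Return True if an HTTP GET to url answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts and HTTP error statuses all count as failure.
        return False


def wait_until_live(check, deadline_seconds, interval_seconds=10.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Call check() repeatedly until it returns True or the deadline would pass.

    Returns the elapsed seconds on success, or None if the deadline expired.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start + interval_seconds > deadline_seconds:
            return None
        sleep(interval_seconds)


if __name__ == "__main__":
    # Hypothetical address; adjust to wherever the instance is reachable.
    url = "http://localhost:9090/status"
    elapsed = wait_until_live(lambda: probe(url), deadline_seconds=1200)
    if elapsed is None:
        print("no successful response within 1200s")
    else:
        print("live after %.0fs" % elapsed)
```

In the failing state described here, a loop like this would be expected to run until the deadline without ever getting a successful response.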

What did you expect to see?

Prometheus to cope well with restarts and to be able to shut down gracefully within the maximum of 30 seconds that a preemptible shutdown allows.

What did you see instead? Under which circumstances?

The restarts / relocations work fine most of the time, but every now and then they lead to failure. This usually seems to happen to both instances at about the same time (or at least on the same day).

Environment

  • System information:

Linux 4.4.64+ x86_64

  • Prometheus version:
prometheus, version 2.0.0-beta.3 (branch: HEAD, revision: 066783b3991dd64729325fc4f880dfffb484a2c2)
  build user:       root@0cbc320660dc
  build date:       20170912-10:17:45
  go version:       go1.8.3
  • Prometheus configuration file:
global:
  scrape_interval: 10s
  scrape_timeout: 10s
  evaluation_interval: 10s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager-0.alertmanager-headless:9093
      - alertmanager-1.alertmanager-headless:9093
      - alertmanager-2.alertmanager-headless:9093
    scheme: http
    timeout: 10s
rule_files:
- /prometheus-rules/alert.yaml
- /prometheus-rules/aggregation.yaml
scrape_configs:
- job_name: kubernetes-apiservers
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: default;kubernetes;https
    replacement: $1
    action: keep
- job_name: kubernetes-nodes
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
    action: replace
- job_name: kubernetes-cadvisor
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}:4194/proxy/metrics
    action: replace
- job_name: kubernetes-service-endpoints
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:;]+);(\d+)
    target_label: __address__
    replacement: ${1}:${2}
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:;]+):(\d+);(\d+)
    target_label: __address__
    replacement: ${1}:${3}
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace
- job_name: kubernetes-services
  params:
    module:
    - http_2xx
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /probe
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: service
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__address__]
    separator: ;
    regex: (.*?):(:80|:443)
    replacement: $1
    action: keep
  - source_labels: [__address__]
    separator: ;
    regex: (.*?):80
    target_label: __param_target
    replacement: http://${1}
    action: replace
  - source_labels: [__address__]
    separator: ;
    regex: (.*?):443
    target_label: __param_target
    replacement: https://${1}
    action: replace
  - source_labels: [__param_target, __meta_kubernetes_service_annotation_prometheus_io_probe_path]
    separator: ;
    regex: (.*?);(.*?)
    target_label: __param_target
    replacement: ${1}${2}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: blackbox-exporter
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace
- job_name: kubernetes-pods
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: pod
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:;]+);(\d+)
    target_label: __address__
    replacement: ${1}:${2}
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:;]+):(\d+);(\d+)
    target_label: __address__
    replacement: ${1}:${3}
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_pod_name
    replacement: $1
    action: replace
- job_name: kubernetes-nginx-sidecar
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: pod
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape_nginx_sidecar]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__address__]
    separator: ;
    regex: (.*):(\d+)
    target_label: __address__
    replacement: ${1}:9101
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_scrape_nginx_sidecar_port]
    separator: ;
    regex: ([^:;]+):(\d+);(\d+)
    target_label: __address__
    replacement: ${1}:${3}
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_pod_name
    replacement: $1
    action: replace
- job_name: gce-vms
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  gce_sd_configs:
  - project: travix-production
    zone: europe-west1-c
    refresh_interval: 1m
    port: 9101
    tag_separator: ','
  relabel_configs:
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,prometheus,.*
    replacement: $1
    action: keep
  - source_labels: [__meta_gce_network]
    separator: ;
    regex: .*/production
    replacement: $1
    action: keep
  - source_labels: [__meta_gce_instance_name]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,app-([^,]+),.*
    target_label: app
    replacement: ${1}
    action: replace
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,team-([^,]+),.*
    target_label: team
    replacement: ${1}
    action: replace
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,version-([^,]+),.*
    target_label: version
    replacement: ${1}
    action: replace
- job_name: gce-vms-win-metrics
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  gce_sd_configs:
  - project: travix-production
    zone: europe-west1-c
    refresh_interval: 1m
    port: 9182
    tag_separator: ','
  relabel_configs:
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,prometheus,.*
    replacement: $1
    action: keep
  - source_labels: [__meta_gce_network]
    separator: ;
    regex: .*/production
    replacement: $1
    action: keep
  - source_labels: [__meta_gce_instance_name]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,app-([^,]+),.*
    target_label: app
    replacement: ${1}
    action: replace
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,team-([^,]+),.*
    target_label: team
    replacement: ${1}
    action: replace
  - source_labels: [__meta_gce_tags]
    separator: ;
    regex: .*,version-([^,]+),.*
    target_label: version
    replacement: ${1}
    action: replace
  • Logs:

Last logs before it starts to fail

Hundreds of log lines similar to

time="2017-09-19T03:36:08Z" level=warning msg="append failed" err="no token found" source="scrape.go:670" target="{__address__="10.122.158.29:5000", __metrics_path__="/metrics", __scheme__="http", app="flightapi", instance="10.122.158.29:5000", job="kubernetes-pods", kubernetes_namespace="production", kubernetes_pod_name="flightapi-4094240061-9gpgb", pod_template_hash="4094240061", team="loki", tier="platform", track="stable", version="1.1.772"}

Logs when restarting after failure

time="2017-09-19T07:31:22Z" level=info msg="Starting prometheus (version=2.0.0-beta.3, branch=HEAD, revision=066783b3991dd64729325fc4f880dfffb484a2c2)" source="main.go:210"
time="2017-09-19T07:31:22Z" level=info msg="Build context (go=go1.8.3, user=root@0cbc320660dc, date=20170912-10:17:45)" source="main.go:211"
time="2017-09-19T07:31:22Z" level=info msg="Host details (Linux 4.4.52+ #1 SMP Thu Jul 13 11:47:20 PDT 2017 x86_64 prometheus-1 (none))" source="main.go:212"
time="2017-09-19T07:31:22Z" level=info msg="Starting tsdb" source="main.go:224"
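
For anyone triaging similar output, the startup lines above can be parsed mechanically to see how far the process got. The following is a minimal sketch, not part of Prometheus: the helper names `parse_line` and `last_message` are made up for illustration, and the parser only handles the simple `key=value` / `key="quoted value"` shape of the four startup lines. It will not cleanly parse lines such as the "append failed" warning, whose `target` value contains unescaped nested quotes.

```python
import re

# key=value pairs where the value is either a double-quoted string or a bare token.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

SAMPLE = '''\
time="2017-09-19T07:31:22Z" level=info msg="Starting prometheus (version=2.0.0-beta.3, branch=HEAD, revision=066783b3991dd64729325fc4f880dfffb484a2c2)" source="main.go:210"
time="2017-09-19T07:31:22Z" level=info msg="Build context (go=go1.8.3, user=root@0cbc320660dc, date=20170912-10:17:45)" source="main.go:211"
time="2017-09-19T07:31:22Z" level=info msg="Host details (Linux 4.4.52+ #1 SMP Thu Jul 13 11:47:20 PDT 2017 x86_64 prometheus-1 (none))" source="main.go:212"
time="2017-09-19T07:31:22Z" level=info msg="Starting tsdb" source="main.go:224"
'''


def parse_line(line):
    """Split one log line into a dict of its key=value fields."""
    fields = {}
    for key, raw in _PAIR.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            raw = raw[1:-1]  # drop the surrounding quotes
        fields[key] = raw
    return fields


def last_message(lines):
    """Return the msg field of the last line that has one, or None."""
    last = None
    for line in lines:
        fields = parse_line(line)
        if "msg" in fields:
            last = fields["msg"]
    return last


if __name__ == "__main__":
    print(last_message(SAMPLE.splitlines()))
```

If the last message stays at "Starting tsdb" while the liveness check keeps failing, that suggests the process has not progressed past opening the storage, which matches what we observe.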

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 3
  • Comments: 39 (23 by maintainers)

Most upvoted comments