prometheus: Opening storage failed" err="invalid block sequence"
What did you do? I ran Prometheus 2.0.0 on Kubernetes v1.8.5.
What did you expect to see? Everything running normally.
What did you see instead? Under which circumstances? Everything went well at the beginning, but several hours later the pods’ statuses turned to “CrashLoopBackOff” and all Prometheus instances became unavailable. I didn’t do anything after creating the pods.
[root@k8s-1 prometheus]# kubectl get all -n monitoring
NAME                            DESIRED   CURRENT   AGE
statefulsets/prometheus-k8s     0         2         16h

NAME                   READY     STATUS             RESTARTS   AGE
po/prometheus-k8s-0    0/1       CrashLoopBackOff   81         16h
po/prometheus-k8s-1    0/1       CrashLoopBackOff   22         16h
Environment
[root@k8s-1 prometheus]# kubectl version --short
Client Version: v1.8.5
Server Version: v1.8.5
[root@k8s-1 prometheus]# docker images | grep -i prometheus
quay.io/prometheus/alertmanager    v0.12.0   f87cbd5f1360   5 weeks ago    31.2 MB
quay.io/prometheus/node_exporter   v0.15.2   ff5ecdcfc4a2   6 weeks ago    22.8 MB
quay.io/prometheus/prometheus      v2.0.0    67141fa03496   2 months ago   80.2 MB
- System information:
[root@k8s-1 prometheus]# uname -srm
Linux 3.10.0-229.el7.x86_64 x86_64
- Prometheus version:
v2.0.0
- Prometheus configuration file:
[root@k8s-1 prometheus]# cat prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-k8s-config
  namespace: monitoring
data:
  prometheus.yaml: |
    global:
      scrape_interval: 10s
      scrape_timeout: 10s
      evaluation_interval: 10s
    rule_files:
    - "/etc/prometheus-rules/*.rules"

    # A scrape configuration for running Prometheus on a Kubernetes cluster.
    # This uses separate scrape configs for cluster components (i.e. API server, node)
    # and services to allow each to use different authentication configs.
    #
    # Kubernetes labels will be added as Prometheus labels on metrics via the
    # `labelmap` relabeling action.
    #
    # If you are using Kubernetes 1.7.2 or earlier, please take note of the comments
    # for the kubernetes-cadvisor job; you will need to edit or remove this job.

    # Scrape config for API servers.
    #
    # Kubernetes exposes API servers as endpoints to the default/kubernetes
    # service so this uses `endpoints` role and uses relabelling to only keep
    # the endpoints associated with the default/kubernetes service using the
    # default named port `https`. This works for single API server deployments as
    # well as HA API server deployments.
    scrape_configs:
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https
      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus. The discovery auth config is automatic if Prometheus runs inside
      # the cluster. Otherwise, more config options have to be provided within the
      # <kubernetes_sd_config>.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        # If your node certificates are self-signed or use a different CA to the
        # master CA, then disable certificate verification below. Note that
        # certificate verification is an integral part of a secure infrastructure
        # so this should only be disabled in a controlled environment. You can
        # disable certificate verification by uncommenting the line below.
        #
        # insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      # Keep only the default/kubernetes service endpoints for the https port. This
      # will add targets for each API server which Kubernetes adds an endpoint to
      # the default/kubernetes service.
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    # Scrape config for nodes (kubelet).
    #
    # Rather than connecting directly to the node, the scrape is proxied though the
    # Kubernetes apiserver. This means it will work if Prometheus is running out of
    # cluster, or can't connect to nodes for some other reason (e.g. because of
    # firewalling).
    - job_name: 'kubernetes-nodes'
      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https
      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus. The discovery auth config is automatic if Prometheus runs inside
      # the cluster. Otherwise, more config options have to be provided within the
      # <kubernetes_sd_config>.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

    # Scrape config for Kubelet cAdvisor.
    #
    # This is required for Kubernetes 1.7.3 and later, where cAdvisor metrics
    # (those whose names begin with 'container_') have been removed from the
    # Kubelet metrics endpoint. This job scrapes the cAdvisor endpoint to
    # retrieve those metrics.
    #
    # In Kubernetes 1.7.0-1.7.2, these metrics are only exposed on the cAdvisor
    # HTTP endpoint; use "replacement: /api/v1/nodes/${1}:4194/proxy/metrics"
    # in that case (and ensure cAdvisor's HTTP server hasn't been disabled with
    # the --cadvisor-port=0 Kubelet flag).
    #
    # This job is not necessary and should be removed in Kubernetes 1.6 and
    # earlier versions, or it will cause the metrics to be scraped twice.
    - job_name: 'kubernetes-cadvisor'
      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https
      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus. The discovery auth config is automatic if Prometheus runs inside
      # the cluster. Otherwise, more config options have to be provided within the
      # <kubernetes_sd_config>.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

    # Scrape config for service endpoints.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape services that have a value of `true`
    # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
    #   to set this to `https` & most likely set the `tls_config` of the scrape config.
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: If the metrics are exposed on a different port to the
    #   service then set this appropriately.
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

    # Example scrape config for probing services via the Blackbox Exporter.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/probe`: Only probe services that have a value of `true`
    - job_name: 'kubernetes-services'
      metrics_path: /probe
      params:
        module: [http_2xx]
      kubernetes_sd_configs:
      - role: service
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name

    # Example scrape config for probing ingresses via the Blackbox Exporter.
    #
    # The relabeling allows the actual ingress scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/probe`: Only probe services that have a value of `true`
    - job_name: 'kubernetes-ingresses'
      metrics_path: /probe
      params:
        module: [http_2xx]
      kubernetes_sd_configs:
      - role: ingress
      relabel_configs:
      - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
        regex: (.+);(.+);(.+)
        replacement: ${1}://${2}${3}
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_ingress_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_ingress_name]
        target_label: kubernetes_name

    # Example scrape config for pods.
    #
    # The relabeling allows the actual pod scrape endpoint to be configured via the
    # following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
    #   pod's declared ports (default is a port-free target if none are declared).
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
[root@k8s-1 prometheus]# cat prometheus-all-together.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
  annotations:
    prometheus.io/scrape: "true"
spec:
  ports:
  - name: web
    nodePort: 30900
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: k8s
  sessionAffinity: None
  type: NodePort
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: prometheus
      prometheus: k8s
  serviceName: prometheus-k8s
  replicas: 2
  template:
    metadata:
      labels:
        app: prometheus
        prometheus: k8s
    spec:
      securityContext:
        runAsUser: 65534
        fsGroup: 65534
        runAsNonRoot: true
      containers:
      - args:
        - --config.file=/etc/prometheus/config/prometheus.yaml
        - --storage.tsdb.path=/cephfs/prometheus/data
        - --storage.tsdb.retention=180d
        - --web.route-prefix=/
        - --web.enable-lifecycle
        - --web.enable-admin-api
        image: quay.io/prometheus/prometheus:v2.0.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 10
          httpGet:
            path: /status
            port: web
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        name: prometheus
        ports:
        - containerPort: 9090
          name: web
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /status
            port: web
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
          limits:
            cpu: 500m
            memory: 500Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/prometheus/config
          name: config
          readOnly: false
        - mountPath: /etc/prometheus/rules
          name: rules
          readOnly: false
        - mountPath: /cephfs/prometheus/data
          name: data
          subPath: prometheus-data
          readOnly: false
      serviceAccount: prometheus-k8s
      serviceAccountName: prometheus-k8s
      terminationGracePeriodSeconds: 60
      volumes:
      - configMap:
          defaultMode: 511
          name: prometheus-k8s-config
        name: config
      - configMap:
          defaultMode: 511
          name: prometheus-k8s-rules
        name: rules
      - name: data
        persistentVolumeClaim:
          claimName: cephfs-pvc
  updateStrategy:
    type: RollingUpdate
- Logs:
[root@k8s-1 prometheus]# kubectl logs prometheus-k8s-0 -n monitoring
level=info ts=2018-01-20T03:16:32.966070249Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
level=info ts=2018-01-20T03:16:32.966225361Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
level=info ts=2018-01-20T03:16:32.966252185Z caller=main.go:217 host_details="(Linux 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 prometheus-k8s-0 (none))"
level=info ts=2018-01-20T03:16:32.969789371Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-01-20T03:16:32.971388907Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2018-01-20T03:16:32.971596811Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=error ts=2018-01-20T03:16:59.781338012Z caller=main.go:323 msg="Opening storage failed" err="invalid block sequence: block time ranges overlap (1516348800000, 1516356000000)"
[root@k8s-1 prometheus]#
[root@k8s-1 prometheus]# kubectl logs prometheus-k8s-1 -n monitoring
level=info ts=2018-01-20T03:15:22.701351679Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
level=info ts=2018-01-20T03:15:22.70148418Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
level=info ts=2018-01-20T03:15:22.701512333Z caller=main.go:217 host_details="(Linux 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 prometheus-k8s-1 (none))"
level=info ts=2018-01-20T03:15:22.705824203Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-01-20T03:15:22.707629775Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2018-01-20T03:15:22.707837323Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=error ts=2018-01-20T03:15:54.775639791Z caller=main.go:323 msg="Opening storage failed" err="invalid block sequence: block time ranges overlap (1516348800000, 1516356000000)"
[root@k8s-1 prometheus]# kubectl describe po/prometheus-k8s-0 -n monitoring
Name: prometheus-k8s-0
Namespace: monitoring
Node: k8s-3/172.16.1.8
Start Time: Fri, 19 Jan 2018 17:59:38 +0800
Labels: app=prometheus
controller-revision-hash=prometheus-k8s-7d86dfbd86
prometheus=k8s
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"StatefulSet","namespace":"monitoring","name":"prometheus-k8s","uid":"7593d8ac-fcff-11e7-9333-fa163e48f857"...
Status: Running
IP: 10.244.2.54
Created By: StatefulSet/prometheus-k8s
Controlled By: StatefulSet/prometheus-k8s
Containers:
prometheus:
Container ID: docker://98faabe55fb71050aacd776d349a6567c25c339117159356eedc10cbc19ef02a
Image: quay.io/prometheus/prometheus:v2.0.0
Image ID: docker-pullable://quay.io/prometheus/prometheus@sha256:53afe934a8d497bb703dbbf7db273681a56677775c462833da8d85015471f7a3
Port: 9090/TCP
Args:
--config.file=/etc/prometheus/config/prometheus.yaml
--storage.tsdb.path=/cephfs/prometheus/data
--storage.tsdb.retention=180d
--web.route-prefix=/
--web.enable-lifecycle
--web.enable-admin-api
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Sat, 20 Jan 2018 11:11:00 +0800
Finished: Sat, 20 Jan 2018 11:11:29 +0800
Ready: False
Restart Count: 84
Limits:
cpu: 500m
memory: 500Mi
Requests:
cpu: 100m
memory: 200Mi
Liveness: http-get http://:web/status delay=30s timeout=3s period=5s #success=1 #failure=10
Readiness: http-get http://:web/status delay=0s timeout=3s period=5s #success=1 #failure=6
Environment: <none>
Mounts:
/cephfs/prometheus/data from data (rw)
/etc/prometheus/config from config (rw)
/etc/prometheus/rules from rules (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-x8xzh (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-k8s-config
Optional: false
rules:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-k8s-rules
Optional: false
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: cephfs-pvc
ReadOnly: false
prometheus-k8s-token-x8xzh:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-k8s-token-x8xzh
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type     Reason      Age                   From            Message
----     ------      ----                  ----            -------
Normal   Pulled      15m (x83 over 17h)    kubelet, k8s-3  Container image "quay.io/prometheus/prometheus:v2.0.0" already present on machine
Warning  FailedSync  23s (x1801 over 7h)   kubelet, k8s-3  Error syncing pod
Any suggestions?
About this issue
- State: closed
- Created 6 years ago
- Comments: 29 (4 by maintainers)
Here’s how it went for me (running a Docker container with prom/prometheus:v2.3.0). The OS was rebooted (manually); after the reboot Prometheus kept restarting with:
level=error ts=2018-07-09T09:44:19.761219359Z caller=main.go:597 err="Opening storage failed invalid block sequence: block time ranges overlap: [mint: 1530856800000, maxt: 1530864000000, range: 2h0m0s, blocks: 2]: <ulid: 01CHQD40DG2QE2ZE3MFMMQ1VFS, mint: 1530856800000, maxt: 1530864000000, range: 2h0m0s>, <ulid: 01CHZ45KDMB5S64X6R3AQMWSXD, mint: 1530856800000, maxt: 1530878400000, range: 6h0m0s>\n[mint: 1530871200000, maxt: 1530878400000, range: 2h0m0s, blocks: 2]: <ulid: 01CHZ45KDMB5S64X6R3AQMWSXD, mint: 1530856800000, maxt: 1530878400000, range: 6h0m0s>, <ulid: 01CHQTVEXG910WRSSS7S6D264W, mint: 1530871200000, maxt: 1530878400000, range: 2h0m0s>"
I stopped the container and checked the volume data. Note the last two directories; they’re the heaviest. If you check the ulids mentioned in the logs, you will notice they match the names of the directories. After messing around a little with moving away the smaller directories whose IDs appear in the logs, I ended up with the same message in the logs: somehow, Prometheus encounters the same time ranges in different chunks from different directories (my speculation only, no idea what kind of satanic magic it runs by). So I did what seemed logical: I created a backup directory and moved everything into it except for the wal directory and the latest (heaviest) non-.tmp directory. I started Prometheus again and, voilà, it works again, and the data is there and accessible (I can see it by running queries from the very beginning of the monitoring history). Hope it helps somebody.
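A minimal shell sketch of that clean-up, assuming Prometheus is stopped first and that the data directory matches the --storage.tsdb.path used above (the backup location is illustrative; nothing is deleted, only moved aside):
#!/bin/sh
# Run only while Prometheus is stopped; blocks are moved aside, not deleted.
DATA_DIR=/cephfs/prometheus/data        # --storage.tsdb.path from the pod args
BACKUP_DIR=/cephfs/prometheus/backup    # illustrative backup location

mkdir -p "$BACKUP_DIR"
cd "$DATA_DIR" || exit 1

# TSDB block directories are the ULID-named ones containing a meta.json;
# the wal/ directory and *.tmp directories are not blocks and stay untouched.
blocks=$(for d in */; do [ -f "$d/meta.json" ] && echo "${d%/}"; done | sort)
newest=$(echo "$blocks" | tail -n 1)    # ULIDs sort roughly by creation time

for b in $blocks; do
  [ "$b" = "$newest" ] && continue
  echo "moving $b to $BACKUP_DIR"
  mv "$b" "$BACKUP_DIR/"
done
Once Prometheus starts cleanly again, the moved blocks can be restored one at a time from the backup directory if any of them turn out not to overlap.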
Sorry about that, this is a bug; the fix is here: https://github.com/prometheus/tsdb/pull/299. A new bug-fix release will be out soon.
Is there a way to recover from this error without flushing data out? I don’t want to lose a chunk of my metrics data because of this 😐
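A read-only way to see what Prometheus is complaining about before touching any data: each block directory contains a meta.json recording its ulid, minTime and maxTime, so you can list the covered time ranges and spot which blocks overlap. A small sketch, assuming jq is installed and the same data path as the StatefulSet above:
#!/bin/sh
# Print each block's ULID and time range (milliseconds since epoch), sorted by start time.
DATA_DIR=/cephfs/prometheus/data   # --storage.tsdb.path from the pod args

for meta in "$DATA_DIR"/*/meta.json; do
  jq -r '"\(.ulid)  minTime=\(.minTime)  maxTime=\(.maxTime)"' "$meta"
done | sort -k2
Blocks whose ranges overlap should correspond to the ULIDs named in the "invalid block sequence" error, which at least tells you which directories are involved before deciding what to back up or move.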
We also run 2.2.0, and this issue has a few additional symptoms:
I hope this helps diagnose the problem.