prometheus: "Opening storage failed" err="invalid block sequence"

What did you do? I ran Prometheus 2.0.0 on Kubernetes v1.8.5.

What did you expect to see? Everything running normally.

What did you see instead? Under which circumstances? Everything went well at the beginning, but several hours later the pods' statuses turned to “CrashLoopBackOff” and all Prometheus instances became unavailable. I didn't do anything after creating the pods.

[root@k8s-1 prometheus]# kubectl get all -n monitoring
NAME                          DESIRED   CURRENT   AGE
statefulsets/prometheus-k8s   0         2         16h

NAME                  READY     STATUS             RESTARTS   AGE
po/prometheus-k8s-0   0/1       CrashLoopBackOff   81         16h
po/prometheus-k8s-1   0/1       CrashLoopBackOff   22         16h

Environment

[root@k8s-1 prometheus]# kubectl version --short
Client Version: v1.8.5
Server Version: v1.8.5
[root@k8s-1 prometheus]# docker images | grep -i prometheus
quay.io/prometheus/alertmanager                          v0.12.0             f87cbd5f1360        5 weeks ago         31.2 MB
quay.io/prometheus/node_exporter                         v0.15.2             ff5ecdcfc4a2        6 weeks ago         22.8 MB
quay.io/prometheus/prometheus                            v2.0.0              67141fa03496        2 months ago        80.2 MB
  • System information:

        [root@k8s-1 prometheus]# uname -srm
        Linux 3.10.0-229.el7.x86_64 x86_64
  • Prometheus version:

    v2.0.0

  • Prometheus configuration file:

[root@k8s-1 prometheus]# cat prometheus-configmap.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-k8s-config
  namespace: monitoring
data:
  prometheus.yaml: |
    global:
      scrape_interval: 10s
      scrape_timeout: 10s
      evaluation_interval: 10s
    rule_files:
      - "/etc/prometheus-rules/*.rules"
      
    # A scrape configuration for running Prometheus on a Kubernetes cluster.
    # This uses separate scrape configs for cluster components (i.e. API server, node)
    # and services to allow each to use different authentication configs.
    #
    # Kubernetes labels will be added as Prometheus labels on metrics via the
    # `labelmap` relabeling action.
    #
    # If you are using Kubernetes 1.7.2 or earlier, please take note of the comments
    # for the kubernetes-cadvisor job; you will need to edit or remove this job.
    
    # Scrape config for API servers.
    #
    # Kubernetes exposes API servers as endpoints to the default/kubernetes
    # service so this uses `endpoints` role and uses relabelling to only keep
    # the endpoints associated with the default/kubernetes service using the
    # default named port `https`. This works for single API server deployments as
    # well as HA API server deployments.
    scrape_configs:
    - job_name: 'kubernetes-apiservers'
    
      kubernetes_sd_configs:
      - role: endpoints
    
      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https
    
      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus. The discovery auth config is automatic if Prometheus runs inside
      # the cluster. Otherwise, more config options have to be provided within the
      # <kubernetes_sd_config>.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        # If your node certificates are self-signed or use a different CA to the
        # master CA, then disable certificate verification below. Note that
        # certificate verification is an integral part of a secure infrastructure
        # so this should only be disabled in a controlled environment. You can
        # disable certificate verification by uncommenting the line below.
        #
        # insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    
      # Keep only the default/kubernetes service endpoints for the https port. This
      # will add targets for each API server which Kubernetes adds an endpoint to
      # the default/kubernetes service.
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    
    # Scrape config for nodes (kubelet).
    #
    # Rather than connecting directly to the node, the scrape is proxied though the
    # Kubernetes apiserver.  This means it will work if Prometheus is running out of
    # cluster, or can't connect to nodes for some other reason (e.g. because of
    # firewalling).
    - job_name: 'kubernetes-nodes'
    
      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https
    
      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus. The discovery auth config is automatic if Prometheus runs inside
      # the cluster. Otherwise, more config options have to be provided within the
      # <kubernetes_sd_config>.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    
      kubernetes_sd_configs:
      - role: node
    
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
    
    # Scrape config for Kubelet cAdvisor.
    #
    # This is required for Kubernetes 1.7.3 and later, where cAdvisor metrics
    # (those whose names begin with 'container_') have been removed from the
    # Kubelet metrics endpoint.  This job scrapes the cAdvisor endpoint to
    # retrieve those metrics.
    #
    # In Kubernetes 1.7.0-1.7.2, these metrics are only exposed on the cAdvisor
    # HTTP endpoint; use "replacement: /api/v1/nodes/${1}:4194/proxy/metrics"
    # in that case (and ensure cAdvisor's HTTP server hasn't been disabled with
    # the --cadvisor-port=0 Kubelet flag).
    #
    # This job is not necessary and should be removed in Kubernetes 1.6 and
    # earlier versions, or it will cause the metrics to be scraped twice.
    - job_name: 'kubernetes-cadvisor'
    
      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https
    
      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus. The discovery auth config is automatic if Prometheus runs inside
      # the cluster. Otherwise, more config options have to be provided within the
      # <kubernetes_sd_config>.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    
      kubernetes_sd_configs:
      - role: node
    
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    
    # Scrape config for service endpoints.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape services that have a value of `true`
    # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
    # to set this to `https` & most likely set the `tls_config` of the scrape config.
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: If the metrics are exposed on a different port to the
    # service then set this appropriately.
    - job_name: 'kubernetes-service-endpoints'
    
      kubernetes_sd_configs:
      - role: endpoints
    
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    
    # Example scrape config for probing services via the Blackbox Exporter.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/probe`: Only probe services that have a value of `true`
    - job_name: 'kubernetes-services'
    
      metrics_path: /probe
      params:
        module: [http_2xx]
    
      kubernetes_sd_configs:
      - role: service
    
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name
    
    # Example scrape config for probing ingresses via the Blackbox Exporter.
    #
    # The relabeling allows the actual ingress scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/probe`: Only probe services that have a value of `true`
    - job_name: 'kubernetes-ingresses'
    
      metrics_path: /probe
      params:
        module: [http_2xx]
    
      kubernetes_sd_configs:
        - role: ingress
    
      relabel_configs:
        - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
          regex: (.+);(.+);(.+)
          replacement: ${1}://${2}${3}
          target_label: __param_target
        - target_label: __address__
          replacement: blackbox-exporter.example.com:9115
        - source_labels: [__param_target]
          target_label: instance
        - action: labelmap
          regex: __meta_kubernetes_ingress_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_ingress_name]
          target_label: kubernetes_name
    
    # Example scrape config for pods
    #
    # The relabeling allows the actual pod scrape endpoint to be configured via the
    # following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
    # pod's declared ports (default is a port-free target if none are declared).
    - job_name: 'kubernetes-pods'
    
      kubernetes_sd_configs:
      - role: pod
    
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
[root@k8s-1 prometheus]# cat prometheus-all-together.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
  annotations:
    prometheus.io/scrape: "true"
spec:
  ports:
  - name: web
    nodePort: 30900
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: k8s
  sessionAffinity: None
  type: NodePort
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: prometheus
      prometheus: k8s
  serviceName: prometheus-k8s
  replicas: 2
  template:
    metadata:
      labels:
        app: prometheus
        prometheus: k8s
    spec:
      securityContext:
        runAsUser: 65534
        fsGroup: 65534
        runAsNonRoot: true
      containers:
      - args:
        - --config.file=/etc/prometheus/config/prometheus.yaml
        - --storage.tsdb.path=/cephfs/prometheus/data
        - --storage.tsdb.retention=180d
        - --web.route-prefix=/
        - --web.enable-lifecycle
        - --web.enable-admin-api
        image: quay.io/prometheus/prometheus:v2.0.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 10
          httpGet:
            path: /status
            port: web
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        name: prometheus
        ports:
        - containerPort: 9090
          name: web
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /status
            port: web
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
          limits:
            cpu: 500m
            memory: 500Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/prometheus/config
          name: config
          readOnly: false
        - mountPath: /etc/prometheus/rules
          name: rules
          readOnly: false
        - mountPath: /cephfs/prometheus/data
          name: data
          subPath: prometheus-data
          readOnly: false
      serviceAccount: prometheus-k8s
      serviceAccountName: prometheus-k8s
      terminationGracePeriodSeconds: 60
      volumes:
      - configMap:
          defaultMode: 511
          name: prometheus-k8s-config
        name: config
      - configMap:
          defaultMode: 511
          name: prometheus-k8s-rules
        name: rules
      - name: data
        persistentVolumeClaim:
          claimName: cephfs-pvc
  updateStrategy:
    type: RollingUpdate
  • Logs:
[root@k8s-1 prometheus]# kubectl logs prometheus-k8s-0 -n monitoring
level=info ts=2018-01-20T03:16:32.966070249Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
level=info ts=2018-01-20T03:16:32.966225361Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
level=info ts=2018-01-20T03:16:32.966252185Z caller=main.go:217 host_details="(Linux 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 prometheus-k8s-0 (none))"
level=info ts=2018-01-20T03:16:32.969789371Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-01-20T03:16:32.971388907Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2018-01-20T03:16:32.971596811Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=error ts=2018-01-20T03:16:59.781338012Z caller=main.go:323 msg="Opening storage failed" err="invalid block sequence: block time ranges overlap (1516348800000, 1516356000000)"
[root@k8s-1 prometheus]# 
[root@k8s-1 prometheus]# kubectl logs prometheus-k8s-1 -n monitoring
level=info ts=2018-01-20T03:15:22.701351679Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
level=info ts=2018-01-20T03:15:22.70148418Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
level=info ts=2018-01-20T03:15:22.701512333Z caller=main.go:217 host_details="(Linux 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 prometheus-k8s-1 (none))"
level=info ts=2018-01-20T03:15:22.705824203Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-01-20T03:15:22.707629775Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2018-01-20T03:15:22.707837323Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=error ts=2018-01-20T03:15:54.775639791Z caller=main.go:323 msg="Opening storage failed" err="invalid block sequence: block time ranges overlap (1516348800000, 1516356000000)"
[root@k8s-1 prometheus]# kubectl describe po/prometheus-k8s-0 -n monitoring
Name:           prometheus-k8s-0
Namespace:      monitoring
Node:           k8s-3/172.16.1.8
Start Time:     Fri, 19 Jan 2018 17:59:38 +0800
Labels:         app=prometheus
                controller-revision-hash=prometheus-k8s-7d86dfbd86
                prometheus=k8s
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"StatefulSet","namespace":"monitoring","name":"prometheus-k8s","uid":"7593d8ac-fcff-11e7-9333-fa163e48f857"...
Status:         Running
IP:             10.244.2.54
Created By:     StatefulSet/prometheus-k8s
Controlled By:  StatefulSet/prometheus-k8s
Containers:
  prometheus:
    Container ID:  docker://98faabe55fb71050aacd776d349a6567c25c339117159356eedc10cbc19ef02a
    Image:         quay.io/prometheus/prometheus:v2.0.0
    Image ID:      docker-pullable://quay.io/prometheus/prometheus@sha256:53afe934a8d497bb703dbbf7db273681a56677775c462833da8d85015471f7a3
    Port:          9090/TCP
    Args:
      --config.file=/etc/prometheus/config/prometheus.yaml
      --storage.tsdb.path=/cephfs/prometheus/data
      --storage.tsdb.retention=180d
      --web.route-prefix=/
      --web.enable-lifecycle
      --web.enable-admin-api
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 20 Jan 2018 11:11:00 +0800
      Finished:     Sat, 20 Jan 2018 11:11:29 +0800
    Ready:          False
    Restart Count:  84
    Limits:
      cpu:     500m
      memory:  500Mi
    Requests:
      cpu:        100m
      memory:     200Mi
    Liveness:     http-get http://:web/status delay=30s timeout=3s period=5s #success=1 #failure=10
    Readiness:    http-get http://:web/status delay=0s timeout=3s period=5s #success=1 #failure=6
    Environment:  <none>
    Mounts:
      /cephfs/prometheus/data from data (rw)
      /etc/prometheus/config from config (rw)
      /etc/prometheus/rules from rules (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-x8xzh (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-k8s-config
    Optional:  false
  rules:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-k8s-rules
    Optional:  false
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  cephfs-pvc
    ReadOnly:   false
  prometheus-k8s-token-x8xzh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-token-x8xzh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.alpha.kubernetes.io/notReady:NoExecute for 300s
                 node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason      Age                  From            Message
  ----     ------      ----                 ----            -------
  Normal   Pulled      15m (x83 over 17h)   kubelet, k8s-3  Container image "quay.io/prometheus/prometheus:v2.0.0" already present on machine
  Warning  FailedSync  23s (x1801 over 7h)  kubelet, k8s-3  Error syncing pod
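
For reference, the timestamps in the “invalid block sequence” error are Unix epoch milliseconds, and each TSDB block directory records its ULID and time range in a meta.json file. Below is a minimal sketch for mapping the error to specific block directories; the data path matches the --storage.tsdb.path above, the commands are assumed to run from wherever that volume is mounted, and meta.json is assumed to be pretty-printed (one field per line), which is how Prometheus 2.x writes it.

# Translate the epoch-millisecond bounds from the error into wall-clock time
# (1516348800000 ms -> 1516348800 s; GNU date assumed).
date -u -d @1516348800   # start of the overlapping range
date -u -d @1516356000   # end of the overlapping range (a 2-hour block)

# Print each block's ULID and time range to spot the overlapping directories.
for d in /cephfs/prometheus/data/01*/; do
  echo "== $d"
  grep -E '"(ulid|minTime|maxTime)"' "${d}meta.json"
done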

Any suggestions?

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 29 (4 by maintainers)

Most upvoted comments

Here’s how it went for me (running a Docker container with prom/prometheus:v2.3.0). The OS was rebooted manually, and after the reboot Prometheus kept restarting with:

level=error ts=2018-07-09T09:44:19.761219359Z caller=main.go:597 err="Opening storage failed invalid block sequence: block time ranges overlap: [mint: 1530856800000, maxt: 1530864000000, range: 2h0m0s, blocks: 2]: <ulid: 01CHQD40DG2QE2ZE3MFMMQ1VFS, mint: 1530856800000, maxt: 1530864000000, range: 2h0m0s>, <ulid: 01CHZ45KDMB5S64X6R3AQMWSXD, mint: 1530856800000, maxt: 1530878400000, range: 6h0m0s>\n[mint: 1530871200000, maxt: 1530878400000, range: 2h0m0s, blocks: 2]: <ulid: 01CHZ45KDMB5S64X6R3AQMWSXD, mint: 1530856800000, maxt: 1530878400000, range: 6h0m0s>, <ulid: 01CHQTVEXG910WRSSS7S6D264W, mint: 1530871200000, maxt: 1530878400000, range: 2h0m0s>"

I stopped the container and checked the volume data:

# ls -lh
total 36K
drwxr-xr-x 3 nobody nogroup 4.0K Jul  5 09:00 01CHMTQB0CQNF49HZ7CNR2105S
drwxr-xr-x 3 nobody nogroup 4.0K Jul  6 03:00 01CHPRGWBMQJVZ626S20P5QRB6
drwxr-xr-x 3 nobody nogroup 4.0K Jul  6 09:00 01CHQD40DG2QE2ZE3MFMMQ1VFS
drwxr-xr-x 3 nobody nogroup 4.0K Jul  6 09:00 01CHQD4138KK0ZTADMA90MT9N8
drwxr-xr-x 3 nobody nogroup 4.0K Jul  6 13:00 01CHQTVEXG910WRSSS7S6D264W
drwxr-xr-x 3 nobody nogroup 4.0K Jul  6 15:00 01CHR1Q65F38PZFHJZP1WG3PZ8
drwxr-xr-x 3 nobody nogroup 4.0K Jul  9 09:04 01CHZ45KDMB5S64X6R3AQMWSXD
drwxr-xr-x 3 nobody nogroup 4.0K Jul  9 09:04 01CHZ4JT0317TT5HYKKZKW24BJ.tmp
-rw-rw-r-- 1 nobody nogroup    0 Jul  9 09:22 lock
drwxr-xr-x 2 nobody nogroup 4.0K Jul  6 13:00 wal

# du -sh 01*
99M	01CHMTQB0CQNF49HZ7CNR2105S
94M	01CHPRGWBMQJVZ626S20P5QRB6
12M	01CHQD40DG2QE2ZE3MFMMQ1VFS
34M	01CHQD4138KK0ZTADMA90MT9N8
12M	01CHQTVEXG910WRSSS7S6D264W
13M	01CHR1Q65F38PZFHJZP1WG3PZ8
29G	01CHZ45KDMB5S64X6R3AQMWSXD
27G	01CHZ4JT0317TT5HYKKZKW24BJ.tmp

Note the last two directories; they're the heaviest. If you check the ULIDs mentioned in the logs, you will notice they match the directory names. After some experimenting with moving away the smaller directories whose IDs appear in the logs, I still ended up with the same error message: somehow Prometheus encounters the same time ranges in chunks from different directories (my speculation only; no idea what kind of satanic magic it runs by). So I did what seemed logical: created a backup directory and moved everything into it except the wal directory and the latest (heaviest) non-.tmp block directory. It then looked like this:

# ls -lh
total 16K
drwxr-xr-x 3 nobody nogroup 4.0K Jul  9 09:04 01CHZ45KDMB5S64X6R3AQMWSXD
drwxr-xr-x 8 root   root    4.0K Jul  9 09:45 bkp
-rw-rw-r-- 1 nobody nogroup    0 Jul  9 09:22 lock
drwxr-xr-x 2 nobody nogroup 4.0K Jul  6 13:00 wal

I started Prometheus again and, voilà, it works again, and the data is there and accessible (I can see it by running queries back to the very beginning of the monitoring history). Hope this helps somebody.
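
For anyone following the same route, the workaround above boils down to something like the following sketch. The container name and data path are placeholders, the ULID of the block to keep is the newest non-.tmp directory from the listing above, and Prometheus must be stopped before anything is moved.

# Stop Prometheus so nothing writes to the data directory while we move blocks.
docker stop <prometheus-container>

cd /path/to/prometheus-data   # the volume shown in the listings above
mkdir -p bkp

# Keep the wal directory and the newest (heaviest) non-.tmp block; move every
# other block directory, including leftover .tmp directories, into bkp/.
for d in 01*; do
  [ "$d" = "01CHZ45KDMB5S64X6R3AQMWSXD" ] && continue
  mv "$d" bkp/
done

docker start <prometheus-container>

Once Prometheus is healthy again and the data you care about is queryable, bkp/ can either be deleted or kept around in case the old blocks need to be inspected later.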

Sorry about that, this is a bug; the fix is here: https://github.com/prometheus/tsdb/pull/299. A new bug-fix release will be out soon.

Is there a way to recover from this error without flushing data out? I don’t want to lose a chunk of my metrics data because of this 😐

I just checked the logs further, and the issue had already appeared at runtime before, without crashing:

We are also on 2.2.0, and this issue has a few additional symptoms:

  1. It can crash after a while and even refuse to start afterwards.
  2. Such errors usually also come with the instance going “all crazy on IOPS” (i.e. huge I/O without any external workload).
  3. It doesn't look related to a “long history” in the TSDB and can appear even on a fresh instance with a small number of samples (i.e. within a day of wiping all data and restarting Prometheus).

I hope this helps diagnose the problem.