prometheus: Ingestion stops, probably due to deadlocked series maintenance

Hi folks,

I updated Prometheus from 0.16.2 to 0.17.0 and tried to reuse the old Prometheus configuration and data, but I got the error on the status page and I can't get any samples.

My configuration is very simple:

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    scrape_timeout: 10s

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    target_groups:
      - targets: ['localhost:9090']

  - job_name: 'overwritten-default'
    scrape_interval: 5s
    scrape_timeout: 10s
    consul_sd_configs:
      - server: 'consul server'

    relabel_configs:
      - source_labels: ['__meta_consul_service_id']
        regex:         '(.*)'
        target_label:  'job'
        replacement:   '$1'
        action:        'replace'
      - source_labels: ['__meta_consul_service_address','__meta_consul_service_port']
        separator:     ';'
        regex:         '(.*);(.*)'
        target_label:  '__address__'
        replacement:   '$1:$2'
        action:        'replace'
      - source_labels: ['__meta_consul_service_id']
        regex:         '^prometheus_.*'
        action:        'keep'

There are no useful debug logs or hints. What am I missing?

Thanks.

About this issue

State: closed
Created 8 years ago
Comments: 34 (16 by maintainers)

Most upvoted comments

In case it gets into that state again, a goroutine dump would be great. Then we could see which goroutine is deadlocked, if any. You get it with

curl 'http://your-prometheus-server:9090/debug/pprof/goroutine?debug=2'
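A full goroutine dump can run to thousands of lines, so it may help to summarize it before reading it in detail. The following is a minimal sketch, not part of Prometheus: it assumes the dump was saved to a file and that goroutine headers look like `goroutine 7 [semacquire, 12 minutes]:`, which is how the Go runtime normally prints them with `debug=2`. The function name `summarize_goroutines` and the `min_minutes` parameter are made up for illustration.

```python
import re
from collections import Counter

# Matches goroutine header lines such as:
#   goroutine 7 [semacquire, 12 minutes]:
#   goroutine 9 [running]:
HEADER = re.compile(r"^goroutine (\d+) \[([^\]]*)\]:")


def summarize_goroutines(dump: str, min_minutes: int = 0) -> Counter:
    """Count goroutines per wait state (hypothetical helper).

    Only goroutines that have been waiting for at least `min_minutes`
    are counted; headers without a duration count as 0 minutes.
    """
    counts = Counter()
    for line in dump.splitlines():
        match = HEADER.match(line)
        if not match:
            continue
        fields = match.group(2).split(", ")
        state = fields[0]
        minutes = 0
        for field in fields[1:]:
            if field.endswith(" minutes"):
                minutes = int(field.split()[0])
        if minutes >= min_minutes:
            counts[state] += 1
    return counts
```

To use it, save the curl output to a file, read it into a string, and pass it to `summarize_goroutines(text, min_minutes=5)`. Many goroutines sitting in `semacquire` for a long time would be one hint worth looking at more closely, although it is not proof of a deadlock on its own.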

Another explanation would be that your server is stuck writing a checkpoint file, e.g. because the underlying disk is very slow or blocked. (Perhaps that could happen on Amazon or other cloud providers if you are running out of IOPS quota?)
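One rough way to check the slow-disk theory is to time a small synced write on the volume that holds the Prometheus data directory. This is only a sketch under assumptions: the helper `time_fsync_write` is hypothetical, it does not reproduce how Prometheus writes its checkpoint, and a single measurement is just a coarse signal.

```python
import os
import tempfile
import time


def time_fsync_write(directory: str, size_bytes: int = 1 << 20) -> float:
    """Write `size_bytes` to a temp file in `directory`, fsync it,
    and return the elapsed time in seconds (hypothetical helper)."""
    payload = b"\0" * size_bytes
    with tempfile.NamedTemporaryFile(dir=directory) as tmp:
        start = time.perf_counter()
        tmp.write(payload)
        tmp.flush()
        # Force the data to the device so the timing reflects the disk,
        # not just the page cache.
        os.fsync(tmp.fileno())
        elapsed = time.perf_counter() - start
    return elapsed
```

Calling it a few times against the storage directory (the path depends on your setup) and comparing with a local disk might show whether writes are stalling. If a 1 MiB synced write regularly takes seconds, throttled or blocked I/O becomes a more plausible cause.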

beorn7 on Mar 4, 2016