prometheus: Ingestion stops, probably due to deadlocked series maintenance
Hi folks,
I updated Prometheus from 0.16.2 to 0.17.0 and tried to reuse the old Prometheus configuration and data. However, I get an error on the status page and can't get any samples.
My configuration is very simple:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    scrape_timeout: 10s
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    target_groups:
      - targets: ['localhost:9090']
  - job_name: 'overwritten-default'
    scrape_interval: 5s
    scrape_timeout: 10s
    consul_sd_configs:
      - server: 'consul server'
    relabel_configs:
      - source_labels: ['__meta_consul_service_id']
        regex: '(.*)'
        target_label: 'job'
        replacement: '$1'
        action: 'replace'
      - source_labels: ['__meta_consul_service_address','__meta_consul_service_port']
        separator: ';'
        regex: '(.*);(.*)'
        target_label: '__address__'
        replacement: '$1:$2'
        action: 'replace'
      - source_labels: ['__meta_consul_service_id']
        regex: '^prometheus_.*'
        action: 'keep'
There aren't any useful debug logs or hints. What am I missing?
Thanks.
About this issue
- State: closed
- Created 8 years ago
- Comments: 34 (16 by maintainers)
In case it gets into that state again, a goroutine dump would be great. Then we could see which goroutine is deadlocked, if any. You can get it with:
curl 'http://your-prometheus-server:9090/debug/pprof/goroutine?debug=2'
Another explanation would be that your server is stuck writing a checkpoint file, e.g. because the underlying disk is very slow or blocked. (Perhaps that could happen on Amazon or other cloud providers if you are running out of IOPS quota?)
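If the goroutine dump mentioned above turns out to be long, a small script can help narrow down which goroutines to look at first. The following is only a rough sketch in Python, not part of Prometheus: the function name `summarize_goroutines` and the 5-minute threshold are made up for illustration. It relies on the header lines the Go runtime prints for each goroutine in a `debug=2` dump, such as `goroutine 18 [semacquire, 12 minutes]:`. A long wait does not prove a deadlock, it only marks candidates worth reading in full.

```python
import re
from collections import Counter

# Header line of each goroutine in a debug=2 dump, for example:
#   goroutine 18 [semacquire, 12 minutes]:
#   goroutine 7 [syscall, locked to thread]:
HEADER = re.compile(
    r"^goroutine (\d+) \[([^\],]+)(?:, (\d+) minutes)?[^\]]*\]:$"
)


def summarize_goroutines(dump_text, min_wait_minutes=5):
    """Count goroutines per state and list those waiting a long time.

    Hypothetical helper: min_wait_minutes is an arbitrary threshold,
    chosen only for illustration.
    """
    states = Counter()
    long_waiters = []
    for line in dump_text.splitlines():
        match = HEADER.match(line.strip())
        if not match:
            continue  # stack frame or blank line, not a goroutine header
        goroutine_id = int(match.group(1))
        state = match.group(2)
        minutes = match.group(3)
        states[state] += 1
        if minutes is not None and int(minutes) >= min_wait_minutes:
            long_waiters.append((goroutine_id, state, int(minutes)))
    return states, long_waiters
```

To use it, redirect the curl output to a file, read the file into a string, and pass that string to `summarize_goroutines`. Many goroutines stuck in the same state (for instance `semacquire`) for roughly the same number of minutes would be a hint that they are all waiting on the same lock.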