prometheus: Bug in Consul SD when using multiple jobs in the same config.
What did you do?
Added multiple jobs, both using Consul SD.
For example:
- job_name: 'myconsuljob1'
  scrape_interval: 1m
  scrape_timeout: 55s
  consul_sd_configs:
    - server: 'my1.foo.com:8500'
  ...
- job_name: 'myconsuljob2'
  scrape_interval: 1m
  scrape_timeout: 55s
  consul_sd_configs:
    - server: 'my2.foo.com:8500'
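Filled out, a minimal self-contained version of that two-job layout looks roughly like this (the hostnames are the same placeholders as above, and the 'nomad-client' service is just borrowed from my full config below):

scrape_configs:
  - job_name: 'myconsuljob1'
    scrape_interval: 1m
    scrape_timeout: 55s
    consul_sd_configs:
      # placeholder Consul agent address
      - server: 'my1.foo.com:8500'
        services: ['nomad-client']
  - job_name: 'myconsuljob2'
    scrape_interval: 1m
    scrape_timeout: 55s
    consul_sd_configs:
      # second job, pointing at a different Consul agent
      - server: 'my2.foo.com:8500'
        services: ['nomad-client']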
What did you expect to see?
Both jobs would discover targets properly.
What did you see instead? Under which circumstances?
Only the first Consul SD job in the config seems to work. If I switch the order of the jobs in the config, I get the other set of discovered hosts instead. If I change one of the target Consul servers, both jobs seem to discover hosts until the Prometheus server is restarted; after a restart it reverts to only discovering the first job in the config.
Environment
- System information: Linux 3.10.0-514.21.1.el7.x86_64 x86_64
- Prometheus version: 1.7.1
- Prometheus configuration file:
global:
  scrape_interval: 60s
  scrape_timeout: 55s
  evaluation_interval: 55s
  external_labels:
    k8s_datacenter: xxx
    k8s_cluster: xxx
rule_files:
  - "/etc/config/*.rule"
scrape_configs:
  - job_name: ingress-check
    metrics_path: /probe
    params:
      module:
        - http_2xx
    relabel_configs:
      - regex: (.*)(:80)?
        replacement: ${1}
        source_labels:
          - __address__
        target_label: __param_target
      - regex: (.*)
        replacement: ${1}
        source_labels:
          - __param_target
        target_label: instance
      - regex: .*
        replacement: blackbox-prod:9115
        source_labels: []
        target_label: __address__
    static_configs:
      - labels:
          sourceenv: xx1
          sourcesvc: ingress
        targets:
          - https://xxx/
      - labels:
          sourceenv: xx2
          sourcesvc: ingress
        targets:
          - https://xxx/
      - labels:
          sourceenv: xx1
          sourcesvc: consul
        targets:
          - http://xx11001.xxfooxx.com:8500/v1/status/leader
          - http://xx11002.xxfooxx.com:8500/v1/status/leader
          - http://xx11003.xxfooxx.com:8500/v1/status/leader
      - labels:
          sourceenv: xx2-prod
          sourcesvc: consul
        targets:
          - http://xx21001.xxfooxx.com:8500/v1/status/leader
          - http://xx21002.xxfooxx.com:8500/v1/status/leader
          - http://xx21003.xxfooxx.com:8500/v1/status/leader
      - labels:
          sourceenv: xx1-prod
          sourcesvc: etcd
        targets:
          - http://10.XX.XX.XXX:2379/health
          - http://10.XX.XX.XXY:2379/health
          - http://10.XX.XX.XXZ:2379/health
      - labels:
          sourceenv: xx2-prod
          sourcesvc: etcd
        targets:
          - http://10.XX.XX.XXX:2379/health
          - http://10.XX.XX.XXY:2379/health
          - http://10.XX.XX.XXZ:2379/health
  - job_name: 'xx2-federate'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-apiservers"}'
        - '{job="kubernetes-service-endpoints"}'
        - '{job="kubernetes-pods"}'
        - '{job="kubernetes-nodes"}'
    static_configs:
      - targets:
          - 'prometheus.us-central-1xx2.core'
  - job_name: 'xx1-federate'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-apiservers"}'
        - '{job="kubernetes-service-endpoints"}'
        - '{job="kubernetes-pods"}'
        - '{job="kubernetes-nodes"}'
    static_configs:
      - targets:
          - 'prometheus.us-central-1xx1.core'
  - job_name: 'k8s-metal-xx2'
    scrape_interval: 2m
    scrape_timeout: 115s
    consul_sd_configs:
      - server: 'xx21002.xxfooxx:8500'
        services: ['nomad-client']
        scheme: http
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        separator: ','
        regex: label:([^=]+)=([^,]+)
        target_label: ${1}
        replacement: ${2}
      - source_labels: ['__address__']
        separator: ':'
        regex: '(.*):(4646)'
        target_label: '__address__'
        replacement: '${1}:9101'
  - job_name: 'k8s-metal-xx1'
    scrape_interval: 2m
    scrape_timeout: 115s
    consul_sd_configs:
      - server: 'xx11003.xxfooxx:8500'
        services: ['nomad-client']
        scheme: http
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        separator: ','
        regex: label:([^=]+)=([^,]+)
        target_label: ${1}
        replacement: ${2}
      - source_labels: ['__address__']
        separator: ':'
        regex: '(.*):(4646)'
        target_label: '__address__'
        replacement: '${1}:9101'
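As a side note on the relabeling in the two k8s-metal-* jobs above: Prometheus anchors the relabel regex against the joined source label values and substitutes capture groups via ${1}, ${2}. A standalone sketch of just the address rewrite (the example address is illustrative, not from this setup):

relabel_configs:
  # Rewrite a Nomad client address like "10.0.0.5:4646" to the exporter
  # port 9101; addresses that do not end in ":4646" are left unchanged.
  - source_labels: ['__address__']
    regex: '(.*):(4646)'
    target_label: '__address__'
    replacement: '${1}:9101'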
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 28 (17 by maintainers)
I’ve seen Prometheus instances scraping thousands of Consul targets (the limit really is the memory needed for metric names and points, not Consul). For a Consul cluster itself I’ve seen 5+ datacenters with tens of thousands of nodes. Most of the patches needed to achieve that are upstream; some are at https://github.com/criteo-forks/consul/tree/1.2.2-criteo
@krasi-georgiev I’m reasonably sure the bug is specific to the Consul SD implementation and hasn’t been fixed in dev-2.0. Please feel free to give it a shot here; we’ll release Prometheus v1.8 before the big 2.0 release. I’d love to review a PR that fixes this bug (mention me in the PR if you’re able to find and fix it).