prometheus: Bug in ConsulSD when using multiple jobs in the same config.

What did you do?

Added multiple jobs, both using ConsulSD.

For example:

    - job_name: 'myconsuljob1'
      scrape_interval: 1m
      scrape_timeout: 55s
      consul_sd_configs:
      - server: 'my1.foo.com:8500'
      ...

    - job_name: 'myconsuljob2'
      scrape_interval: 1m
      scrape_timeout: 55s
      consul_sd_configs:
      - server: 'my2.foo.com:8500'

What did you expect to see?

Both jobs would discover their targets properly.

What did you see instead? Under which circumstances?

Only the first ConsulSD job in the config seems to work. If I switch the order of the jobs in the config, I get the other set of discovered hosts instead. If I change one of the target Consul servers, it seems to discover both until the Prometheus server is restarted, at which point it reverts to only discovering the first job in the config.
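
A single job listing both Consul servers under one consul_sd_configs block might work around this, since consul_sd_configs accepts multiple entries, but I haven't verified that it avoids the behaviour above. A sketch (the job name 'myconsuljob-combined' is just a placeholder):

    - job_name: 'myconsuljob-combined'
      scrape_interval: 1m
      scrape_timeout: 55s
      consul_sd_configs:
      - server: 'my1.foo.com:8500'
      - server: 'my2.foo.com:8500'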

Environment

  • System information:

Linux 3.10.0-514.21.1.el7.x86_64 x86_64

  • Prometheus version:

1.7.1

  • Prometheus configuration file:
    global:
      scrape_interval: 60s
      scrape_timeout: 55s
      evaluation_interval: 55s
      external_labels:
        k8s_datacenter: xxx
        k8s_cluster: xxx

    rule_files:
    - "/etc/config/*.rule"

    scrape_configs:
    - job_name: ingress-check
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
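      # blackbox exporter pattern: copy the original target into the ?target= probe
      # parameter and the "instance" label, then point __address__ at blackbox-prod:9115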
      - regex: (.*)(:80)?
        replacement: ${1}
        source_labels:
        - __address__
        target_label: __param_target
      - regex: (.*)
        replacement: ${1}
        source_labels:
        - __param_target
        target_label: instance
      - regex: .*
        replacement: blackbox-prod:9115
        source_labels: []
        target_label: __address__
      static_configs:
      - labels:
          sourceenv: xx1
          sourcesvc: ingress
        targets:
        - https://xxx/
      - labels:
          sourceenv: xx2
          sourcesvc: ingress
        targets:
        - https://xxx/
      - labels:
          sourceenv: xx1
          sourcesvc: consul
        targets:
        - http://xx11001.xxfooxx.com:8500/v1/status/leader
        - http://xx11002.xxfooxx.com:8500/v1/status/leader
        - http://xx11003.xxfooxx.com:8500/v1/status/leader
      - labels:
          sourceenv: xx2-prod
          sourcesvc: consul
        targets:
        - http://xx21001.xxfooxx.com:8500/v1/status/leader
        - http://xx21002.xxfooxx.com:8500/v1/status/leader
        - http://xx21003.xxfooxx.com:8500/v1/status/leader
      - labels:
          sourceenv: xx1-prod
          sourcesvc: etcd
        targets:
        - http://10.XX.XX.XXX:2379/health
        - http://10.XX.XX.XXY:2379/health
        - http://10.XX.XX.XXZ:2379/health
      - labels:
          sourceenv: xx2-prod
          sourcesvc: etcd
        targets:
        - http://10.XX.XX.XXX:2379/health
        - http://10.XX.XX.XXY:2379/health
        - http://10.XX.XX.XXZ:2379/health

    - job_name: 'xx2-federate'
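      # pull selected Kubernetes job series from the downstream Prometheus via /federate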
      scheme: https
      tls_config:
        insecure_skip_verify: true
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
        - '{job="kubernetes-apiservers"}'
        - '{job="kubernetes-service-endpoints"}'
        - '{job="kubernetes-pods"}'
        - '{job="kubernetes-nodes"}'
      static_configs:
        - targets:
          - 'prometheus.us-central-1xx2.core'

    - job_name: 'xx1-federate'
      scheme: https
      tls_config:
        insecure_skip_verify: true
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
        - '{job="kubernetes-apiservers"}'
        - '{job="kubernetes-service-endpoints"}'
        - '{job="kubernetes-pods"}'
        - '{job="kubernetes-nodes"}'
      static_configs:
        - targets:
          - 'prometheus.us-central-1xx1.core'

    - job_name: 'k8s-metal-xx2'
      scrape_interval: 2m
      scrape_timeout: 115s
      consul_sd_configs:
      - server: 'xx21002.xxfooxx:8500'
        services: ['nomad-client']
        scheme: http
      relabel_configs:
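      # turn Consul tags of the form "label:<name>=<value>" into Prometheus labels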
      - source_labels: [__meta_consul_tags]
        separator:     ','
        regex:         label:([^=]+)=([^,]+)
        target_label:  ${1}
        replacement:   ${2}
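      # rewrite targets on the Nomad client port 4646 to scrape port 9101 instead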
      - source_labels: ['__address__']
        separator:     ':'
        regex:         '(.*):(4646)'
        target_label:  '__address__'
        replacement:   '${1}:9101'

    - job_name: 'k8s-metal-xx1'
      scrape_interval: 2m
      scrape_timeout: 115s
      consul_sd_configs:
      - server: 'xx11003.xxfooxx:8500'
        services: ['nomad-client']
        scheme: http
      relabel_configs:
      - source_labels: [__meta_consul_tags]
        separator:     ','
        regex:         label:([^=]+)=([^,]+)
        target_label:  ${1}
        replacement:   ${2}
      - source_labels: ['__address__']
        separator:     ':'
        regex:         '(.*):(4646)'
        target_label:  '__address__'
        replacement:   '${1}:9101'

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 28 (17 by maintainers)

Most upvoted comments

I’ve seen Prometheus instances scraping thousands of Consul targets (the limit really is the memory necessary for metric names and points, not Consul here). For a Consul cluster itself I’ve seen 5+ datacenters together with tens of thousands of nodes. Most of the patches to achieve that are upstream; some are on https://github.com/criteo-forks/consul/tree/1.2.2-criteo

@krasi-georgiev I’m reasonably sure the bug is specific to the Consul SD implementation and hasn’t been fixed in dev-2.0. Please feel free to give it a shot here; we’ll release Prometheus v1.8 before the big 2.0 release. I’d love to review a PR that fixes this bug (mention me in the PR in case you’re able to find and fix it).