prometheus: External Labels not available in alert Annotations

What did you do?

Defined an external label in the Prometheus config:

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    monitor: acme-logs-dev

In a rule I had:

ALERT container_eating_memory
  IF sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (instance, name) > 2500000000
  FOR 5m
  ANNOTATIONS {
    description="{{ $labels.container_label_com_docker_swarm_task_name }} is eating up a LOT of memory. Memory consumption of {{ $labels.container_label_com_docker_swarm_task_name }} is at {{ humanize $value}}.", 
    summary="{{$labels.monitor}} - HIGH MEMORY USAGE WARNING: TASK '{{ $labels.container_label_com_docker_swarm_task_name }}' on '{{ $labels.instance }}'"}

What did you expect to see?

The templated alert printing the monitor label.

What did you see instead? Under which circumstances?

The alert in the Prometheus alert view and the alert that ended up in Slack were both missing the external label: no monitor label listed on the Prometheus alerts page, and an empty string in Slack.
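
The external_labels from the global section are only attached when data leaves Prometheus (alerts sent to Alertmanager, federation, remote read/write); the $labels available to annotation templates are only the labels of the alerting expression’s result, which is why $labels.monitor comes out empty here. One rough workaround sketch (not taken from this thread, and it hard-codes the external label’s value again in the rule, which is exactly the duplication complained about in the comments below) is to attach the value to the query result itself, e.g. with label_replace():

# Sketch only: label_replace() adds monitor="acme-logs-dev" to every result
# series, so {{ $labels.monitor }} resolves during rule templating.
ALERT container_eating_memory
  IF label_replace(
       sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (instance, name),
       "monitor", "acme-logs-dev", "", ""
     ) > 2500000000
  FOR 5m
  ANNOTATIONS {
    summary="{{ $labels.monitor }} - HIGH MEMORY USAGE WARNING: TASK '{{ $labels.container_label_com_docker_swarm_task_name }}' on '{{ $labels.instance }}'"
  }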

Environment

  • System information:

  • Prometheus version:

    1.7.1

  • Alertmanager version:

    0.8.0

  • Prometheus configuration file:

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    monitor: acme-logs-dev
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager.service.acme:9093
    scheme: http
    timeout: 10s
rule_files:
- /etc/prometheus/tasks.rules
- /etc/prometheus/host.rules
- /etc/prometheus/containers.rules
scrape_configs:
- job_name: cadvisor
  scrape_interval: 5s
  scrape_timeout: 5s
  metrics_path: /metrics
  scheme: http
  dns_sd_configs:
  - names:
    - tasks.cadvisor
    refresh_interval: 30s
    type: A
    port: 8080
- job_name: node-exporter
  scrape_interval: 5s
  scrape_timeout: 5s
  metrics_path: /metrics
  scheme: http
  dns_sd_configs:
  - names:
    - tasks.nodeexporter
    refresh_interval: 30s
    type: A
    port: 9100
- job_name: prometheus
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9090
- job_name: blackbox-http
  params:
    module:
    - http_2xx
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /probe
  scheme: http
  dns_sd_configs:
  - names:
    - tasks.kibana.service.acme
    refresh_interval: 30s
    type: A
    port: 5601
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*)(:80)?
    target_label: __param_target
    replacement: ${1}
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: ${1}
    action: replace
  - source_labels: []
    separator: ;
    regex: .*
    target_label: __address__
    replacement: blackboxexporter:9115
    action: replace
  - source_labels: [__meta_dns_name]
    separator: ;
    regex: tasks\.(.*)
    target_label: job
    replacement: ${1}
    action: replace
- job_name: blackbox-tcp
  params:
    module:
    - tcp_connect
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /probe
  scheme: http
  dns_sd_configs:
  - names:
    - tasks.elasticsearch.service.acme
    refresh_interval: 30s
    type: A
    port: 9200
  - names:
    - tasks.logstash.service.acme
    refresh_interval: 30s
    type: A
    port: 5000
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*)(:80)?
    target_label: __param_target
    replacement: ${1}
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: ${1}
    action: replace
  - source_labels: []
    separator: ;
    regex: .*
    target_label: __address__
    replacement: blackboxexporter:9115
    action: replace
  - source_labels: [__meta_dns_name]
    separator: ;
    regex: tasks\.(.*)
    target_label: job
    replacement: ${1}
    action: replace

  • Alertmanager configuration file:

route:
  receiver: 'slack'
  repeat_interval: 3h #3h
  group_interval: 5m #5m
  group_wait: 1m #1m
  routes:
  #- receiver: 'logstash'
  #  continue: true
  - receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - send_resolved: true
        api_url: 'xxx'
        username: 'Prometheus - Alerts'
        channel: '#service-alerts'
        title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        icon_emoji: ':dart:'
  - name: 'logstash'
    webhook_configs:
      # Whether or not to notify about resolved alerts.
      - send_resolved: true
        # The endpoint to send HTTP POST requests to.
        url: 'http://logstash:8080/'

  • Logs:

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 52 (44 by maintainers)

Most upvoted comments

Having just implemented another workaround for this for the nth time at SoundCloud triggered me to do a quick straw poll among the Prometheus people here. Everybody wants this feature.

I don’t know if we should continue the discussion. My suspicion is that all arguments have been tabled already. If there is still no consensus on doing this, we could do a formal vote in prometheus-team@.

In general the point stands that an alert should not care about the Prometheus it is running in.

I think the point, at least for us, is that humans do care about the Prometheus that alerts are generated from, even if you don’t or don’t think they should. Even if we ignore this specific scenario, the fact remains that external labels or alert label rewrites can add contextual information that is helpful to the humans receiving the alerts. If you are taking a quick glance at multiple alerts, the summary and description annotations seem to be the two best places to put this kind of useful information.

Some, but not all, labels being available is confusing at best. If the human decides that combining this information in a specific way is useful to them and their organization, then isn’t that something worth considering? If you can get this done in a single place (the alert definition) rather than in multiple places, why wouldn’t you? It’s generally what the user would expect.

Finally fixed by #5463.
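
For reference, a minimal sketch of what this enables, assuming the fix referenced above exposes the server’s external label set to rule templating under the name $externalLabels (rule shown here converted to the 2.x YAML format):

groups:
- name: containers
  rules:
  - alert: container_eating_memory
    expr: sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) by (instance, name) > 2500000000
    for: 5m
    annotations:
      # $externalLabels reads the server's external_labels directly,
      # independent of the labels on the query result.
      summary: "{{ $externalLabels.monitor }} - HIGH MEMORY USAGE WARNING: TASK '{{ $labels.container_label_com_docker_swarm_task_name }}' on '{{ $labels.instance }}'"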

I’m not quite getting what you’re suggesting here - how are the labels being used in step 1? Having external labels go in before template expansion will break cases where the alert produces a label name that is also among the external labels, but the alert templating was either removing it or depending on it not being there.

As said, part of my suggestion is that the label set which includes the external labels is accessed under a different name. That should cover the case where a template depends on the external label not being there.

The case where an alerting expression creates a label that is then removed by the labels section in the alert so that it can then be added again via the external labels would indeed be changed by my suggestion. In case that’s of any practical relevance, we can still go down the road of a feature flag.

In addition I think that annotations and label templating should both have access to the exact same inputs, as it’d be confusing otherwise (and we’ve had requests for each to depend on the other so there’s no obvious “right order”).

We could define an order if that helps getting a useful feature in.

BTW: In which form would people make labels depend on annotations? Was that a request you deemed reasonable or are you just bringing it up to needlessly complicate the discussion?

You really have dashboards within a team/receiver that have a non-trivial number of Grafana variable names for the zone and other similar labels?

Some have one, some have none, some use PromDash, some might use something completely different.

Having to add this manually to every single alerting rule is a sign of a layering violation.

I don’t have to add it. It’s in the external_labels of the Prometheus server. I just want to access it to generate descriptions that read nicely.

For both topics above, I declare my discussion quota exhausted. I guess there are not many I need to convince here, perhaps Brian is the only one. My ambition to do so is limited. And so are my resources.

Actual technical issues are a different story, and those I will continue to discuss; see the next comment.

It’s still a breaking change per our stability guarantees. How easy a change is to deal with in principle doesn’t stop it from being a breaking change.

Which still doesn’t change that it has very little weight in this discussion.

This is by design, you want alert notification templates.
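
For context, the suggestion is to do the formatting on the Alertmanager side instead: Prometheus attaches its external_labels to the alerts it sends out, so a notification template can read them as ordinary alert labels. A minimal sketch against the slack_configs from the report above (only the title line is changed; the monitor label name is taken from the reporter’s configuration):

receivers:
  - name: 'slack'
    slack_configs:
      - send_resolved: true
        api_url: 'xxx'
        channel: '#service-alerts'
        # The external label arrives as a regular alert label at Alertmanager,
        # so it can be read here rather than inside the Prometheus-side annotation.
        title: "{{ range .Alerts }}{{ .Labels.monitor }} - {{ .Annotations.summary }}\n{{ end }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"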

I think there are some valid use cases where doing this with alert notification templates isn’t desirable. For instance, consider dashboard URLs that might contain the datacenter (an external label). Some alerts might include dashboard URLs with the datacenter, some might not. And it would be super useful for the URLs that appear in the Prometheus UI to contain the correct datacenter rather than a placeholder.

Is there a way of currently doing this, or should this issue be reopened?