alertmanager: Duplicate notifications after upgrading to Alertmanager 0.15
After upgrading from Alertmanager 0.13 to 0.15.2 in a two-member cluster, we've started receiving duplicate notifications in Slack. It worked flawlessly with 0.13. Oddly, we receive the two notifications at practically the same time; they never seem to be more than a couple of seconds apart.
- System information:
Linux pmm-server 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux
Both instances using ntp.
- Alertmanager version:
alertmanager, version 0.15.2 (branch: HEAD, revision: d19fae3bae451940b8470abb680cfdd59bfa7cfa)
  build user:       root@3101e5b68a55
  build date:       20180814-10:53:39
  go version:       go1.10.3
Cluster status reports up:
Uptime: 2018-09-09T19:03:01.726517546Z
Cluster Status
  Name: 01CPZVEFADF9GE2G9F2CTZZZQ6
  Status: ready
  Peers:
  - Name: 01CPZV0HDRQY5M5TW6FDS31MKS
    Address: <secret>:9094
  - Name: 01CPZVEFADF9GE2G9F2CTZZZQ6
    Address: <secret>:9094
- Prometheus version:
Irrelevant
- Alertmanager configuration file:
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_hello: localhost
  smtp_require_tls: true
  slack_api_url: <secret>
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: ops-slack
  group_by:
  - alertname
  - group
  routes:
  - receiver: ops-pager
    match:
      alertname: MySQLDown
  - receiver: ops-pager
    match:
      offline: critical
    continue: true
    routes:
    - receiver: ops-slack
      match:
        offline: critical
  - receiver: ops-pager
    match:
      pager: "yes"
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
receivers:
- name: ops-slack
  slack_configs:
  - send_resolved: true
    http_config: {}
    api_url: <secret>
    channel: alerts
    username: prometheus
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
    title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing
      | len }}{{ end }}] {{ .CommonAnnotations.summary }}'
    title_link: '{{ template "slack.default.titlelink" . }}'
    pretext: '{{ template "slack.default.pretext" . }}'
    text: |-
      {{ range .Alerts }}
      *Alert:* {{ .Annotations.summary }} - *{{ .Labels.severity | toUpper }}* on {{ .Labels.instance }}
      *Description:* {{ .Annotations.description }}
      *Details:*
      {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
      {{ end }}
    footer: '{{ template "slack.default.footer" . }}'
    fallback: '{{ template "slack.default.fallback" . }}'
    icon_emoji: '{{ template "slack.default.iconemoji" . }}'
    icon_url: http://cdn.rancher.com/wp-content/uploads/2015/05/27094511/prometheus-logo-square.png
- Logs: no errors to speak of
level=info ts=2018-09-06T11:10:16.242620478Z caller=main.go:174 msg="Starting Alertmanager" version="(version=0.15.2, branch=HEAD, revision=d19fae3bae451940b8470abb680cfdd59bfa7cfa)"
level=info ts=2018-09-06T11:10:16.242654842Z caller=main.go:175 build_context="(go=go1.10.3, user=root@3101e5b68a55, date=20180814-10:53:39)"
level=info ts=2018-09-06T11:10:16.313588161Z caller=main.go:322 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-09-06T11:10:16.313610447Z caller=cluster.go:570 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-09-06T11:10:16.315607053Z caller=main.go:398 msg=Listening address=:9093
level=info ts=2018-09-06T11:10:18.313944578Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=0 before=0 now=2 elapsed=2.000297466s
level=info ts=2018-09-06T11:10:22.314297199Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=2 before=2 now=1 elapsed=6.000647448s
level=info ts=2018-09-06T11:10:30.315059802Z caller=cluster.go:587 component=cluster msg="gossip settled; proceeding" elapsed=14.001414171s
level=info ts=2018-09-09T18:55:23.930653016Z caller=main.go:426 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2018-09-09T18:55:25.067197275Z caller=main.go:174 msg="Starting Alertmanager" version="(version=0.15.2, branch=HEAD, revision=d19fae3bae451940b8470abb680cfdd59bfa7cfa)"
level=info ts=2018-09-09T18:55:25.067233709Z caller=main.go:175 build_context="(go=go1.10.3, user=root@3101e5b68a55, date=20180814-10:53:39)"
level=info ts=2018-09-09T18:55:25.128486689Z caller=main.go:322 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-09-09T18:55:25.128488742Z caller=cluster.go:570 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-09-09T18:55:25.131985874Z caller=main.go:398 msg=Listening address=:9093
level=info ts=2018-09-09T18:55:27.128662897Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=0 before=0 now=3 elapsed=2.000096829s
level=info ts=2018-09-09T18:55:31.128969079Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=2 before=3 now=2 elapsed=6.000402722s
level=info ts=2018-09-09T18:55:33.129130021Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=3 before=2 now=1 elapsed=8.000564176s
level=info ts=2018-09-09T18:55:37.129427658Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=5 before=1 now=2 elapsed=12.000855483s
level=info ts=2018-09-09T18:55:45.130073007Z caller=cluster.go:587 component=cluster msg="gossip settled; proceeding" elapsed=20.001506309s
About this issue
- Original URL
- State: open
- Created 6 years ago
- Reactions: 7
- Comments: 61 (23 by maintainers)
FWIW I’ve been following this as I’ve been having the same issues.
I have both UDP and TCP open, and I verified that both were reachable with Ncat.
Today I added the --cluster.listen-address= and --cluster.advertise-address= flags.
This has fixed the issue from what I can see.
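For anyone landing here, a minimal sketch of what that invocation can look like on a plain (non-Docker) host; the addresses and peer below are hypothetical placeholders, not values from this report:

alertmanager \
  --config.file=/etc/alertmanager/config.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.advertise-address=192.0.2.10:9094 \
  --cluster.peer=192.0.2.11:9094

The second member would run the same command with its own advertise address and the first member as its --cluster.peer.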
I also have this problem and I don't know how to resolve it.
This is still an issue in 2019. Can you let me know how to access the payloads?
I’m running alertmanager in Docker Swarm, the two alertmanagers communicate over a dedicated Overlay network (no restrictions on the overlay network, all ports open). However, they both also have an additional network attached to expose them through my load balancer.
As per earlier posts, it's impossible to know the IP when AM starts up, so I have --cluster.advertise-address=:9094 set instead, which seemingly results in Alertmanager using a 172.x address (the Docker host range) rather than the IPs of the overlay network. This leads to (at least) one of the AMs continuously flapping, resulting in the duplicate alerts.
Is there a good solution to deal with this? Or is the current workaround to fetch the IP during startup and set it as the advertise address? Or to pin the instances to a host, create port mappings for TCP and UDP, and use the host address?
When using Docker it's important to set --cluster.advertise-address so the cluster nodes know their real IP, and to publish the gossip port over both tcp and udp. When using docker-compose this can be done as in the sketch below. Doing both fixed the issue with duplicate notifications for me.
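A minimal docker-compose sketch of that setup, assuming Alertmanager 0.15.x; the image tag, host IP, peer hostname (alertmanager-peer) and config path are placeholders:

version: "3.4"
services:
  alertmanager:
    image: prom/alertmanager:v0.15.2
    command:
      - --config.file=/etc/alertmanager/config.yml
      - --cluster.listen-address=0.0.0.0:9094
      # advertise an address the other node can actually reach (placeholder IP)
      - --cluster.advertise-address=192.0.2.10:9094
      # the second Alertmanager instance (placeholder hostname)
      - --cluster.peer=alertmanager-peer:9094
    ports:
      - "9093:9093"       # web UI / API
      - "9094:9094/tcp"   # cluster gossip over TCP
      - "9094:9094/udp"   # cluster gossip over UDP
    volumes:
      - ./alertmanager/config.yml:/etc/alertmanager/config.yml

The important parts are that --cluster.advertise-address carries an address the peer can reach and that port 9094 is published for both TCP and UDP; the second instance mirrors this with its own advertise address.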
As noted in the README, both UDP and TCP ports need to be open for HA mode:
https://github.com/prometheus/alertmanager#high-availability
From looking at the deployment, I only see a TCP endpoint being opened. If you open a UDP port and configure the AMs with this, the duplicate messages should go away.
For further support, please write to the users mailing list, prometheus-users@googlegroups.com, since this seems to be a usage question and not a bug.
That's not necessary. Your group_wait and group_interval values are too small. I suggest you follow up on the mailing list for help choosing better values for your Alertmanager config.
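Purely as an illustration of where those knobs live (the concrete values here are hypothetical, not a recommendation), the timers sit on the route block:

route:
  receiver: ops-slack
  group_wait: 1m        # how long to buffer the first alerts of a new group before notifying
  group_interval: 10m   # how long to wait before notifying about new alerts added to a group
  repeat_interval: 4h   # how long to wait before re-sending a still-firing notification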