alertmanager: Duplicate notifications after upgrading to Alertmanager 0.15
After upgrading from Alertmanager 0.13 to 0.15.2 in a two-member cluster, we've started receiving duplicate notifications in Slack. It worked flawlessly with 0.13. Oddly, we receive the two notifications at practically the same time; they never seem to be more than a couple of seconds apart.
- System information:
Linux pmm-server 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux
Both instances using ntp.
- Alertmanager version:
alertmanager, version 0.15.2 (branch: HEAD, revision: d19fae3bae451940b8470abb680cfdd59bfa7cfa)
  build user:       root@3101e5b68a55
  build date:       20180814-10:53:39
  go version:       go1.10.3
Cluster status reports up:
Uptime: 2018-09-09T19:03:01.726517546Z
Cluster Status
  Name: 01CPZVEFADF9GE2G9F2CTZZZQ6
  Status: ready
  Peers:
  - Name: 01CPZV0HDRQY5M5TW6FDS31MKS
    Address: <secret>:9094
  - Name: 01CPZVEFADF9GE2G9F2CTZZZQ6
    Address: <secret>:9094
- Prometheus version:
Irrelevant
- Alertmanager configuration file:
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_hello: localhost
  smtp_require_tls: true
  slack_api_url: <secret>
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: ops-slack
  group_by:
  - alertname
  - group
  routes:
  - receiver: ops-pager
    match:
      alertname: MySQLDown
  - receiver: ops-pager
    match:
      offline: critical
    continue: true
    routes:
    - receiver: ops-slack
      match:
        offline: critical
  - receiver: ops-pager
    match:
      pager: "yes"
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
receivers:
- name: ops-slack
  slack_configs:
  - send_resolved: true
    http_config: {}
    api_url: <secret>
    channel: alerts
    username: prometheus
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
    title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing
      | len }}{{ end }}] {{ .CommonAnnotations.summary }}'
    title_link: '{{ template "slack.default.titlelink" . }}'
    pretext: '{{ template "slack.default.pretext" . }}'
    text: |-
      {{ range .Alerts }}
      *Alert:* {{ .Annotations.summary }} - *{{ .Labels.severity | toUpper }}* on {{ .Labels.instance }}
      *Description:* {{ .Annotations.description }}
      *Details:*
      {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
      {{ end }}
    footer: '{{ template "slack.default.footer" . }}'
    fallback: '{{ template "slack.default.fallback" . }}'
    icon_emoji: '{{ template "slack.default.iconemoji" . }}'
    icon_url: http://cdn.rancher.com/wp-content/uploads/2015/05/27094511/prometheus-logo-square.png
- Logs: no errors to speak of
level=info ts=2018-09-06T11:10:16.242620478Z caller=main.go:174 msg="Starting Alertmanager" version="(version=0.15.2, branch=HEAD, revision=d19fae3bae451940b8470abb680cfdd59bfa7cfa)"
level=info ts=2018-09-06T11:10:16.242654842Z caller=main.go:175 build_context="(go=go1.10.3, user=root@3101e5b68a55, date=20180814-10:53:39)"
level=info ts=2018-09-06T11:10:16.313588161Z caller=main.go:322 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-09-06T11:10:16.313610447Z caller=cluster.go:570 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-09-06T11:10:16.315607053Z caller=main.go:398 msg=Listening address=:9093
level=info ts=2018-09-06T11:10:18.313944578Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=0 before=0 now=2 elapsed=2.000297466s
level=info ts=2018-09-06T11:10:22.314297199Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=2 before=2 now=1 elapsed=6.000647448s
level=info ts=2018-09-06T11:10:30.315059802Z caller=cluster.go:587 component=cluster msg="gossip settled; proceeding" elapsed=14.001414171s
level=info ts=2018-09-09T18:55:23.930653016Z caller=main.go:426 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2018-09-09T18:55:25.067197275Z caller=main.go:174 msg="Starting Alertmanager" version="(version=0.15.2, branch=HEAD, revision=d19fae3bae451940b8470abb680cfdd59bfa7cfa)"
level=info ts=2018-09-09T18:55:25.067233709Z caller=main.go:175 build_context="(go=go1.10.3, user=root@3101e5b68a55, date=20180814-10:53:39)"
level=info ts=2018-09-09T18:55:25.128486689Z caller=main.go:322 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-09-09T18:55:25.128488742Z caller=cluster.go:570 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-09-09T18:55:25.131985874Z caller=main.go:398 msg=Listening address=:9093
level=info ts=2018-09-09T18:55:27.128662897Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=0 before=0 now=3 elapsed=2.000096829s
level=info ts=2018-09-09T18:55:31.128969079Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=2 before=3 now=2 elapsed=6.000402722s
level=info ts=2018-09-09T18:55:33.129130021Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=3 before=2 now=1 elapsed=8.000564176s
level=info ts=2018-09-09T18:55:37.129427658Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=5 before=1 now=2 elapsed=12.000855483s
level=info ts=2018-09-09T18:55:45.130073007Z caller=cluster.go:587 component=cluster msg="gossip settled; proceeding" elapsed=20.001506309s
About this issue
- Original URL
- State: open
- Created 6 years ago
- Reactions: 7
- Comments: 61 (23 by maintainers)
FWIW I’ve been following this as I’ve been having the same issues.
I have both UDP and TCP open, and I verified that both were reachable with Ncat.
Today I added the --cluster.listen-address= and --cluster.advertise-address= flags.
This has fixed the issue from what I can see.
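For anyone landing here, a minimal sketch of what that invocation can look like on a plain (non-Docker) host; the addresses and peer below are hypothetical placeholders, not values from this report:

alertmanager \
  --config.file=/etc/alertmanager/config.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.advertise-address=192.0.2.10:9094 \
  --cluster.peer=192.0.2.11:9094

The second member would run the same command with its own advertise address and the first member as its --cluster.peer.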
I also have this problem and I don't know how to resolve it.
This is still an issue in 2019. Can you let me know how to access the payloads?
I’m running alertmanager in Docker Swarm, the two alertmanagers communicate over a dedicated Overlay network (no restrictions on the overlay network, all ports open). However, they both also have an additional network attached to expose them through my load balancer.
As per earlier posts, it's impossible to know the IP when AM starts up, so I have --cluster.advertise-address=:9094 set instead, which seemingly results in Alertmanager using a 172.x address (the Docker host range) rather than the IPs of the overlay network. This leads to (at least) one of the AMs continuously flapping, resulting in the duplicate alerts.
Is there a good solution to deal with this? Or is the current workaround to fetch the IP during startup and set it as the advertise address? Or to pin the instances to a host, create port mappings for TCP and UDP, and use the host address?
When using Docker it's important to set --cluster.advertise-address so the cluster nodes know their real IP, and to publish the gossip port over both tcp and udp. When using docker-compose this can be done as in the sketch below. Doing both fixed the issue with duplicate notifications for me.
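A minimal docker-compose sketch of that setup, assuming Alertmanager 0.15.x; the image tag, host IP, peer hostname (alertmanager-peer) and config path are placeholders:

version: "3.4"
services:
  alertmanager:
    image: prom/alertmanager:v0.15.2
    command:
      - --config.file=/etc/alertmanager/config.yml
      - --cluster.listen-address=0.0.0.0:9094
      # advertise an address the other node can actually reach (placeholder IP)
      - --cluster.advertise-address=192.0.2.10:9094
      # the second Alertmanager instance (placeholder hostname)
      - --cluster.peer=alertmanager-peer:9094
    ports:
      - "9093:9093"       # web UI / API
      - "9094:9094/tcp"   # cluster gossip over TCP
      - "9094:9094/udp"   # cluster gossip over UDP
    volumes:
      - ./alertmanager/config.yml:/etc/alertmanager/config.yml

The important parts are that --cluster.advertise-address carries an address the peer can reach and that port 9094 is published for both TCP and UDP; the second instance mirrors this with its own advertise address.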
As noted in the README, both UDP and TCP ports need to be open for HA mode:
https://github.com/prometheus/alertmanager#high-availability
From looking at the deployment, I only see a TCP endpoint being opened. If you open a UDP port and configure the AMs with this, the duplicate messages should go away.
For further support, please write to the users mailing list, prometheus-users@googlegroups.com, since this seems to be a usage question and not a bug.
That's not necessary. Your group_wait and group_interval values are too small. I suggest you follow up on the mailing list for help choosing better values for your Alertmanager config.
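Purely as an illustration of where those knobs live (the concrete values here are hypothetical, not a recommendation), the timers sit on the route block:

route:
  receiver: ops-slack
  group_wait: 1m        # how long to buffer the first alerts of a new group before notifying
  group_interval: 10m   # how long to wait before notifying about new alerts added to a group
  repeat_interval: 4h   # how long to wait before re-sending a still-firing notification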