alertmanager: Alert grouping in PagerDuty is not very useful
When triggering multiple alerts that get grouped together, the result in PagerDuty winds up being like this:
Only the alert summary gets updated, and subsequent messages get logged. The alert details are not updated (it still reflects the original message and summary). You have to click through the detailed log to see each individual API message sent from Alertmanager.
Ideally I’d want grouped alerts to show up individually in PagerDuty, and then be associated with the same incident, but AIUI there isn’t an API for this.
The main issue here is that as further alerts are grouped together, new notifications are not triggered. This is dangerous, because it means that alerts can go unnoticed silently (until the operator manually polls Prometheus and/or Alertmanager). Basically, Prometheus is “editing” an existing alert instead of creating new ones. The only workaround is to disable grouping altogether, but that also gives up the grouping of simultaneous alerts, which means alert storms when several things break due to one root cause.
I’m honestly not sure what the right solution here is, but maybe something like this: generate new alerts per group with a new `dedup_key` whenever additional alerts are grouped together, and try to do something smart about resolution events, for example track when every root alert behind a sent event has resolved and send the resolution event then. With the current behavior, the only clean setup I can see is to just disable alert grouping altogether.
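To make the proposal above more concrete, here is a minimal Python sketch of the bookkeeping it would need. It is not how Alertmanager works today. `GroupTracker`, `on_firing` and `on_resolved` are hypothetical names, the returned dicts are simplified stand-ins for PagerDuty events (only `event_action` and `dedup_key` are real Events API fields; `alerts` is illustrative), and alerts are assumed to be identified by a stable fingerprint string.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class GroupTracker:
    """Hypothetical per-group state; not part of Alertmanager."""

    # dedup_key -> fingerprints of the still-firing alerts that event covers
    open_events: Dict[str, Set[str]] = field(default_factory=dict)

    def on_firing(self, fingerprints: Set[str]) -> List[dict]:
        """Emit one new trigger event covering only alerts not seen before."""
        known = set().union(*self.open_events.values())
        new = fingerprints - known
        if not new:
            return []  # nothing new joined the group, so no extra page
        key = uuid.uuid4().hex  # fresh dedup_key -> separate PagerDuty incident
        self.open_events[key] = set(new)
        return [{"event_action": "trigger", "dedup_key": key, "alerts": sorted(new)}]

    def on_resolved(self, fingerprints: Set[str]) -> List[dict]:
        """Resolve an event only once every alert it covered has resolved."""
        events = []
        for key in list(self.open_events):
            self.open_events[key] -= fingerprints
            if not self.open_events[key]:
                del self.open_events[key]
                events.append({"event_action": "resolve", "dedup_key": key})
        return events
```

With this scheme, a group that starts with alerts `a` and `b` and later gains `c` would produce two trigger events with different keys; the first is resolved only after both `a` and `b` have resolved.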
- Alertmanager version:
alertmanager, version 0.15.2 (branch: HEAD, revision: d19fae3bae451940b8470abb680cfdd59bfa7cfa)
About this issue
- State: open
- Created 6 years ago
- Reactions: 9
- Comments: 26 (12 by maintainers)
Hi Sylvain, thanks for directing us to the ongoing thread on GitHub!
When reviewing what different customers want from this integration, our Product team noted 3 common requests:

1. When triggering multiple alerts via the API and those alerts get grouped, only the alert summary gets updated, and subsequent messages get logged. The alert details are not updated (they still reflect the original message and summary). At this time, you have to click through the detailed log to see each individual API message sent from Alertmanager. Ideally you would want grouped alerts to show up individually in PagerDuty, and then be associated with the same incident, but AIUI there isn’t an API for this.
2. There is a feature request to implement “one-shot” alert grouping (wait, collect alerts, notify, start from scratch for any further alerts).
3. It’s tough to visually distinguish updates to an incident (e.g. whether alerts have been grouped into it, or when the last alert was added to it).
Regarding points one and two, our Product teams responsible for Alerts and Events are currently in a planning cycle and are reviewing these two requests.
Regarding point three, this is on our radar and we plan to make an improvement to this experience soon.
If you have any additional feedback you would like to provide that wasn’t shared on this thread, we would love to hear you out! You can always add context to this thread or reach out to PagerDuty Support with your particular use case for us to forward to the Product team.
Cheers!
I opened a support case at @PagerDuty pointing to this issue.
Users might see it the other way round: “Once I have ack’d the page, I don’t want to get re-paged on the same alert group. It’s all the same issue anyway, and I told PD I’m working on it.”
If you want to get re-paged, you could simply Resolve a PD incident rather than merely Ack’ing it. That should result in a new page as soon as Alertmanager sends an update to the alert.
In different news: I talked to a PD engineer long ago about the problem that all those additional updates to an alert are hidden quite deeply in the alert. They told me they would think about improving that, but apparently that hasn’t happened.
@shibumi it’s possible to use `group_by: [...]` to disable grouping.

Side-note: please use the mailing list or IRC for such questions rather than commenting on an existing issue…
Personally, I think the AM configuration options are already very confusing. I often get caught in them myself. Adding yet another way of managing groups (like the one you suggest) would confuse everybody even more. Also, I do think it will lead to “group spam” with the right pattern, e.g. imagine a slow death of a 1000-instance microservice where you essentially create updates to a group of “instance down” alerts every couple of seconds over many minutes. You would create many, many alert groups not only in PD but also in AM. And a page every `group_wait`, i.e. a page every 30s.

Fundamentally, we are working against a missing feature in PD. It would be very easy for them to represent updates to an incident in a more visible form, and that’s what I discussed with them back then. Perhaps you could file a feature request with them (we all pay them, after all, in contrast to the AM developers). Also, it would be nice if users of PD could configure whether they want to get re-notified on updates to an incident. That way, every user could decide what their preferred paging pattern is.
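To put a rough number on the “group spam” scenario above, the following Python sketch counts how many pages a one-shot grouping scheme would send. It is only a back-of-the-envelope model: `count_one_shot_pages` is a hypothetical helper, the 2-second failure spacing is an assumed reading of “every couple of seconds”, and 30s is the `group_wait` value mentioned above.

```python
def count_one_shot_pages(failure_times, group_wait=30):
    """Count pages if each new group collects alerts for group_wait seconds,
    notifies once, and any later alert starts a fresh group."""
    pages = 0
    window_end = None
    for t in sorted(failure_times):
        if window_end is None or t >= window_end:
            pages += 1                   # this alert opens a new group -> new page
            window_end = t + group_wait  # alerts before this time join the group
    return pages


# Assumed scenario: 1000 instances going down, one every 2 seconds.
failure_times = [2 * i for i in range(1000)]
print(count_one_shot_pages(failure_times))  # 67 pages over roughly 33 minutes
```

Under these assumptions the on-call engineer would be paged 67 times for what is a single underlying problem, compared with one incident under the current grouping behavior.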
That seems kind of a hack, in that it negates any incident management you might want to do in PD (basically you wind up using it only for escalation and notification), and there’s also a race condition there. I mean, it works, but wouldn’t what I suggested make more sense?