alertmanager: PagerDuty notifier fails with the error "http: server closed idle connection"
What did you do?
We have an alert rule like the following, which fires periodically, reporting that Alertmanager is failing to notify PagerDuty.
- alert: AlertmanagerNotificationFailing
  expr: 'rate(alertmanager_notifications_failed_total[1m]) > 0'
  labels:
    severity: warning
  annotations:
    description: "Alertmanager is failing sending notifications."
    summary: "AlertManager {{ $labels.instance }} - {{ $labels.integration }} notification failing"
After setting the log level to debug, we found messages like the one below:
{"attempt":1,"caller":"notify.go:668","component":"dispatcher","err":"failed to post message to PagerDuty: Post https://events.pagerduty.com/v2/enqueue: http: server closed idle connection","integration":"pagerduty","level":"debug","msg":"Notify attempt failed","receiver":"linf","ts":"2020-08-10T19:01:54.036Z"}
So the PagerDuty notifier keeps connections alive, with an idle timeout of 5 minutes (as per here), but PagerDuty clearly closes idle connections sooner than that.
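For context, here is a minimal net/http sketch of a client behaving the way the notifier does, keeping idle connections around for reuse; the field names are standard net/http and the 5-minute value is an assumption taken from the timeout referenced above, not Alertmanager's exact code:

package main

import (
	"net/http"
	"time"
)

func main() {
	// Keep-alive is on by default: idle connections are cached for reuse and
	// only closed by the client after IdleConnTimeout. If the server drops an
	// idle connection first, the next request sent on it can fail with
	// "http: server closed idle connection".
	transport := &http.Transport{
		IdleConnTimeout: 5 * time.Minute, // assumed value, per the timeout referenced above
	}
	client := &http.Client{Transport: transport, Timeout: 30 * time.Second}
	_ = client
}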
The response from PagerDuty’s engineering team was:
It seems like you are trying to use a persistent connection to our Events API in order to send multiple events. I’m afraid this is currently not supported by the Events API, we don’t expect to keep connections alive any longer than a single API request.
What did you expect to see?
Based on that, I am wondering if we could update this line from
client, err := commoncfg.NewClientFromConfig(*c.HTTPConfig, "pagerduty", false)
to
client, err := commoncfg.NewClientFromConfig(*c.HTTPConfig, "pagerduty", true)
so that keep-alive will be disabled, users will not receive the above errors, and the metric alertmanager_notifications_total will not be incremented because of them.
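In plain net/http terms, flipping that argument to true amounts to building the client with keep-alives disabled on its transport. The sketch below only illustrates that effect (the helper name is made up); it is not Alertmanager's actual code path:

package main

import (
	"net/http"
	"time"
)

// newPagerDutyClient is a hypothetical helper showing the effect of disabling
// keep-alives: the transport sends "Connection: close" with each request and
// opens a fresh TCP connection per event, so no idle connection is left for
// PagerDuty to tear down.
func newPagerDutyClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{DisableKeepAlives: true},
		Timeout:   30 * time.Second,
	}
}

func main() {
	_ = newPagerDutyClient()
}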
- Alertmanager version:
Version Information
Branch: HEAD
BuildDate: 20191211-14:13:14
BuildUser: root@00c3106655f8
GoVersion: go1.13.5
Revision: f74be0400a6243d10bb53812d6fa408ad71ff32d
Version: 0.20.0
- Alertmanager configuration file:
global:
  resolve_timeout: 5m
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: default
  group_by:
  - '...'
  routes:
  - receiver: default
    group_by:
    - '...'
    match:
      label: value
  group_wait: 0s
  group_interval: 5m
  repeat_interval: 4h
receivers:
- name: default
  pagerduty_configs:
  - send_resolved: true
    routing_key: <secret>
    url: https://events.pagerduty.com/v2/enqueue
    client: '{{ template "pagerduty.default.client" . }}'
    client_url: '{{ template "pagerduty.default.clientURL" . }}'
    description: '{{ .CommonAnnotations.summary }}'
    details:
      firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
      num_firing: '{{ .Alerts.Firing | len }}'
      num_resolved: '{{ .Alerts.Resolved | len }}'
      resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
    severity: '{{ if .CommonLabels.severity }}{{ .CommonLabels.severity | toLower }}{{ else }}critical{{ end }}'
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 16 (11 by maintainers)
Hi @roidelapluie. I am looking at this. We will decide next week whether we will support keep-alive connections going forward or respond with a Connection: close header.

If that’s the case then PagerDuty should be sending back a Connection: close header, as connection persistence has been the norm for 20+ years. As-is, PagerDuty is violating the HTTP spec.
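For illustration, an HTTP/1.1 server that does not want persistent connections can signal that on each response; this is a minimal Go sketch of that behaviour, not PagerDuty's actual implementation:

package main

import (
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/v2/enqueue", func(w http.ResponseWriter, r *http.Request) {
		// Setting "Connection: close" makes Go's net/http server close the TCP
		// connection after this response, and tells HTTP/1.1 clients not to
		// reuse it for the next request.
		w.Header().Set("Connection", "close")
		w.WriteHeader(http.StatusAccepted)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}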