alertmanager: PagerDuty notifier fails with the error "http: server closed idle connection"
What did you do?
We have an alert rule like the following, which fires periodically, reporting that Alertmanager is failing to notify PagerDuty.
- alert: AlertmanagerNotificationFailing
  expr: 'rate(alertmanager_notifications_failed_total[1m]) > 0'
  labels:
    severity: warning
  annotations:
    description: "Alertmanager is failing sending notifications."
    summary: "AlertManager {{ $labels.instance }} - {{ $labels.integration }} notification failing"
After setting the log level to debug, we found messages like the one below:
{"attempt":1,"caller":"notify.go:668","component":"dispatcher","err":"failed to post message to PagerDuty: Post https://events.pagerduty.com/v2/enqueue: http: server closed idle connection","integration":"pagerduty","level":"debug","msg":"Notify attempt failed","receiver":"linf","ts":"2020-08-10T19:01:54.036Z"}
So the PagerDuty notifier keeps connections alive, with an idle timeout of 5 minutes (as per here), but PagerDuty clearly closes idle connections sooner than that.
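For context, here is a minimal net/http sketch of a client behaving the way the notifier does, keeping idle connections around for reuse; the field names are standard net/http and the 5-minute value is an assumption taken from the timeout referenced above, not Alertmanager's exact code:

package main

import (
	"net/http"
	"time"
)

func main() {
	// Keep-alive is on by default: idle connections are cached for reuse and
	// only closed by the client after IdleConnTimeout. If the server drops an
	// idle connection first, the next request sent on it can fail with
	// "http: server closed idle connection".
	transport := &http.Transport{
		IdleConnTimeout: 5 * time.Minute, // assumed value, per the timeout referenced above
	}
	client := &http.Client{Transport: transport, Timeout: 30 * time.Second}
	_ = client
}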
The response from PagerDuty’s engineering team was:
It seems like you are trying to use a persistent connection to our Events API in order to send multiple events. I’m afraid this is currently not supported by the Events API, we don’t expect to keep connections alive any longer than a single API request.
What did you expect to see?
Based on that, I am wondering if we could update this line from
client, err := commoncfg.NewClientFromConfig(*c.HTTPConfig, "pagerduty", false)
to
client, err := commoncfg.NewClientFromConfig(*c.HTTPConfig, "pagerduty", true)
so that keep-alive will be disabled, users will not receive the above errors, and the metric alertmanager_notifications_total will not be incremented because of them.
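In plain net/http terms, flipping that argument to true amounts to building the client with keep-alives disabled on its transport. The sketch below only illustrates that effect (the helper name is made up); it is not Alertmanager's actual code path:

package main

import (
	"net/http"
	"time"
)

// newPagerDutyClient is a hypothetical helper showing the effect of disabling
// keep-alives: the transport sends "Connection: close" with each request and
// opens a fresh TCP connection per event, so no idle connection is left for
// PagerDuty to tear down.
func newPagerDutyClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{DisableKeepAlives: true},
		Timeout:   30 * time.Second,
	}
}

func main() {
	_ = newPagerDutyClient()
}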
- Alertmanager version:
Version Information
Branch: HEAD
BuildDate: 20191211-14:13:14
BuildUser: root@00c3106655f8
GoVersion: go1.13.5
Revision: f74be0400a6243d10bb53812d6fa408ad71ff32d
Version: 0.20.0
- Alertmanager configuration file:
global:
  resolve_timeout: 5m
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: default
  group_by:
  - '...'
  routes:
  - receiver: default
    group_by:
    - '...'
    match:
      label: value
  group_wait: 0s
  group_interval: 5m
  repeat_interval: 4h
receivers:
- name: default
  pagerduty_configs:
  - send_resolved: true
    routing_key: <secret>
    url: https://events.pagerduty.com/v2/enqueue
    client: '{{ template "pagerduty.default.client" . }}'
    client_url: '{{ template "pagerduty.default.clientURL" . }}'
    description: '{{ .CommonAnnotations.summary }}'
    details:
      firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
      num_firing: '{{ .Alerts.Firing | len }}'
      num_resolved: '{{ .Alerts.Resolved | len }}'
      resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
    severity: '{{ if .CommonLabels.severity }}{{ .CommonLabels.severity | toLower }}{{ else }}critical{{ end }}'
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 16 (11 by maintainers)
Hi @roidelapluie. I am looking at this. We will decide next week whether we will support keep-alive connections going forward or respond with a Connection: close header.

If that’s the case then PagerDuty should be sending back a Connection: close header, as connection persistence has been the norm for 20+ years. As-is, PagerDuty is violating the HTTP spec.
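For illustration, an HTTP/1.1 server that does not want persistent connections can signal that on each response; this is a minimal Go sketch of that behaviour, not PagerDuty's actual implementation:

package main

import (
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/v2/enqueue", func(w http.ResponseWriter, r *http.Request) {
		// Setting "Connection: close" makes Go's net/http server close the TCP
		// connection after this response, and tells HTTP/1.1 clients not to
		// reuse it for the next request.
		w.Header().Set("Connection", "close")
		w.WriteHeader(http.StatusAccepted)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}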