prometheus: Bad data error posting alerts to alertmanager
What did you do? Ran prom/prometheus:master with prom/alertmanager:v0.11.0. I’m using master instead of v2.0.0 because I need to pick up the fixes in prometheus/tsdb#213 and #3508.
What did you expect to see? Alerts posted to alertmanager from prometheus work as expected
What did you see instead? Under which circumstances? In the Prometheus logs:
level=error ts=2017-12-04T15:57:01.4290217Z caller=notifier.go:444 component=notifier alertmanager=http://172.30.18.142:9093/api/v1/alerts count=64 msg="Error sending alert" err="bad response status 400 Bad Request"
In the Alertmanager logs:
level=error ts=2017-12-04T15:57:01.42907689Z caller=api.go:780 msg="API error" err="bad_data: start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time"
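For reference, the same bad_data rejection can be reproduced by hand against Alertmanager's v1 alerts endpoint by posting an alert whose endsAt precedes its startsAt. The alertname and timestamps below are made-up placeholders; only the shape of the payload matters:

# hypothetical repro: endsAt is one hour before startsAt
curl -s -X POST http://172.30.18.142:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"Repro"},"startsAt":"2017-12-04T16:00:00Z","endsAt":"2017-12-04T15:00:00Z"}]'

Alertmanager should answer with a 400 and the same "start time must be before end time" error, which matches what the logs show Prometheus being rejected for.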
Environment: Kubernetes 1.8.2
- System information:
/prometheus $ uname -srm
Linux 4.4.0-101-generic x86_64
- Prometheus version:
/prometheus $ prometheus --version
prometheus, version 2.0.0 (branch: master, revision: 30b4439bbd795fee4703b8f5015da6bf13fee0ac)
build user: root@69018d7b60eb
build date: 20171204-11:46:26
go version: go1.9.2
- Alertmanager version:
alertmanager, version 0.11.0 (branch: HEAD, revision: 30dd0426c08b6479d9a26259ea5efd63bc1ee273)
build user: root@3e103e3fc918
build date: 20171116-17:43:56
go version: go1.9.2
- Prometheus configuration file:
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    core_env: kube
alerting:
  alertmanagers:
  - kubernetes_sd_configs:
    - api_server: null
      role: pod
      namespaces:
        names: []
    scheme: http
    timeout: 10s
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      separator: ;
      regex: alertmanager
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: default
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      separator: ;
      regex: null
      replacement: $1
      action: drop
rule_files:
- /config/prometheus-2/rules
scrape_configs:
- job_name: prometheus
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9091
- job_name: job-N
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: go_.+
    replacement: $1
    action: drop
  - source_labels: [__name__]
    separator: ;
    regex: http_.+
    replacement: $1
    action: drop
  static_configs:
  - targets:
    - <snip>
    labels:
      <snip>
  - targets:
    - <snip>
    labels:
      <snip>
.
.
.
- Alertmanager configuration file:
global:
  resolve_timeout: 1h
  slack_api_url: "{{ inquisition_slack_url }}"
route:
  group_by: ['alertname']
  group_wait: 2s
  group_interval: 10s
  receiver: slack
  # If an alert has been sent, then wait "repeat_interval" time before resending it
  repeat_interval: 1h
  routes:
  - match:
      destination: inq-enforcer
    receiver: inq-enforcer
    repeat_interval: 10m
    group_by: ['instance','alertname','hostname','privateip','publicip']
    group_wait: 2s
    group_interval: 10s
  - match:
      destination: inq-town-crier
    receiver: inq-town-crier
    repeat_interval: 10m
    group_by: ['instance','alertname','hostname','privateip','publicip']
    group_wait: 2s
    group_interval: 10s
  - match:
      destination: slack
    receiver: slack
    group_by: ['instance','alertname','hostname','privateip','publicip']
    group_wait: 2s
    group_interval: 10s
  - match:
      destination: slack-summarise
    receiver: slack-summarise
    group_by: ['alertname']
    group_wait: 1m
    group_interval: 10m
    repeat_interval: 30m
  - match:
      destination: inq-overwatch
    receiver: inq-overwatch
  - match:
      destination: pagerduty
    receiver: pagerduty
    group_by: ['instance','alertname','hostname','privateip','publicip']
receivers:
- name: inq-enforcer
  webhook_configs:
  - url: '<URL>'
- name: inq-town-crier
  webhook_configs:
  - url: '<URL>'
- name: slack
  slack_configs:
  - channel: "#{{ notification_channel }}"
    title: "[AlertManager] {{ '{{' }} .CommonAnnotations.summary {{ '}}' }}"
    text: "{{ '{{' }} .CommonLabels.alertname {{ '}}' }} - {{ '{{' }} .CommonAnnotations.description {{ '}}' }}"
- name: slack-summarise
  slack_configs:
  - channel: "#{{ notification_channel }}"
    title: "[AlertManager] {{ '{{' }} len .Alerts {{ '}}' }} {{ '{{' }} .CommonAnnotations.grouped_summary {{ '}}' }}"
    text: "{{ '{{' }} .CommonLabels.alertname {{ '}}' }} - {{ '{{' }} len .Alerts {{ '}}' }} {{ '{{' }} .CommonAnnotations.grouped_description {{ '}}' }}"
- name: inq-overwatch
  slack_configs:
  - channel: "#{{ inquisition_notification_channel }}"
    title: "{{ '{{' }} .CommonAnnotations.summary {{ '}}' }}"
    text: "{{ '{{' }} .CommonAnnotations.description {{ '}}' }}"
  webhook_configs:
  - url: '<URL>'
- name: pagerduty
  pagerduty_configs:
  - service_key: "{{ inquisition_monitoring_pagerduty_key }}"
- Logs:
Prometheus:
level=info ts=2017-12-04T15:53:31.333931883Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.0.0, branch=master, revision=36190239f6d965f5e009da2fae8b502c09399df5)"
level=info ts=2017-12-04T15:53:31.334009371Z caller=main.go:221 build_context="(go=go1.9.2, user=root@9d041e93ac17, date=20171130-13:36:53)"
level=info ts=2017-12-04T15:53:31.334034568Z caller=main.go:222 host_details="(Linux 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 2017 x86_64 prometheus-2-996bbbd66-4h47w (none))"
level=info ts=2017-12-04T15:53:31.338234558Z caller=main.go:398 msg="Starting TSDB ..."
level=info ts=2017-12-04T15:53:31.338073315Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=info ts=2017-12-04T15:53:31.338340621Z caller=web.go:380 component=web msg="Start listening for connections" address=:9091
level=info ts=2017-12-04T15:54:11.407839167Z caller=main.go:408 msg="TSDB started"
level=info ts=2017-12-04T15:54:11.407932743Z caller=main.go:490 msg="Loading configuration file" filename=/config/prometheus-2/prometheus
level=info ts=2017-12-04T15:54:12.74356221Z caller=kubernetes.go:100 component=notifier discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2017-12-04T15:54:12.751527333Z caller=main.go:385 msg="Server is ready to receive requests."
level=error ts=2017-12-04T15:57:01.4290217Z caller=notifier.go:444 component=notifier alertmanager=http://172.30.18.142:9093/api/v1/alerts count=64 msg="Error sending alert" err="bad response status 400 Bad Request"
level=error ts=2017-12-04T15:58:01.456900885Z caller=notifier.go:444 component=notifier alertmanager=http://172.30.18.142:9093/api/v1/alerts count=64 msg="Error sending alert" err="bad response status 400 Bad Request"
level=error ts=2017-12-04T15:58:01.464241747Z caller=notifier.go:444 component=notifier alertmanager=http://172.30.18.142:9093/api/v1/alerts count=64 msg="Error sending alert" err="bad response status 400 Bad Request"
Alertmanager:
.
.
.
level=info ts=2017-12-04T15:40:50.614454519Z caller=silence.go:279 component=silences msg="Maintenance done" duration=75.744µs
level=info ts=2017-12-04T15:40:50.62667297Z caller=nflog.go:304 component=nflog msg="Maintenance done" duration=99.391µs
level=info ts=2017-12-04T15:55:50.609303021Z caller=nflog.go:287 component=nflog msg="Running maintenance"
level=info ts=2017-12-04T15:55:50.609515273Z caller=silence.go:262 component=silences msg="Running maintenance"
level=info ts=2017-12-04T15:55:50.61830075Z caller=silence.go:279 component=silences msg="Maintenance done" duration=63.853µs
level=info ts=2017-12-04T15:55:50.631665188Z caller=nflog.go:304 component=nflog msg="Maintenance done" duration=85.098µs
level=error ts=2017-12-04T15:57:01.42907689Z caller=api.go:780 msg="API error" err="bad_data: start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time"
level=error ts=2017-12-04T15:58:01.456887499Z caller=api.go:780 msg="API error" err="bad_data: start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time"
level=error ts=2017-12-04T15:58:01.46423974Z caller=api.go:780 msg="API error" err="bad_data: start time must be before end time; start time must be before end time; start time must be before end time"
Alertmanager had been up for 3 days prior to this issue (which occurred when I updated the Prometheus container), so I have cut the rest of its logs out for clarity. I can provide them if needed.
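On the Prometheus side the failed pushes also show up in its own notifier metrics; a quick cross-check, assuming the :9091 listen address from the config above:

# inspect how many alert notifications have been sent to / rejected by Alertmanager
curl -s http://localhost:9091/metrics | grep prometheus_notifications

While the 400s are occurring, prometheus_notifications_errors_total for the 172.30.18.142:9093 alertmanager keeps increasing.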
About this issue
- State: closed
- Created 7 years ago
- Reactions: 9
- Comments: 19 (5 by maintainers)
This is a serious regression in 2.1 that is forcing us to restart pods. Please prioritize a fix, as downgrading is not an option because of the persisted data.
@Conorbro What are the plans/schedule for releasing the fix (#3724) for this bug?