prometheus: Bad data error posting alerts to alertmanager
What did you do? Ran prom/prometheus:master with prom/alertmanager:v0.11.0. I’m using master instead of v2.0.0 because I need to pick up the fixes in prometheus/tsdb#213 and #3508.
What did you expect to see? Alerts posted to alertmanager from prometheus work as expected
What did you see instead? Under which circumstances? In the Prometheus logs:
level=error ts=2017-12-04T15:57:01.4290217Z caller=notifier.go:444 component=notifier alertmanager=http://172.30.18.142:9093/api/v1/alerts count=64 msg="Error sending alert" err="bad response status 400 Bad Request"
In the Alertmanager logs:
level=error ts=2017-12-04T15:57:01.42907689Z caller=api.go:780 msg="API error" err="bad_data: start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time"
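For reference, the same bad_data rejection can be reproduced by hand against Alertmanager's v1 alerts endpoint by posting an alert whose endsAt precedes its startsAt. The alertname and timestamps below are made-up placeholders; only the shape of the payload matters:

# hypothetical repro: endsAt is one hour before startsAt
curl -s -X POST http://172.30.18.142:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"Repro"},"startsAt":"2017-12-04T16:00:00Z","endsAt":"2017-12-04T15:00:00Z"}]'

Alertmanager should answer with a 400 and the same "start time must be before end time" error, which matches what the logs show Prometheus being rejected for.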
Environment: Kubernetes 1.8.2
- System information:
/prometheus $ uname -srm
Linux 4.4.0-101-generic x86_64
- Prometheus version:
/prometheus $ prometheus --version
prometheus, version 2.0.0 (branch: master, revision: 30b4439bbd795fee4703b8f5015da6bf13fee0ac)
build user: root@69018d7b60eb
build date: 20171204-11:46:26
go version: go1.9.2
- Alertmanager version:
alertmanager, version 0.11.0 (branch: HEAD, revision: 30dd0426c08b6479d9a26259ea5efd63bc1ee273)
build user: root@3e103e3fc918
build date: 20171116-17:43:56
go version: go1.9.2
- Prometheus configuration file:
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    core_env: kube
alerting:
  alertmanagers:
  - kubernetes_sd_configs:
    - api_server: null
      role: pod
      namespaces:
        names: []
    scheme: http
    timeout: 10s
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      separator: ;
      regex: alertmanager
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: default
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      separator: ;
      regex: null
      replacement: $1
      action: drop
rule_files:
- /config/prometheus-2/rules
scrape_configs:
- job_name: prometheus
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9091
- job_name: job-N
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: go_.+
    replacement: $1
    action: drop
  - source_labels: [__name__]
    separator: ;
    regex: http_.+
    replacement: $1
    action: drop
  static_configs:
  - targets:
    - <snip>
    labels:
      <snip>
  - targets:
    - <snip>
    labels:
      <snip>
.
.
.
- Alertmanager configuration file:
global:
  resolve_timeout: 1h
  slack_api_url: "{{ inquisition_slack_url }}"
route:
  group_by: ['alertname']
  group_wait: 2s
  group_interval: 10s
  receiver: slack
  # If an alert has been sent, then wait "repeat_interval" time before resending it
  repeat_interval: 1h
  routes:
  - match:
      destination: inq-enforcer
    receiver: inq-enforcer
    repeat_interval: 10m
    group_by: ['instance','alertname','hostname','privateip','publicip']
    group_wait: 2s
    group_interval: 10s
  - match:
      destination: inq-town-crier
    receiver: inq-town-crier
    repeat_interval: 10m
    group_by: ['instance','alertname','hostname','privateip','publicip']
    group_wait: 2s
    group_interval: 10s
  - match:
      destination: slack
    receiver: slack
    group_by: ['instance','alertname','hostname','privateip','publicip']
    group_wait: 2s
    group_interval: 10s
  - match:
      destination: slack-summarise
    receiver: slack-summarise
    group_by: ['alertname']
    group_wait: 1m
    group_interval: 10m
    repeat_interval: 30m
  - match:
      destination: inq-overwatch
    receiver: inq-overwatch
  - match:
      destination: pagerduty
    receiver: pagerduty
    group_by: ['instance','alertname','hostname','privateip','publicip']
receivers:
- name: inq-enforcer
  webhook_configs:
  - url: '<URL>'
- name: inq-town-crier
  webhook_configs:
  - url: '<URL>'
- name: slack
  slack_configs:
  - channel: "#{{ notification_channel }}"
    title: "[AlertManager] {{ '{{' }} .CommonAnnotations.summary {{ '}}' }}"
    text: "{{ '{{' }} .CommonLabels.alertname {{ '}}' }} - {{ '{{' }} .CommonAnnotations.description {{ '}}' }}"
- name: slack-summarise
  slack_configs:
  - channel: "#{{ notification_channel }}"
    title: "[AlertManager] {{ '{{' }} len .Alerts {{ '}}' }} {{ '{{' }} .CommonAnnotations.grouped_summary {{ '}}' }}"
    text: "{{ '{{' }} .CommonLabels.alertname {{ '}}' }} - {{ '{{' }} len .Alerts {{ '}}' }} {{ '{{' }} .CommonAnnotations.grouped_description {{ '}}' }}"
- name: inq-overwatch
  slack_configs:
  - channel: "#{{ inquisition_notification_channel }}"
    title: "{{ '{{' }} .CommonAnnotations.summary {{ '}}' }}"
    text: "{{ '{{' }} .CommonAnnotations.description {{ '}}' }}"
  webhook_configs:
  - url: '<URL>'
- name: pagerduty
  pagerduty_configs:
  - service_key: "{{ inquisition_monitoring_pagerduty_key }}"
- Logs:
Prometheus:
level=info ts=2017-12-04T15:53:31.333931883Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.0.0, branch=master, revision=36190239f6d965f5e009da2fae8b502c09399df5)"
level=info ts=2017-12-04T15:53:31.334009371Z caller=main.go:221 build_context="(go=go1.9.2, user=root@9d041e93ac17, date=20171130-13:36:53)"
level=info ts=2017-12-04T15:53:31.334034568Z caller=main.go:222 host_details="(Linux 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 2017 x86_64 prometheus-2-996bbbd66-4h47w (none))"
level=info ts=2017-12-04T15:53:31.338234558Z caller=main.go:398 msg="Starting TSDB ..."
level=info ts=2017-12-04T15:53:31.338073315Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=info ts=2017-12-04T15:53:31.338340621Z caller=web.go:380 component=web msg="Start listening for connections" address=:9091
level=info ts=2017-12-04T15:54:11.407839167Z caller=main.go:408 msg="TSDB started"
level=info ts=2017-12-04T15:54:11.407932743Z caller=main.go:490 msg="Loading configuration file" filename=/config/prometheus-2/prometheus
level=info ts=2017-12-04T15:54:12.74356221Z caller=kubernetes.go:100 component=notifier discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2017-12-04T15:54:12.751527333Z caller=main.go:385 msg="Server is ready to receive requests."
level=error ts=2017-12-04T15:57:01.4290217Z caller=notifier.go:444 component=notifier alertmanager=http://172.30.18.142:9093/api/v1/alerts count=64 msg="Error sending alert" err="bad response status 400 Bad Request"
level=error ts=2017-12-04T15:58:01.456900885Z caller=notifier.go:444 component=notifier alertmanager=http://172.30.18.142:9093/api/v1/alerts count=64 msg="Error sending alert" err="bad response status 400 Bad Request"
level=error ts=2017-12-04T15:58:01.464241747Z caller=notifier.go:444 component=notifier alertmanager=http://172.30.18.142:9093/api/v1/alerts count=64 msg="Error sending alert" err="bad response status 400 Bad Request"
Alertmanager:
.
.
.
level=info ts=2017-12-04T15:40:50.614454519Z caller=silence.go:279 component=silences msg="Maintenance done" duration=75.744µs
level=info ts=2017-12-04T15:40:50.62667297Z caller=nflog.go:304 component=nflog msg="Maintenance done" duration=99.391µs
level=info ts=2017-12-04T15:55:50.609303021Z caller=nflog.go:287 component=nflog msg="Running maintenance"
level=info ts=2017-12-04T15:55:50.609515273Z caller=silence.go:262 component=silences msg="Running maintenance"
level=info ts=2017-12-04T15:55:50.61830075Z caller=silence.go:279 component=silences msg="Maintenance done" duration=63.853µs
level=info ts=2017-12-04T15:55:50.631665188Z caller=nflog.go:304 component=nflog msg="Maintenance done" duration=85.098µs
level=error ts=2017-12-04T15:57:01.42907689Z caller=api.go:780 msg="API error" err="bad_data: start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time"
level=error ts=2017-12-04T15:58:01.456887499Z caller=api.go:780 msg="API error" err="bad_data: start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time"
level=error ts=2017-12-04T15:58:01.46423974Z caller=api.go:780 msg="API error" err="bad_data: start time must be before end time; start time must be before end time; start time must be before end time"
Alertmanager had been up for 3 days prior to this issue (which occurred when I updated the Prometheus container), so I have cut the rest of its logs out for clarity. I can provide them if needed.
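On the Prometheus side the failed pushes also show up in its own notifier metrics; a quick cross-check, assuming the :9091 listen address from the config above:

# inspect how many alert notifications have been sent to / rejected by Alertmanager
curl -s http://localhost:9091/metrics | grep prometheus_notifications

While the 400s are occurring, prometheus_notifications_errors_total for the 172.30.18.142:9093 alertmanager keeps increasing.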
About this issue
- State: closed
- Created 7 years ago
- Reactions: 9
- Comments: 19 (5 by maintainers)
This is a serious regression in 2.1 that is forcing us to restart pods. Please prioritize a fix, as downgrading is not an option because of the persisted data.
@Conorbro What are the plans/schedule for releasing the fix (#3724) for this bug?