prometheus-operator: Rule evaluation failures for alertmanager.rules
Two of the kube-prometheus rules from alertmanager.rules fail to evaluate: AlertmanagerConfigInconsistent and AlertmanagerDownOrMissing. The error message is `many-to-many matching not allowed: matching labels must be unique on one side`.
Here are the full log entries:
level=warn ts=2018-05-08T19:09:30.012163531Z caller=manager.go:339 component="rule manager" group=alertmanager.rules msg="Evaluating rule failed" rule="alert: AlertmanagerConfigInconsistent\nexpr: count_values by(service) (\"config_hash\", alertmanager_config_hash) / on(service)\n group_left() label_replace(prometheus_operator_alertmanager_spec_replicas, \"service\",\n \"alertmanager-$1\", \"alertmanager\", \"(.*)\") != 1\nfor: 5m\nlabels:\n severity: critical\nannotations:\n description: The configuration of the instances of the Alertmanager cluster `{{$labels.service}}`\n are out of sync.\n summary: Configuration out of sync\n" err="many-to-many matching not allowed: matching labels must be unique on one side"
And here is the second log entry:
level=warn ts=2018-05-08T19:09:30.013795976Z caller=manager.go:339 component="rule manager" group=alertmanager.rules msg="Evaluating rule failed" rule="alert: AlertmanagerDownOrMissing\nexpr: label_replace(prometheus_operator_alertmanager_spec_replicas, \"job\", \"alertmanager-$1\",\n \"alertmanager\", \"(.*)\") / on(job) group_right() sum by(job) (up) != 1\nfor: 5m\nlabels:\n severity: warning\nannotations:\n description: An unexpected number of Alertmanagers are scraped or Alertmanagers\n disappeared from discovery.\n summary: Alertmanager down or missing\n" err="many-to-many matching not allowed: matching labels must be unique on one side"
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 17 (12 by maintainers)
Hello, I recently switched from using the stable/prometheus-operator Helm chart to prometheus-community/charts/kube-prometheus-stack and I am now seeing this problem as well. It's not really complaining about Alertmanager, rather about something related to the kubelet.
The alert that is firing is alertname="PrometheusRuleFailures".
Edit/Update: I noticed that I had an extra kubelet service still left around from the old chart.
After removing this service, my problem seems to have gone away.
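If, as in this comment, a leftover kubelet Service from the old chart is the culprit, the duplication should be visible from Prometheus itself. A rough, hypothetical query, assuming the default job="kubelet" scrape job name and the service target label that the Prometheus Operator normally attaches to discovered targets:

```
# Hypothetical query: a second kubelet Service usually shows up as extra scrape
# targets, i.e. more than one distinct "service" label value behind the kubelet
# job. The job name and the presence of the "service" label are assumptions.
count(count by (service) (up{job="kubelet"})) > 1
```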
Hello,
Unfortunately this issue happened again after installing kube-prometheus-stack version 14.5.100 in Rancher k3s via the Helm CRD. I don't know whether the extra service was installed during the initial installation or during a later upgrade when I changed quotas and limits.
Which one do I have to remove? The second one, with kube-pr-kubelet in the name?
I wish I knew how to fix this and submit a PR… but my PromQL is weak at best 😕 I think the issue is with the group_right() clause, but I'm not sure what the problem is exactly.
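For what it's worth, group_right() is probably not wrong by itself; it only fails when the "one" (left) side of the match is not unique on job. One possible workaround, purely a sketch of my own and not a maintainer-endorsed fix, would be to collapse duplicate replicas series before the join:

```
# Sketch only: aggregate the replicas metric so each "job" value is unique
# before joining against up. The without() label list is a guess and may need
# adjusting for your setup.
label_replace(
  max without (instance, pod) (prometheus_operator_alertmanager_spec_replicas),
  "job", "alertmanager-$1", "alertmanager", "(.*)"
)
  / on(job) group_right()
sum by (job) (up)
!= 1
```

That said, removing the stray Service, as described in the earlier comments, is the cleaner fix, since it eliminates the duplicate series at the source.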