prometheus-operator: Rule evaluation failures for alertmanager.rules

Two of the kube-prometheus rules from alertmanager.rules fail to evaluate: AlertmanagerConfigInconsistent and AlertmanagerDownOrMissing.

The error message is "many-to-many matching not allowed: matching labels must be unique on one side".

Here are the full log entries:

level=warn
ts=2018-05-08T19:09:30.012163531Z
caller=manager.go:339
component="rule manager"
group=alertmanager.rules
msg="Evaluating rule failed"
rule="alert: AlertmanagerConfigInconsistent\nexpr: count_values by(service) (\"config_hash\", alertmanager_config_hash) / on(service)\n  group_left() label_replace(prometheus_operator_alertmanager_spec_replicas, \"service\",\n  \"alertmanager-$1\", \"alertmanager\", \"(.*)\") != 1\nfor: 5m\nlabels:\n  severity: critical\nannotations:\n  description: The configuration of the instances of the Alertmanager cluster `{{$labels.service}}`\n    are out of sync.\n  summary: Configuration out of sync\n" err="many-to-many matching not allowed: matching labels must be unique on one side"

and:

level=warn
ts=2018-05-08T19:09:30.013795976Z
caller=manager.go:339
component="rule manager"
group=alertmanager.rules
msg="Evaluating rule failed"
rule="alert: AlertmanagerDownOrMissing\nexpr: label_replace(prometheus_operator_alertmanager_spec_replicas, \"job\", \"alertmanager-$1\",\n  \"alertmanager\", \"(.*)\") / on(job) group_right() sum by(job) (up) != 1\nfor: 5m\nlabels:\n  severity: warning\nannotations:\n  description: An unexpected number of Alertmanagers are scraped or Alertmanagers\n    disappeared from discovery.\n  summary: Alertmanager down or missing\n" err="many-to-many matching not allowed: matching labels must be unique on one side"

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 17 (12 by maintainers)

Most upvoted comments

Hello, I recently switched from the stable/prometheus-operator Helm chart to prometheus-community/charts/kube-prometheus-stack, and I am now seeing this problem as well. It's not really complaining about alertmanager, but rather about the kubelet.

level=warn ts=2020-09-24T20:31:49.382Z caller=manager.go:577 component="rule manager" group=kubernetes-system-kubelet msg="Evaluating rule failed" rule="alert: KubeletPodStartUpLatencyHigh\nexpr: histogram_quantile(0.99, sum by(instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job=\"kubelet\",metrics_path=\"/metrics\"}[5m])))\n  * on(instance) group_left(node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"}\n  > 60\nfor: 15m\nlabels:\n  severity: warning\nannotations:\n  message: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on\n    node {{ $labels.node }}.\n  runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletpodstartuplatencyhigh\n" err="found duplicate series for the match group {instance=\"10.31.5.101:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.31.5.101:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ip-10-31-5-101.ec2.internal\", service=\"monitoring-dev-kube-promet-kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.31.5.101:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ip-10-31-5-101.ec2.internal\", service=\"prometheus-dev-kube-promet-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-09-24T20:31:51.973Z caller=manager.go:577 component="rule manager" group=kubelet.rules msg="Evaluating rule failed" rule="record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile\nexpr: histogram_quantile(0.99, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))\n  * on(instance) group_left(node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"})\nlabels:\n  quantile: \"0.99\"\n" err="found duplicate series for the match group {instance=\"10.31.5.101:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.31.5.101:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ip-10-31-5-101.ec2.internal\", service=\"monitoring-dev-kube-promet-kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.31.5.101:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ip-10-31-5-101.ec2.internal\", service=\"prometheus-dev-kube-promet-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-09-24T20:31:51.978Z caller=manager.go:577 component="rule manager" group=kubelet.rules msg="Evaluating rule failed" rule="record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile\nexpr: histogram_quantile(0.9, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))\n  * on(instance) group_left(node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"})\nlabels:\n  quantile: \"0.9\"\n" err="found duplicate series for the match group {instance=\"10.31.5.101:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.31.5.101:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ip-10-31-5-101.ec2.internal\", service=\"monitoring-dev-kube-promet-kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.31.5.101:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ip-10-31-5-101.ec2.internal\", service=\"prometheus-dev-kube-promet-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-09-24T20:31:51.982Z caller=manager.go:577 component="rule manager" group=kubelet.rules msg="Evaluating rule failed" rule="record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile\nexpr: histogram_quantile(0.5, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))\n  * on(instance) group_left(node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"})\nlabels:\n  quantile: \"0.5\"\n" err="found duplicate series for the match group {instance=\"10.31.5.101:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.31.5.101:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ip-10-31-5-101.ec2.internal\", service=\"monitoring-dev-kube-promet-kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.31.5.101:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ip-10-31-5-101.ec2.internal\", service=\"prometheus-dev-kube-promet-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"

The query used by the alert that's firing, alertname="PrometheusRuleFailures", is:

increase(prometheus_rule_evaluation_failures_total{job="monitoring-dev-kube-promet-prometheus",namespace="monitoring"}[5m]) > 0
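
The error text itself points at the cause: the two conflicting kubelet_node_name series differ only in their service label (monitoring-dev-kube-promet-kubelet vs. prometheus-dev-kube-promet-kubelet), i.e. the same kubelet endpoint is being scraped through two different Services. A query along these lines (just a suggested diagnostic, not something from the chart) shows how many Services are feeding the kubelet job:

# More than one "service" value here means the kubelet targets are duplicated.
count by (service) (kubelet_node_name{job="kubelet", metrics_path="/metrics"})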

Edit/Update: I noticed that I still had an extra kubelet Service left around from the old chart:

# kubectl -n kube-system get services | grep kubelet
NAME                                   TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                        AGE
kube-dns                               ClusterIP   172.20.0.10   <none>        53/UDP,53/TCP                  59d
monitoring-dev-kube-promet-kubelet     ClusterIP   None          <none>        10250/TCP,10255/TCP,4194/TCP   28h
monitoring-dev-prometheus-kubelet      ClusterIP   None          <none>        10250/TCP,10255/TCP,4194/TCP   2d5h

After removing this Service, the problem seems to have gone away.
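
For anyone hitting the same thing, one quick way to confirm the duplication is gone (again only a suggested check, not part of the chart) is that this query should return no results once a single Service backs the kubelet targets:

# Should be empty: no kubelet instance scraped through more than one Service.
count by (instance) (kubelet_node_name{job="kubelet", metrics_path="/metrics"}) > 1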

Hello,

Unfortunately, this issue happened again after installing kube-prometheus-stack version 14.5.100 on Rancher k3s via the Helm CRD. I don't know whether the extra service was created at the initial installation or during a later upgrade when I changed quotas and limits.

# kubectl -n kube-system get services | grep kubelet

rancher-monitoring-kubelet   ClusterIP      None           <none>            10250/TCP,10255/TCP,4194/TCP   7d20h
rancher-monitoring-kube-pr-kubelet   ClusterIP      None           <none>            10250/TCP,10255/TCP,4194/TCP   7d20h

Which one do I have to remove? The second one with kube-pr-kubelet in the name?

I wish I knew how to fix this and submit a PR, but my PromQL is weak at best 😕 I think the issue is with the group_right() clause, but I'm not sure exactly what the problem is.
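
For what it's worth, one way the group_right() expression can be made to evaluate again is to collapse the "one" side of the match so that it is unique per job. The following is only a rough, unverified sketch of that idea, not the upstream fix:

# Sketch only: aggregate the left-hand ("one") side per job before matching,
# so duplicate prometheus_operator_alertmanager_spec_replicas series no longer
# trigger the many-to-many error. Note this masks the underlying duplicate scrape.
max by (job) (
  label_replace(prometheus_operator_alertmanager_spec_replicas,
    "job", "alertmanager-$1", "alertmanager", "(.*)")
)
/ on(job) group_right()
sum by (job) (up)
!= 1

That said, in the cases above the cleaner fix was to remove whichever duplicate Service or ServiceMonitor causes the metric to be scraped twice, as the earlier comments describe.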