prometheus-operator: prometheusrules: Malformed alert rule halts prometheus-operator rules processing

What did you do?

Created a PrometheusRule with a malformed alert rule.

What did you expect to see?

Prometheus to ignore the bad rule, report the name of the failing PrometheusRule, and continue its operation.

What did you see instead? Under which circumstances?

Alert rules processing stopped.

Environment

The Prometheus Operator is running with:

    -manage-crds=false

  • Prometheus Operator version:

    v0.23.2

  • Kubernetes version information:

        Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.9+coreos.1", GitCommit:"cd373fe93e046b0a0bc7e4045af1bf4171cea395", GitTreeState:"clean", BuildDate:"2018-03-13T21:28:21Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
    
  • Kubernetes cluster kind:

    tectonic-installer

  • Manifests:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      clusterName: ""
      creationTimestamp: 2018-08-27T22:16:22Z
      deletionGracePeriodSeconds: null
      deletionTimestamp: null
      initializers: null
      labels:
        app: prometheus
        chart: prometheus-3.0.7
        heritage: Tiller
        prometheus: prometheus
        release: prometheus
        role: alert-rules
      name: prometheus-prometheus-rules
      namespace: foo
      resourceVersion: "134728661"
    spec: |-
      ALERT high_load IF (1 - (avg by (instance, cpu, environment) (irate(node_cpu{mode="idle"}[1m])))) > 0.5 FOR 5m ANNOTATIONS { summary = "Instance {{ $labels.instance }} under high load", description = "{{ $labels.instance }} in {{$labels.environment}} is under high load.", }
      ALERT error_rate IF (1 - (avg by (instance, cpu, environment) (irate(controller_request_error_count[1m])))) > 200 FOR 5m ANNOTATIONS { summary = "Instance {{ $labels.instance }} high error rate count", description = "{{ $labels.instance }} in {{$labels.environment}} is have high error rate count.", }
      ALERT latency IF (avg by (instance, cpu, environment) (irate(http_request_duration_microseconds[1m]))) > 500 FOR 5m ANNOTATIONS { summary = "Instance {{ $labels.instance }} high error rate count", description = "{{ $labels.instance }} in {{$labels.environment}} is have high error rate count.", }

In the case above:

  • Prometheus Operator logs:

        Failed to list *v1.PrometheusRule: json: cannot unmarshal string into Go struct field PrometheusRule.spec of type v1.PrometheusRuleSpec
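
For context, the operator expects spec to be a structured v1.PrometheusRuleSpec (rule groups in Prometheus 2.x format), not a raw 1.x rules string. As a sketch, the first alert above translated into that shape would look roughly like this (the group name is an assumption; the expression is carried over unchanged):

    spec:
      groups:
      - name: example.rules            # assumption: any valid group name works
        rules:
        - alert: high_load
          expr: (1 - (avg by (instance, cpu, environment) (irate(node_cpu{mode="idle"}[1m])))) > 0.5
          for: 5m
          annotations:
            summary: 'Instance {{ $labels.instance }} under high load'
            description: '{{ $labels.instance }} in {{ $labels.environment }} is under high load.'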

Another case:

A valid alert spec causes a validation error:

    groups:
    - name: general.rules
      rules:
      - alert: noop
        expr: 1
        for: 1m
        labels:
          severity: noop
        annotations:
          summary: 'noop alert'
          description: 'This alert is to enable validation that alerts can be triggered.'

The above rule makes the Prometheus Operator fail:

  • Prometheus Operator logs:

        E0823 21:39:59.146619       1 reflector.go:205] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:345: Failed to list *v1.PrometheusRule: json: cannot unmarshal number into Go struct field Rule.expr of type string
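
Until the operator accepts numbers here, one workaround (a sketch, not an official recommendation) is to quote the numeric expression so that YAML produces a string, matching the Go string type of Rule.expr:

      - alert: noop
        expr: "1"        # quoted: unmarshals as a string instead of a number
        for: 1m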

If -manage-crds is instead set to true, the Prometheus Operator also tries to pick up ConfigMaps, and since those might contain malformed alert rules as well, the operator crashes with an error:

    ts=2018-07-11T14:21:11.558466227Z caller=main.go:175 msg="Unhandled error received. Exiting..." err="creating CRDs failed: waiting for PrometheusRule crd failed: timed out waiting for Custom Resource: failed to list CRD: json: cannot unmarshal number into Go value of type string"

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 3
  • Comments: 31 (25 by maintainers)

Most upvoted comments

Thanks for the pointer! If anyone is interested, this jq saved my day today:

    kubectl get -n monitoring prometheusrule -o json | jq '.items[]|{name: .metadata.name, spec: (.spec|map(type)? // "string") }'
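
Against the malformed object from the report above, the output would look something like this (illustrative); a spec of "string" flags the legacy string-typed objects, while well-formed objects print the types of their spec fields:

    {
      "name": "prometheus-prometheus-rules",
      "spec": "string"
    }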

There seem to be two distinct issues here:

  1. A valid rule, expr: 1, was being interpreted as invalid, blocking the operator.
  2. Invalid rules block the operator.

The first was fixed by #1845; the second is still an issue. To solve the second, we still need to implement a webhook admission controller that validates rules before they are accepted by the API.
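
As a minimal sketch of what such a setup could look like, here is a ValidatingWebhookConfiguration that intercepts PrometheusRule writes; the metadata name, webhook name, service namespace/name, and path are all assumptions for illustration, not the operator's actual wiring:

    apiVersion: admissionregistration.k8s.io/v1beta1
    kind: ValidatingWebhookConfiguration
    metadata:
      name: prometheusrules-validation            # hypothetical name
    webhooks:
    - name: prometheusrules.monitoring.coreos.com # hypothetical webhook name
      failurePolicy: Fail
      rules:
      - apiGroups:
        - monitoring.coreos.com
        apiVersions:
        - "*"
        operations:
        - CREATE
        - UPDATE
        resources:
        - prometheusrules
      clientConfig:
        service:
          namespace: monitoring                   # assumption: where the webhook runs
          name: prometheus-operator               # assumption: service exposing the webhook
          path: /admission-prometheusrules/validate  # assumption: handler path
        caBundle: <base64-encoded-CA>             # placeholder

With failurePolicy: Fail, the API server rejects a PrometheusRule outright when the webhook deems it invalid, so broken rules never reach the operator's list/watch path in the first place.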

For what it’s worth, here is the way we do it: we write alerting rules in jsonnet (which also allows templating, etc.), generate the “normal” Prometheus rule file from them to run alert unit tests against, and then, to actually apply them, generate them into a PrometheusRule object.
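
To make the unit-test step concrete, here is a minimal sketch of a promtool test file for the noop alert above, assuming the generated rule file is named generated-rules.yaml and a promtool version that supports test rules:

    rule_files:
      - generated-rules.yaml        # assumption: output of the jsonnet generator
    evaluation_interval: 1m
    tests:
      - interval: 1m
        input_series: []            # expr: 1 is a constant, so no input series are needed
        alert_rule_test:
          - eval_time: 2m           # past the 1m "for" duration, so the alert should be firing
            alertname: noop
            exp_alerts:
              - exp_labels:
                  severity: noop
                exp_annotations:
                  summary: 'noop alert'
                  description: 'This alert is to enable validation that alerts can be triggered.'

Run it with promtool test rules <file> before generating the PrometheusRule object.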

This only solves part of the problem, since Prometheus constant numeric values are (essentially) floats.

I was bitten by this original Prometheus YAML (validates fine with promtool):

    - record: cluster:required_fraction:ratio
      expr: 0.75

Some background: we run in-house clusters on physical hardware. We need to plan for machine failures, so for capacity-planning purposes we have a failure budget, but to make it easier in the rules we express it as an “availability budget”. For cores and RAM, we extrapolate the last few days of requests comfortably far into the future (far enough that we can get a purchase order signed and machines delivered, installed, built, and joined to the cluster) before the projection exceeds <total available> * cluster:required_fraction:ratio.

For us, the immediate fix was to simply change the float 0.75 to the string “0.75”, but it’s definitely the case that making Rule.Expr an intstr.IntOrString doesn’t completely solve the problem.
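
In YAML terms, the stopgap is just the same recording rule with the value quoted:

    - record: cluster:required_fraction:ratio
      expr: "0.75"       # quoted so the YAML parser produces a string rather than a float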

Unfortunately I don’t think that’s possible anymore; it’ll be overwritten on subsequent runs of the generator. There is a bright light though, as https://github.com/coreos/prometheus-operator/pull/2551 was just opened, which I believe is the correct fix for this, as invalid PrometheusRule objects won’t even be possible to create 🙂.

@mxinden / @brancz hey guys, has this been fixed in any new release?

We’re going to update our Prometheus Operator across the board and want to see if we can stop using our fork with the fixes that @dcondomitti added in PR https://github.com/coreos/prometheus-operator/pull/1871.

It seems from the comments that it hasn’t been addressed.

What’s needed, and how can I help to get the fix merged?

Thanks!

I agree that anything that passes promtool should be something the operator can work with. On list actions there is nothing we can do about unmarshalling errors, so that part has to be solved in the type itself.

I would propose that we just change the expr field from being strictly a string to intstr.IntOrString. That would solve this problem without breaking the operator.
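
To make the trade-off concrete, a sketch of how each style would fare if expr became intstr.IntOrString; as noted above, IntOrString accepts integers and strings but has no float case, so bare floats would still fail to unmarshal:

    - alert: noop
      expr: 1                 # bare integer: would be accepted as an IntOrString
    - record: cluster:required_fraction:ratio
      expr: "0.75"            # string: accepted today and under the proposal
    - record: cluster:required_fraction:ratio
      expr: 0.75              # bare float: would still be rejected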