rancher: Weird etcdHighNumberOfFailedGRPCRequests Prometheus rule in v2.5 rancher-monitoring

I was just trying out the v2.5 monitoring chart and noticed that I was getting a large number of alerts for etcdHighNumberOfFailedGRPCRequests for kube-etcd (alertname="etcdHighNumberOfFailedGRPCRequests", grpc_method="Watch", grpc_service="etcdserverpb.Watch", prometheus="monitoring/monitoring-rancher-monitor-prometheus", severity="critical").

I took a look at the rules and something struck me as odd. This is the expression used to determine this alert:

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) BY (job, instance, grpc_service, grpc_method) / sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)

Which seems to result in garbage data: [screenshot]

However, replacing BY (job, instance, grpc_service, grpc_method) with simply BY (instance) seems to produce a much more meaningful metric (% of gRPC requests failed per node): [screenshot]

From my perspective, it seems like the rule should be changed to BY (instance), as that makes more sense to me, but I am not 100% sure; I am very new to Prometheus in general.
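
For reference, the BY (instance) variant I experimented with looks roughly like this (just a sketch of the same rule with the grouping reduced, not a tested replacement):

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) by (instance)
  /
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) by (instance)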

gz#14464 gz#14015 gz#14746 gz#15416 gz#16124 gz#16211

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 15
  • Comments: 33 (6 by maintainers)

Most upvoted comments

Update: upstream still has not updated its dashboards, primarily because etcd has not updated the dashboards published on its website.

To get this change from the upstream chart:

  • A user must submit a PR to etcd-io/etcd at contrib/mixin/mixin.libsonnet - DONE
  • The contents of etcd-io/website at content/en/docs/v3.4/op-guide/etcd3_alert.rules.yml must be manually synced based on the above merged contents - NOT DONE
  • kube-prometheus-stack must run a script to update etcd dashboards - BLOCKED
  • Rancher must rebase to the latest upstream chart - BLOCKED

cc @Jono-SUSE-Rancher @snasovich: it may make sense to prioritize this issue as a patch on Monitoring directly, since it has been over a year since the original PR was merged.

Let me try to summarize the issue:

  • The problem with the expression comes from the grpc_code!="OK" part: it treats other non-error codes (Canceled, InvalidArgument, NotFound, etc.) as errors, so the expression reports a high rate of errors and thus triggers too many alerts (see the gRPC status code standard).
  • The above problem in the expression has been fixed in the upstream etcd repo: https://github.com/etcd-io/etcd/pull/13127
  • But the above fix is not yet synced to the kube-prometheus-stack repo: notice that the wrong expression is still in the main branch.

For rancher-monitoring charts, I think we can patch the expression instead of waiting for the upstream’s fix.
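
For reference, the patched expression looks roughly like this (a sketch adapted from the grpc_code regex used in the Helm-values workaround further down this thread and the without clause of the new upstream rule; the exact labels and thresholds in the chart may differ). It matches only actual error codes instead of everything that is not OK:

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
  /
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
  > 5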

@MKlimuszka Could you please link the commit where this is fixed?

So this means in the current versions these are expected “issues”…

So “best option until fix” is to…silence alerts coming from etcd?

The cause of the frequent etcdHighNumberOfFailedGRPCRequests alerts is not the metric expression itself.

The output of the existing expression with by:

100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m]))

does not differ from the new expression using without:

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) without (grpc_type, grpc_code) / sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)

The high number of alerts is caused by the upstream etcd issue tracked in https://github.com/etcd-io/etcd/issues/10289, with PR https://github.com/etcd-io/etcd/pull/12196 open to address this. Until this is fixed, silencing or removing this specific alert will address the alert noise.

For kube-prometheus-stack users who come across this issue:

I put this in the Helm values.yml file as a temporary fix:

defaultRules:
  disabled:
    # the results of this rule are not accurate: https://github.com/rancher/rancher/issues/29939
    etcdHighNumberOfFailedGRPCRequests: true
  
additionalPrometheusRulesMap:
  additional-rules:
    groups:
      # fix etcdHighNumberOfFailedGRPCRequests alerting rules from above, adapted from: https://github.com/etcd-io/etcd/pull/13127
      - name: etcd-additional
        rules:
          - alert: etcdHighNumberOfFailedGRPCRequestsWarning
            annotations:
              message: >-
                etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{
                $labels.grpc_method }} failed on etcd instance {{ $labels.instance
                }}.
            expr: >-
              100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*",
                grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) 
              / 
              sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m]))
              > 1
            for: 10m
            labels:
              severity: warning
          - alert: etcdHighNumberOfFailedGRPCRequestsCritical
            annotations:
              message: >-
                etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{
                $labels.grpc_method }} failed on etcd instance {{ $labels.instance
                }}.
            expr: >-
              100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*",
                grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) 
              / 
              sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m]))
              > 5
            for: 5m
            labels:
              severity: critical

The queries do not include the ‘without’ clauses, but the results are looking good to me:

[screenshot]
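
One caveat: because the sums above have no by clause, the aggregation drops all labels, so the {{ $labels.job }}, {{ $labels.grpc_method }} and {{ $labels.instance }} placeholders in the annotations will render empty. A rough sketch of the warning expression with the original grouping restored, if you want those labels in the alert message (untested, adapted from the original rule's by clause):

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) by (job, instance, grpc_service, grpc_method)
  /
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) by (job, instance, grpc_service, grpc_method)
  > 1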

Has this been resolved? I’m seeing errors in my environment but the system seems fine.

I don’t think so. I’m running Rancher 2.5.9, fresh install of the monitoring solution, and I’m seeing this alert on clusters running 3 etcd instances. I’m silencing it for now, hoping we don’t need to in the future.

My cluster is healthy and fairly stable, and I also get these constantly. I eventually silenced them through the Alertmanager secret (under alertmanager.yaml):

route:
  ...
  routes:
    # temporarily silence this alert until it's fixed upstream
    - match:
        alertname: etcdHighNumberOfFailedGRPCRequests
      receiver: "null"
...
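
Note that routing to "null" only works if a receiver with that name is defined; the default kube-prometheus-stack/rancher-monitoring Alertmanager config typically ships one, but if yours does not, it can be declared like this:

receivers:
  - name: "null"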

Hi @aiyengar2, yeah, I realised that after digging into things a little deeper. I actually opened that issue 😃 but yes, just waiting to see what happens there. Thanks for the response.

Pass: Verified in 2.7.0-rc8

  1. Created a downstream RKE1 cluster
  2. Installed rancher-monitoring
  3. Checked the Prometheus UI for etcdHighNumberOfFailedGRPCRequests: 0 active
  4. Manually ran the query for the etcdHighNumberOfFailedGRPCRequests alert with the check decreased to 1 second: 0 results

No longer receiving a high number of alerts for etcdHighNumberOfFailedGRPCRequests.

Hi @ccittadino-ctic, maybe I am misunderstanding something about the issue, or you linked the wrong issue? I'm not sure how prometheus-community/helm-charts/225 is related. I do, however, see that the PR that Alex linked, which is the root cause of this issue, has been merged for etcd v3.5.0-alpha and may be backported to v3.4.x. When this is done, we should see these alerts clear up. In the meantime, a good workaround would be to silence or remove this alert and make sure you have other alerts set up to track etcd health.