prometheus-operator: Alert K8SApiServerLatency triggering
The alert K8SApiServerLatency is, strangely, firing on all my clusters. I checked my apiservers and etcd instances and they appear to be running smoothly, with no latency or load issues.
Alert
ALERT K8SApiServerLatency
  IF histogram_quantile(0.99,
       sum(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH"})
       WITHOUT (instance, node, resource)) / 1000000 > 1
  FOR 10m
  LABELS {service="k8s", severity="warning"}
  ANNOTATIONS {
    description="99th percentile latency for {{ $labels.verb }} requests to the kube-apiserver is higher than 1s.",
    summary="Kubernetes apiserver latency is high"
  }
Values above 1 are triggering it.

What’s wrong?
Analysing this alert, it excludes the verb WATCHLIST from the comparison but not the verb LIST. As we can see in the graph below, the latency for verb WATCHLIST is exactly the same as for LIST. I'd like to hear whether something is wrong in my environment or whether other users see the same behaviour; I see the same values in 2 clusters.
Latency with all verbs
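For reference, a graph like the one above can be produced with the alert's own expression with the verb filter dropped, so that every verb is plotted; the division by 1000000 converts the metric's microseconds into seconds. A sketch of such a query:

histogram_quantile(0.99,
  sum(apiserver_request_latencies_bucket)
  WITHOUT (instance, node, resource)) / 1000000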

About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 20 (12 by maintainers)
A naive question - isn't apiserver_request_latencies_bucket a counter? If so, shouldn't it be sum(rate(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH"})) ...?

BTW 👍 on excluding list. Seeing this on a production cluster with 300 pods for get pods.

@pete0emerson in your case it seems to be the log resource endpoint, which can naturally take a long time, so we should just ignore that one, especially as it also allows following/streaming, so it can actually be infinitely long.
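For illustration, a rate()-based version of the alert expression could look roughly like this (a sketch only: the 5m window is an arbitrary choice, and aggregating WITHOUT (instance, node, resource) keeps the le label that histogram_quantile needs):

histogram_quantile(0.99,
  sum(rate(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH"}[5m]))
  WITHOUT (instance, node, resource)) / 1000000 > 1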
@slintes the Kubernetes components themselves are making requests to the apiserver, so those latencies are not coming from kubectl alone. Could you post the ones that are firing for you? It's likely that they just have to be ignored, like the logs resource, as we can't tell from the latency alone whether something is wrong.
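To see which label combinations are firing in your own cluster, the ALERTS series that Prometheus exposes for active alerts can be queried directly, and a drill-down query that keeps the resource label (a sketch, not part of the shipped rule) shows which endpoints dominate the latency:

ALERTS{alertname="K8SApiServerLatency", alertstate="firing"}

sort_desc(
  histogram_quantile(0.99,
    sum(rate(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH"}[5m]))
    WITHOUT (instance, node)) / 1000000)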
Alerts require continuous improvement, and as a community we’re able to catch a lot more cases, so all suggestions/PRs welcome! 🙂
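Putting the suggestions from this thread together, an adjusted rule could look roughly like the sketch below. It is only an illustration, not the rule that was eventually merged: the rate() window, the LIST exclusion and the subresource!="log" matcher (which assumes your Kubernetes version exposes a subresource label on this metric) are all illustrative choices.

ALERT K8SApiServerLatency
  # rate() over the bucket counters; log subresource ignored as discussed above
  # (the subresource label is an assumption about your Kubernetes version)
  IF histogram_quantile(0.99,
       sum(rate(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|LIST", subresource!="log"}[5m]))
       WITHOUT (instance, node, resource)) / 1000000 > 1
  FOR 10m
  LABELS {service="k8s", severity="warning"}
  ANNOTATIONS {
    description="99th percentile latency for {{ $labels.verb }} requests to the kube-apiserver is higher than 1s.",
    summary="Kubernetes apiserver latency is high"
  }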