prometheus: prometheus readiness probe keeps on failing

What did you do? I deployed Prometheus 2.27.1.

What did you expect to see? The Prometheus server up and running.

What did you see instead? Under which circumstances?

The prometheus container keeps restarting, with readiness probe failures.

  Warning  Unhealthy  5s (x63 over 9m50s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
prometheus-prometheus-operator-kube-p-prometheus-1       1/2     Running   3          12m

The readiness probe is configured like this:

    Readiness:  http-get http://:web/-/ready delay=100s timeout=3s period=5s #success=1 #failure=1000
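For reference, the same probe expressed as a plain pod spec would look roughly like the sketch below. The field names are standard Kubernetes, but the manifest itself is only an illustration of the values shown above, since the prometheus-operator generates the probe itself:

    readinessProbe:
      httpGet:
        path: /-/ready
        port: web                # named container port for the Prometheus web UI (9090)
      initialDelaySeconds: 100
      timeoutSeconds: 3          # the value increased in the second attempt below
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 1000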

-> Other configurations tried:

  1. Keep the original settings
  2. Increase the probe timeout
  • Prometheus version:

2.27.1

  • Logs:
 Normal   Pulling    11m (x2 over 11m)    kubelet            Pulling image "quay.io/prometheus/prometheus:v2.27.1"
  Normal   Created    11m (x2 over 11m)    kubelet            Created container prometheus
  Normal   Started    11m (x2 over 11m)    kubelet            Started container prometheus
  Normal   Pulled     11m (x2 over 11m)    kubelet            Successfully pulled image "quay.io/prometheus/prometheus:v2.27.1"
  Warning  Unhealthy  5s (x63 over 9m50s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 22 (9 by maintainers)

Most upvoted comments

Yes, the flags should still be applicable.

@LeviHarrison I looked a bit closer and observed that during restarts some queries always fail, so I think you are right. For example, this happened again:

level=info ts=2021-08-24T13:56:57.126Z caller=query_logger.go:79 component=activeQueryTracker msg="These queries didn't finish in prometheus' last run:" queries="[{\"query\":\"sum without(cpu, mode) (rate(node_cpu_seconds_total{mode!=\\\"idle\\\",mode!=\\\"iowait\\\",mode!=\\\"steal\\\"}[5m])) / on(instance) group_left() count by(instance) (sum by(instance, cpu) (node_cpu_seconds_total))\",\"timestamp_sec\":1629813411},{\"query\":\"container_memory_rss{image!=\\\"\\\",job=\\\"kubelet\\\",metrics_path=\\\"/metrics/cadvisor\\\"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=\\\"\\\"}))\",\"timestamp_sec\":1629813405},{\"query\":\"((sum(rate(apiserver_request_duration_seconds_count{job=\\\"apiserver\\\",verb=~\\\"LIST|GET\\\"}[1d])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job=\\\"apiserver\\\",le=\\\"0.1\\\",scope=~\\\"resource|\\\",verb=~\\\"LIST|GET\\\"}[1d])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job=\\\"apiserver\\\",le=\\\"0.5\\\",scope=\\\"namespace\\\",verb=~\\\"LIST|GET\\\"}[1d])) + sum(rate(apiserver_request_duration_seconds_bucket{job=\\\"apiserver\\\",le=\\\"5\\\",scope=\\\"cluster\\\",verb=~\\\"LIST|GET\\\"}[1d])))) + sum(rate(apiserver_request_total{code=~\\\"5..\\\",job=\\\"apiserver\\\",verb=~\\\"LIST|GET\\\"}[1d]))) / sum(rate(apiserver_request_total{job=\\\"apiserver\\\",verb=~\\\"LIST|GET\\\"}[1d]))\",\"timestamp_sec\":1629813409}]"
level=info ts=2021-08-24T13:56:57.127Z caller=web.go:540 component=web msg="Start listening for connections" address=0.0.0.0:9090

Now, when I look at the blog post by @brian-brazil (https://www.robustperception.io/what-queries-were-running-when-prometheus-died), I see that this can indeed happen. Apologies for assuming earlier that a query can't bring the pod down.

As per this blog post, https://giedrius.blog/2019/01/13/choosing-maximum-concurrent-queries-in-prometheus-smartly/, these are the flags that control the number of concurrent requests:

    --storage.remote.read-concurrent-limit=10
    --query.max-concurrency=20
The number should be picked such that it does not exceed the number of threads of execution on your (virtual) machine. Ideally, it should be a bit lower because if your machine will encounter huge queries, it is (probably) going to also use the CPU for other operations such as sending the packets over a network.
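For illustration, on a manually managed Prometheus container those flags would be passed as startup arguments roughly as sketched below; the surrounding container fields are assumptions, and with the prometheus-operator the arguments are generated from the Prometheus custom resource rather than written by hand:

    containers:
      - name: prometheus
        image: quay.io/prometheus/prometheus:v2.27.1
        args:
          - --config.file=/etc/prometheus/prometheus.yml
          - --storage.remote.read-concurrent-limit=10   # cap on concurrent remote read requests
          - --query.max-concurrency=20                  # cap on concurrently executing PromQL queries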

Are these flags still applicable, and is the assumption about concurrent requests still valid? I am using Prometheus 2.27.1.