prometheus: prometheus readiness probe keeps on failing

What did you do? I deployed Prometheus 2.27.1.

What did you expect to see? The Prometheus server up and running.

What did you see instead? Under which circumstances?

The prometheus container keeps restarting, with readiness probe failures.

  Warning  Unhealthy  5s (x63 over 9m50s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
prometheus-prometheus-operator-kube-p-prometheus-1       1/2     Running   3          12m

The readiness probe is configured like this:

    Readiness:  http-get http://:web/-/ready delay=100s timeout=3s period=5s #success=1 #failure=1000
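For reference, the same probe expressed as a plain pod spec would look roughly like the sketch below. The field names are standard Kubernetes, but the manifest itself is only an illustration of the values shown above, since the prometheus-operator generates the probe itself:

    readinessProbe:
      httpGet:
        path: /-/ready
        port: web                # named container port for the Prometheus web UI (9090)
      initialDelaySeconds: 100
      timeoutSeconds: 3          # the value increased in the second attempt below
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 1000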

-> Other configurations tried:

  1. Keep the original settings
  2. Increase the probe timeout
  • Prometheus version:

2.27.1

  • Logs:
 Normal   Pulling    11m (x2 over 11m)    kubelet            Pulling image "quay.io/prometheus/prometheus:v2.27.1"
  Normal   Created    11m (x2 over 11m)    kubelet            Created container prometheus
  Normal   Started    11m (x2 over 11m)    kubelet            Started container prometheus
  Normal   Pulled     11m (x2 over 11m)    kubelet            Successfully pulled image "quay.io/prometheus/prometheus:v2.27.1"
  Warning  Unhealthy  5s (x63 over 9m50s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 22 (9 by maintainers)

Most upvoted comments

Yes, the flags should still be applicable.

@LeviHarrison I looked a bit closer and observed that during restarts some queries always fail, so I think you are right. For example, this happened again:

level=info ts=2021-08-24T13:56:57.126Z caller=query_logger.go:79 component=activeQueryTracker msg="These queries didn't finish in prometheus' last run:" queries="[{\"query\":\"sum without(cpu, mode) (rate(node_cpu_seconds_total{mode!=\\\"idle\\\",mode!=\\\"iowait\\\",mode!=\\\"steal\\\"}[5m])) / on(instance) group_left() count by(instance) (sum by(instance, cpu) (node_cpu_seconds_total))\",\"timestamp_sec\":1629813411},{\"query\":\"container_memory_rss{image!=\\\"\\\",job=\\\"kubelet\\\",metrics_path=\\\"/metrics/cadvisor\\\"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=\\\"\\\"}))\",\"timestamp_sec\":1629813405},{\"query\":\"((sum(rate(apiserver_request_duration_seconds_count{job=\\\"apiserver\\\",verb=~\\\"LIST|GET\\\"}[1d])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job=\\\"apiserver\\\",le=\\\"0.1\\\",scope=~\\\"resource|\\\",verb=~\\\"LIST|GET\\\"}[1d])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job=\\\"apiserver\\\",le=\\\"0.5\\\",scope=\\\"namespace\\\",verb=~\\\"LIST|GET\\\"}[1d])) + sum(rate(apiserver_request_duration_seconds_bucket{job=\\\"apiserver\\\",le=\\\"5\\\",scope=\\\"cluster\\\",verb=~\\\"LIST|GET\\\"}[1d])))) + sum(rate(apiserver_request_total{code=~\\\"5..\\\",job=\\\"apiserver\\\",verb=~\\\"LIST|GET\\\"}[1d]))) / sum(rate(apiserver_request_total{job=\\\"apiserver\\\",verb=~\\\"LIST|GET\\\"}[1d]))\",\"timestamp_sec\":1629813409}]"
level=info ts=2021-08-24T13:56:57.127Z caller=web.go:540 component=web msg="Start listening for connections" address=0.0.0.0:9090

Now, when I look at the blog post by @brian-brazil (https://www.robustperception.io/what-queries-were-running-when-prometheus-died), I see that this can indeed happen. Apologies for assuming earlier that a query can't bring the pod down.

As per this blog post, https://giedrius.blog/2019/01/13/choosing-maximum-concurrent-queries-in-prometheus-smartly/, these are the flags that control the number of concurrent requests:

    --storage.remote.read-concurrent-limit=10
    --query.max-concurrency=20
The number should be picked such that it does not exceed the number of threads of execution on your (virtual) machine. Ideally, it should be a bit lower because if your machine will encounter huge queries, it is (probably) going to also use the CPU for other operations such as sending the packets over a network.
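For illustration, on a manually managed Prometheus container those flags would be passed as startup arguments roughly as sketched below; the surrounding container fields are assumptions, and with the prometheus-operator the arguments are generated from the Prometheus custom resource rather than written by hand:

    containers:
      - name: prometheus
        image: quay.io/prometheus/prometheus:v2.27.1
        args:
          - --config.file=/etc/prometheus/prometheus.yml
          - --storage.remote.read-concurrent-limit=10   # cap on concurrent remote read requests
          - --query.max-concurrency=20                  # cap on concurrently executing PromQL queries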

Are these flags still applicable, and is the assumption about concurrent requests still valid? I am using Prometheus 2.27.1.