thanos: Query via Thanos causes Prometheus to OOM
Thanos, Prometheus and Golang version used
- Thanos components: improbable/thanos:master-2018-07-27-ecfce89
- Prometheus: prom/prometheus:v2.3.2
What happened
The query provided below times out (after the default 2-minute query timeout), but before it does, Prometheus gets OOM-killed.
What you expected to happen
Either the query completes successfully, or it times out (assuming it’s too complex) without bringing Prometheus down.
How to reproduce it (as minimally and precisely as possible):
We are using the following command to perform the query:
curl "https://thanosquery/api/v1/query_range?query=label_replace(%0A%20%20histogram_quantile(%0A%20%20%20%200.999%2C%0A%20%20%20%20sum(%20%20%20%20%20%20rate(join_joined_left_vs_right_timestamp_diff_bucket%7Bpartition%3D~%220%7C1%7C2%7C3%7C4%7C5%7C6%7C7%7C8%7C18%7C19%7C20%7C21%7C22%7C23%7C24%7C25%7C26%7C27%7C28%7C29%7C30%7C31%7C32%7C33%7C34%7C35%22%2Cmarathon_app%3D~%22%2Fsome%2Fapp%2Fregion%2Fjoin.*%22%7D%5B5m%5D)%0A%20%20%20%20)%20by%20(le%2C%20marathon_app%2C%20partition)%0A%20%20)%2C%20%0A%20%20%22region%22%2C%20%22%241%22%2C%20%22marathon_app%22%2C%20%22%2Fsome%2Fapp%2F(.*)%2Fjoin.*%22%0A)&start=1532343000&end=1532948400&step=600"
Running this once or twice always brings our nodes down. If we run the same query directly against Prometheus (not via Thanos Query), it completes successfully and quite fast (we keep 24h of data on Prometheus).
Anything else we need to know
We are running independent Docker hosts with the following containers:
- Prometheus
- Thanos Query
- Thanos Store
- Thanos Sidecar
Thanos Compactor runs independently on a different host. We also tried running Thanos Store on a different host from the other containers.
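For reference, here is a rough sketch of one host’s containers as plain docker run commands (the image tags are the ones listed above; container names, ports and network wiring are illustrative assumptions, and the object-storage flags for the sidecar and store are omitted):

```
# Sketch only; assumes a user-defined Docker network so containers resolve each
# other by name, and that Prometheus's TSDB directory is shared with the sidecar.
docker run -d --name prometheus \
  -v prom-data:/prometheus \
  prom/prometheus:v2.3.2

docker run -d --name thanos-sidecar \
  -v prom-data:/prometheus \
  improbable/thanos:master-2018-07-27-ecfce89 sidecar \
  --prometheus.url=http://prometheus:9090 \
  --tsdb.path=/prometheus
  # (S3 upload flags omitted)

docker run -d --name thanos-store \
  improbable/thanos:master-2018-07-27-ecfce89 store
  # (S3 bucket flags omitted)

docker run -d --name thanos-query \
  improbable/thanos:master-2018-07-27-ecfce89 query \
  --store=thanos-sidecar:10901 \
  --store=thanos-store:10901
```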
One thing to note: if we have multiple nodes (let’s say 4) with this configuration and put Thanos Query behind an Nginx-based load balancer, a query to any one of the Thanos Query instances brings all 4 hosts down (due to all 4 Prometheus instances getting OOM-killed).
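The load-balanced setup is essentially the following (a hypothetical Nginx sketch; hostnames and ports are placeholders, not our real configuration):

```
# Hypothetical: round-robin over the four Thanos Query HTTP endpoints.
upstream thanos_query {
    server node1.internal:10902;
    server node2.internal:10902;
    server node3.internal:10902;
    server node4.internal:10902;
}

server {
    listen 80;
    server_name thanosquery;

    location / {
        proxy_pass http://thanos_query;
    }
}
```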
Environment:
- OS: CentOS Linux 7 (Core)
- Kernel: 3.10.0-693.21.1.el7.x86_64
- Docker: 17.12.0-ce, build c97c6d6
- Memory: 32GB
Storage:
- S3 size: 1363 GB
- S3 objects: 5338
We are not sure whether the issue is caused by Prometheus or Thanos. Maybe it’s somehow directly related to our setup (for example, the join_joined_left_vs_right_timestamp_diff_bucket metric having a huge number of different labels, which results in a large number of distinct time series that cannot be handled when running the mentioned query). Anyhow, any guidance or tips would be really appreciated.
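If it helps to narrow this down, the number of series behind the failing query can be checked with an instant query run directly against one Prometheus (same matchers as in the query above, just counted instead of aggregated):

```
count(
  join_joined_left_vs_right_timestamp_diff_bucket{partition=~"0|1|2|3|4|5|6|7|8|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35",marathon_app=~"/some/app/region/join.*"}
)
```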
About this issue
- State: closed
- Created 6 years ago
- Reactions: 3
- Comments: 20 (14 by maintainers)
Does anybody have an idea what the state of this is? We currently have something that looks like the same issue. When Grafana queries Prometheus directly, the dashboard finishes in just a couple of seconds. When querying through Thanos Query, the memory usage of both the sidecar and Prometheus blows up, which eventually leads to OOM.
Plus this: https://github.com/improbable-eng/thanos/issues/488 (:
https://github.com/prometheus/prometheus/pull/4532