thanos: Query via Thanos causes Prometheus to OOM
Thanos, Prometheus and Golang version used
- Thanos components: improbable/thanos:master-2018-07-27-ecfce89
- Prometheus: prom/prometheus:v2.3.2
What happened
The query provided below times out (after the default 2-minute query timeout), but before it does, Prometheus gets OOM-killed.
What you expected to happen
Either the query completes successfully, or it times out (assuming it’s too complex) without bringing Prometheus down.
How to reproduce it (as minimally and precisely as possible):
We are using the following command to perform the query:
curl "https://thanosquery/api/v1/query_range?query=label_replace(%0A%20%20histogram_quantile(%0A%20%20%20%200.999%2C%0A%20%20%20%20sum(%20%20%20%20%20%20rate(join_joined_left_vs_right_timestamp_diff_bucket%7Bpartition%3D~%220%7C1%7C2%7C3%7C4%7C5%7C6%7C7%7C8%7C18%7C19%7C20%7C21%7C22%7C23%7C24%7C25%7C26%7C27%7C28%7C29%7C30%7C31%7C32%7C33%7C34%7C35%22%2Cmarathon_app%3D~%22%2Fsome%2Fapp%2Fregion%2Fjoin.*%22%7D%5B5m%5D)%0A%20%20%20%20)%20by%20(le%2C%20marathon_app%2C%20partition)%0A%20%20)%2C%20%0A%20%20%22region%22%2C%20%22%241%22%2C%20%22marathon_app%22%2C%20%22%2Fsome%2Fapp%2F(.*)%2Fjoin.*%22%0A)&start=1532343000&end=1532948400&step=600"
Running this once or twice always brings our nodes down. If we run the same query directly against Prometheus (not via Thanos Query), it completes successfully and quite fast (we keep 24h of data on Prometheus).
Anything else we need to know
We are running independent Docker hosts with the following containers:
- Prometheus
- Thanos Query
- Thanos Store
- Thanos Sidecar
Thanos Compactor runs independently on a different host. We also tried running Thanos Store on a different host from the other containers.
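For reference, here is a rough sketch of one host’s containers as plain docker run commands (the image tags are the ones listed above; container names, ports and network wiring are illustrative assumptions, and the object-storage flags for the sidecar and store are omitted):

```
# Sketch only; assumes a user-defined Docker network so containers resolve each
# other by name, and that Prometheus's TSDB directory is shared with the sidecar.
docker run -d --name prometheus \
  -v prom-data:/prometheus \
  prom/prometheus:v2.3.2

docker run -d --name thanos-sidecar \
  -v prom-data:/prometheus \
  improbable/thanos:master-2018-07-27-ecfce89 sidecar \
  --prometheus.url=http://prometheus:9090 \
  --tsdb.path=/prometheus
  # (S3 upload flags omitted)

docker run -d --name thanos-store \
  improbable/thanos:master-2018-07-27-ecfce89 store
  # (S3 bucket flags omitted)

docker run -d --name thanos-query \
  improbable/thanos:master-2018-07-27-ecfce89 query \
  --store=thanos-sidecar:10901 \
  --store=thanos-store:10901
```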
One thing to note: if we have multiple nodes (let’s say 4) with this configuration and put Thanos Query behind an Nginx-based load balancer, a query to any one of the Thanos Query instances brings all 4 hosts down (due to all 4 Prometheus instances getting OOM-killed).
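The load-balanced setup is essentially the following (a hypothetical Nginx sketch; hostnames and ports are placeholders, not our real configuration):

```
# Hypothetical: round-robin over the four Thanos Query HTTP endpoints.
upstream thanos_query {
    server node1.internal:10902;
    server node2.internal:10902;
    server node3.internal:10902;
    server node4.internal:10902;
}

server {
    listen 80;
    server_name thanosquery;

    location / {
        proxy_pass http://thanos_query;
    }
}
```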
Environment:
- OS: CentOS Linux 7 (Core)
- Kernel: 3.10.0-693.21.1.el7.x86_64
- Docker: 17.12.0-ce, build c97c6d6
- Memory: 32GB
Storage:
- S3 size: 1363 GB
- S3 objects: 5338
We are not sure whether the issue is caused by Prometheus or Thanos. Maybe it’s somehow directly related to our setup (for example, the join_joined_left_vs_right_timestamp_diff_bucket metric having a huge number of different labels, which results in a large number of distinct time series that cannot be handled when running the mentioned query). Anyhow, any guidance or tips would be really appreciated.
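If it helps to narrow this down, the number of series behind the failing query can be checked with an instant query run directly against one Prometheus (same matchers as in the query above, just counted instead of aggregated):

```
count(
  join_joined_left_vs_right_timestamp_diff_bucket{partition=~"0|1|2|3|4|5|6|7|8|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35",marathon_app=~"/some/app/region/join.*"}
)
```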
About this issue
- State: closed
- Created 6 years ago
- Reactions: 3
- Comments: 20 (14 by maintainers)
Does anybody have an idea what the state of this is? We currently have something that looks like the same issue. When Grafana queries Prometheus directly, the dashboard finishes in just a couple of seconds. When querying through Thanos Query, the memory usage of both the sidecar and Prometheus blows up, which eventually leads to OOM.
Plus this: https://github.com/improbable-eng/thanos/issues/488 (:
https://github.com/prometheus/prometheus/pull/4532