thanos: sidecar+query: Slower than prometheus on specific query
Thanos, Prometheus and Golang version used:
# thanos --version
thanos, version 0.10.1 (branch: HEAD, revision: bdcc35842f30ada375f65aaf748a104b43d56672)
build user: circleci@4e51e880cd24
build date: 20200124-07:36:32
go version: go1.13.1
Prometheus 2.16.0
What happened:
We currently manage two Kubernetes clusters (GKE), each running its own prometheus-operator and Prometheus with three replicas, with the Thanos Sidecar. We deploy Thanos Query in one of the clusters, set up to point to all Prometheus instances in both clusters. The Sidecars aren't uploading metrics to object storage yet, and Prometheus has a few months' worth of metrics.
There are some queries that are much slower in Thanos Query than in Prometheus.
For example:
sum(open_connections{job="harvestapp-websocket", namespace="default"}) by(pod)
In Prometheus, 7 days of data: ~1.5s
In Thanos Query, without deduplication: ~11s
In Thanos Query, with deduplication: ~17s
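For anyone who wants to reproduce the comparison, below is a minimal Go sketch, not taken from our setup, that times the same range query against both backends via `/api/v1/query_range`, which Prometheus and Thanos Query both serve. The endpoint URLs and the one-minute step are placeholders.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// timeRangeQuery runs one /api/v1/query_range call and reports how long it took.
// Both Prometheus and Thanos Query serve this API, so the same request works for either.
func timeRangeQuery(base, query string, start, end time.Time, step time.Duration) (time.Duration, error) {
	params := url.Values{}
	params.Set("query", query)
	params.Set("start", fmt.Sprintf("%d", start.Unix()))
	params.Set("end", fmt.Sprintf("%d", end.Unix()))
	params.Set("step", fmt.Sprintf("%d", int(step.Seconds())))

	began := time.Now()
	resp, err := http.Get(base + "/api/v1/query_range?" + params.Encode())
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	// Drain the body so the measurement includes transferring the full result.
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, err
	}
	return time.Since(began), nil
}

func main() {
	const q = `sum(open_connections{job="harvestapp-websocket", namespace="default"}) by (pod)`
	end := time.Now()
	start := end.Add(-7 * 24 * time.Hour)

	// Placeholder endpoints; substitute the real Prometheus and Thanos Query addresses.
	for _, base := range []string{"http://prometheus:9090", "http://thanos-query:9090"} {
		d, err := timeRangeQuery(base, q, start, end, time.Minute)
		if err != nil {
			fmt.Println(base, "error:", err)
			continue
		}
		fmt.Println(base, "took", d)
	}
}
```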
We recently tried promxy to see if it had the same problem, and it seems to perform only marginally slower than Prometheus.
This is a metric we track per pod, and in this case it's an autoscaled deployment (we actually autoscale on this specific metric), hence the different numbers of time series in the screenshots; the count kept growing while I was writing this due to the time of day. We don't necessarily need to use Thanos Query in this specific case, but we're pointing Grafana's default data source at it to simplify things. This was one very specific example query, but our Grafana panel loads a few more things, which, with Thanos Query, time out if we load anything more than a few hours' worth of metrics.
I understand this might not be the ideal situation for Prometheus, having a label that keeps growing in cardinality over time. I would still expect Thanos Query to be only a bit slower (a small percentage slower than the slowest Prometheus, not 10x). Could this be considered a bug? Is it to be expected? Will we see the same problem if we deploy the Store component?
I thought it might be related to https://github.com/thanos-io/thanos/issues/2222, but that one only seems to cover deduplication (in our case deduplication is indeed slower than without it), whereas for us things are slower regardless.
Thanks a lot for your time! We'd really like to use Thanos, it's been a pleasure to set up everything and it'd be the ideal solution for us.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 5
- Comments: 26 (7 by maintainers)
ChunkIterator is in progress. Actually looking to finish that this week!
Thanks for this. This is because promxy uses the Query API, which is quite scary (data goes through PromQL, so it's not really raw); this might cause some correctness issues. They do allow choosing remote_read instead, which then uses a similar path to the sidecar, but the old version of it, so it will suffer from major memory consumption on both Prometheus and promxy. I am also not familiar with how they do deduplication, but it's probably a similar approach to our penalty-based algorithm. Thanos does it on the mentioned remote read Prometheus API, so at the pure raw-data level. Since this API was old and had lots of limitations, we keep an eye on it and actually maintain remote read on the Prometheus side. For example, we added much-needed streaming, which lets the sidecar use a roughly constant amount of memory per request (details). Anyway, I don't think deduplication latency is the reason in your case. The problem is the missing ChunkIterator in TSDB, which I am currently adding to Prometheus. 🤗 (https://github.com/prometheus/prometheus/pull/7069 & https://github.com/prometheus/prometheus/pull/7059) Will notify here once it's ready for some testing.
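To illustrate the idea behind the missing ChunkIterator (a conceptual sketch with made-up types, not the actual Prometheus or Thanos code): without a chunk-level iterator, the remote-read path has to decode every sample and encode it again for the response, whereas a chunk iterator lets it forward the already-encoded chunks.

```go
package main

import "fmt"

// Sample and Chunk are illustrative stand-ins, not the real Prometheus types.
type Sample struct {
	T int64
	V float64
}

// Chunk is a block of already-encoded samples, as it sits in TSDB.
type Chunk struct {
	MinT, MaxT int64
	Encoded    []byte // e.g. XOR-compressed samples
}

// Series stands in for one TSDB series holding its chunks.
type Series struct {
	Chunks []Chunk
}

// samplePath decodes every sample only to re-encode it for the response.
// This is roughly the per-sample work a sample-level iterator forces.
func samplePath(s Series, decode func(Chunk) []Sample, encode func([]Sample) []byte) [][]byte {
	var out [][]byte
	for _, c := range s.Chunks {
		samples := decode(c)               // decode every sample...
		out = append(out, encode(samples)) // ...and immediately encode them again
	}
	return out
}

// chunkPath forwards the already-encoded chunks as-is; this is what a
// chunk-level iterator enables: no per-sample decode/encode on the hot path.
func chunkPath(s Series) [][]byte {
	out := make([][]byte, 0, len(s.Chunks))
	for _, c := range s.Chunks {
		out = append(out, c.Encoded)
	}
	return out
}

func main() {
	s := Series{Chunks: []Chunk{{MinT: 0, MaxT: 120, Encoded: []byte{0x01, 0x02}}}}

	// Chunk path: just hand the encoded bytes to the response.
	fmt.Println("chunk path payloads:", len(chunkPath(s)))

	// Sample path: round-trip through decoded samples first (stub codecs here).
	decode := func(c Chunk) []Sample { return []Sample{{T: c.MinT, V: 1}, {T: c.MaxT, V: 2}} }
	encode := func(ss []Sample) []byte { return make([]byte, len(ss)) }
	fmt.Println("sample path payloads:", len(samplePath(s, decode, encode)))
}
```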
Hm… this is such an old issue that first let's figure out if this is still happening (the latency is visible and it particularly has to do with the Chunk Iterator part).
I know @spaparaju is on https://github.com/thanos-io/thanos/issues/4304 which is similar. Let’s investigate (:
Oops, it was closed, but I don't think this is resolved.
It's been a while, but I think we were still seeing this issue after the update. Happy to provide more data if it helps get to the bottom of this! Improving the query latency here would be huge for us!
Hey, yes, it was merged, and nothing needs to be changed on our side. We are waiting for a Prometheus release AND some tests that would confirm this actually helped. With the chunk iterator there is indeed no theoretical reason why it could be slower, but let's confirm before closing this 🤗
Still in progress 😢