thanos: query: auto-downsampling causes inaccurate output of metrics (inflated values)

Thanos, Prometheus and Golang version used

image: improbable/thanos:v0.3.1

and

image: improbable/thanos:v0.3.2

I was able to replicate it on both versions.

Prometheus 2.7.1

What happened: When --query.auto-downsampling is enabled on the query component, metrics beyond two days are inflated to multiples of the actual result. In our case, we have seen values roughly 10x too high.

PromQL:

sum(dest_bps_in{hostname=~"$hostname", exported_namespace=~"$namespace"}) by (service_name, exported_namespace) * 8
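If it helps to rule Grafana in or out, the same comparison can be made directly against the querier's HTTP API by forcing the maximum source resolution per request. This is only a sketch: the host, time range and step are placeholders, the label matchers are omitted, and support for the max_source_resolution parameter may vary between Thanos versions.

    # raw data only
    curl -sG 'http://<thanos-query>/api/v1/query_range' \
      --data-urlencode 'query=sum(dest_bps_in) by (service_name, exported_namespace) * 8' \
      --data-urlencode 'start=2019-02-12T00:00:00Z' \
      --data-urlencode 'end=2019-03-14T00:00:00Z' \
      --data-urlencode 'step=1h' \
      --data-urlencode 'max_source_resolution=0s'

    # same request again, but allowing 5m downsampled data:
    #   --data-urlencode 'max_source_resolution=5m'

If the inflation shows up with max_source_resolution=5m but not with 0s, the problem is in the querier/store path rather than in Grafana.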

Auto-downsampling enabled (Grafana v5.3.4): [screenshot: “(k8s) BalanceD Service At A Glance”]

Auto-downsampling disabled (Grafana v5.3.4), where these metrics are accurate: [screenshot: “(k8s) BalanceD Service At A Glance”]

Another one with auto-downsampling enabled (Grafana v6.0.1): [screenshot: “New dashboard”]

What you expected to happen: Metrics to be accurate regardless of whether auto-downsampling is enabled.

How to reproduce it (as minimally and precisely as possible):

        - --retention.resolution-raw=30d
        - --retention.resolution-5m=90d
        - --retention.resolution-1h=365d

on the compactor.

  • Enable auto-downsampling on the querier (see the flag sketch after this list) and observe any metric over a 30-day window in Grafana. The values are inaccurate, and zooming back in to a smaller window makes them accurate again.
  • Disable auto-downsampling and observe the same metric over a 30-day window in Grafana. The values are accurate.
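For completeness, the querier side of the reproduction is just the auto-downsampling flag on the query component. A minimal sketch of the container args (the store addresses are placeholders for your own deployment; only --query.auto-downsampling is the flag under discussion):

        - query
        - --store=<store-gateway-address>
        - --store=<sidecar-address>
        - --query.auto-downsampling

Dropping the last line gives the "disabled" case.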

Full logs to relevant components: thanos bucket inspect output

Logs

|            ULID            |        FROM         |        UNTIL        |     RANGE     |   UNTIL-COMP   |  #SERIES  |    #SAMPLES    |   #CHUNKS   | COMP-LEVEL | COMP-FAILED |                                                                   LABELS                                                                   | RESOLUTION |  SOURCE   |
|----------------------------|---------------------|---------------------|---------------|----------------|-----------|----------------|-------------|------------|-------------|--------------------------------------------------------------------------------------------------------------------------------------------|------------|-----------|
| 01D5C9DF7VXBRQPCR8P9HF0ERH | 26-02-2019 15:51:04 | 06-03-2019 16:00:00 | 192h8m55.639s | -152h8m55.639s | 1,562,415 | 44,642,000,050 | 373,976,728 | 4          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5D42HA6ES2XJM5XNN0EZ7VT | 26-02-2019 15:51:04 | 06-03-2019 16:00:00 | 192h8m55.663s | -152h8m55.663s | 1,562,599 | 44,651,399,075 | 373,977,040 | 4          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5D8J10JXA8FTFRPGSA52CNN | 26-02-2019 15:51:04 | 06-03-2019 16:00:00 | 192h8m55.639s | 47h51m4.361s   | 1,562,415 | 3,459,134,615  | 25,743,172  | 4          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 5m0s       | compactor |
| 01D5DF724T11E1WX5W6SAPJJ4R | 26-02-2019 15:51:04 | 06-03-2019 16:00:00 | 192h8m55.663s | 47h51m4.337s   | 1,562,599 | 3,460,647,340  | 25,743,356  | 4          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 5m0s       | compactor |
| 01D5G8YD96MC0JG89X1BM60ANE | 06-03-2019 16:00:00 | 08-03-2019 16:00:00 | 48h0m0s       | -8h0m0s        | 1,588,659 | 11,285,984,868 | 94,100,935  | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5G9XG672QFRTWJ4C7CAMZNF | 06-03-2019 16:00:00 | 08-03-2019 16:00:00 | 48h0m0s       | -8h0m0s        | 1,588,728 | 11,286,004,997 | 94,101,042  | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5GAZWSJXNGCG44NVTPJKC6J | 06-03-2019 16:00:00 | 08-03-2019 16:00:00 | 48h0m0s       | 192h0m0s       | 1,588,658 | 874,836,605    | 7,651,554   | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 5m0s       | compactor |
| 01D5GCD7FRDZS5YBS78SGF4X0S | 06-03-2019 16:00:00 | 08-03-2019 16:00:00 | 48h0m0s       | 192h0m0s       | 1,588,728 | 874,836,712    | 7,651,624   | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 5m0s       | compactor |
| 01D5NE0GQPJXX86J8F1N64R0G5 | 08-03-2019 16:00:00 | 10-03-2019 17:00:00 | 48h0m0s       | -8h0m0s        | 1,592,420 | 11,349,456,418 | 94,636,942  | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5NEYKFR32V0Y4CNP4F20KGC | 08-03-2019 16:00:00 | 10-03-2019 17:00:00 | 48h0m0s       | -8h0m0s        | 1,592,436 | 11,349,476,561 | 94,636,972  | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5NFZS7T5KX10N84NVAGKTW2 | 08-03-2019 16:00:00 | 10-03-2019 17:00:00 | 48h0m0s       | 192h0m0s       | 1,592,419 | 880,110,425    | 7,696,584   | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 5m0s       | compactor |
| 01D5NHF3JXX7JZR52HSYQVW86B | 08-03-2019 16:00:00 | 10-03-2019 17:00:00 | 48h0m0s       | 192h0m0s       | 1,592,435 | 880,110,409    | 7,696,600   | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 5m0s       | compactor |
| 01D5TRKJ8ZTT4RA81F8X2H08HA | 10-03-2019 17:00:00 | 12-03-2019 17:00:00 | 48h0m0s       | -8h0m0s        | 1,659,932 | 11,414,008,950 | 95,216,437  | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5TSEMGHA3P1FCWC5P06J6QB | 10-03-2019 17:00:00 | 12-03-2019 17:00:00 | 48h0m0s       | -8h0m0s        | 1,660,023 | 11,414,028,755 | 95,216,496  | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5TTAE3NYKETYZPQ4B35G79P | 10-03-2019 17:00:00 | 12-03-2019 17:00:00 | 48h0m0s       | 192h0m0s       | 1,659,871 | 885,270,232    | 7,788,901   | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 5m0s       | compactor |
| 01D5TVKN3QK4FCMNZ14H9BC2HZ | 10-03-2019 17:00:00 | 12-03-2019 17:00:00 | 48h0m0s       | 192h0m0s       | 1,659,962 | 885,270,286    | 7,788,992   | 3          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 5m0s       | compactor |
| 01D5VDBQD2YN5YSTETC1HFW89K | 12-03-2019 17:00:00 | 13-03-2019 01:00:00 | 8h0m0s        | 32h0m0s        | 1,552,580 | 1,893,087,654  | 15,924,597  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5VDJSHWEKASM3Q7AG19GDA1 | 12-03-2019 17:00:00 | 13-03-2019 01:00:00 | 8h0m0s        | 32h0m0s        | 1,552,545 | 1,892,299,071  | 15,924,289  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5W93Q11Y8XGN10D3P8W4AJG | 13-03-2019 01:00:00 | 13-03-2019 09:00:00 | 8h0m0s        | 32h0m0s        | 1,580,955 | 1,910,657,289  | 15,953,592  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5W9JYC4D4MRYN53F5NS8D55 | 13-03-2019 01:00:00 | 13-03-2019 09:00:00 | 8h0m0s        | 32h0m0s        | 1,580,949 | 1,910,653,297  | 15,953,550  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5X49TXE2FPRN1XKPAJCAZNB | 13-03-2019 09:00:00 | 13-03-2019 17:00:00 | 8h0m0s        | 32h0m0s        | 1,546,268 | 1,891,744,574  | 15,479,228  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5X4G6QTG00SBS74Z3ZS5WF6 | 13-03-2019 09:00:00 | 13-03-2019 17:00:00 | 8h0m0s        | 32h0m0s        | 1,546,270 | 1,891,253,611  | 15,479,221  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5XZRJBRTZ8ZRE6YQPDHXQNA | 13-03-2019 17:00:00 | 14-03-2019 01:00:00 | 8h0m0s        | 32h0m0s        | 1,556,778 | 1,911,067,151  | 15,936,507  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | compactor |
| 01D5Y002A7A24Y55WWQ8V84Z7V | 13-03-2019 17:00:00 | 14-03-2019 01:00:00 | 8h0m0s        | 32h0m0s        | 1,556,774 | 1,911,064,040  | 15,936,467  | 2          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | compactor |
| 01D5XXQPZJ96WTWG5Z7PZBBPDD | 14-03-2019 01:00:00 | 14-03-2019 03:00:00 | 2h0m0s        | 38h0m0s        | 1,544,246 | 477,747,855    | 3,981,488   | 1          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | sidecar   |
| 01D5XXQPZM56832FMDBFCQ3SPP | 14-03-2019 01:00:00 | 14-03-2019 03:00:00 | 2h0m0s        | 38h0m0s        | 1,544,238 | 477,748,831    | 3,981,484   | 1          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | sidecar   |
| 01D5Y4KE4GR3HDZAY8K8SBZH1X | 14-03-2019 03:00:00 | 14-03-2019 05:00:00 | 2h0m0s        | 38h0m0s        | 1,546,542 | 477,752,906    | 3,983,785   | 1          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | sidecar   |
| 01D5Y4KE4HXB5M4E85HK02N54P | 14-03-2019 03:00:00 | 14-03-2019 05:00:00 | 2h0m0s        | 38h0m0s        | 1,546,548 | 477,753,874    | 3,983,791   | 1          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | sidecar   |
| 01D5YBF5C7BN8DJ27C7493GRAJ | 14-03-2019 05:00:00 | 14-03-2019 07:00:00 | 2h0m0s        | 38h0m0s        | 1,544,508 | 477,740,036    | 3,981,739   | 1          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-1 | 0s         | sidecar   |
| 01D5YBF5CJPTW2X4G5KZ58NS14 | 14-03-2019 05:00:00 | 14-03-2019 07:00:00 | 2h0m0s        | 38h0m0s        | 1,544,510 | 477,740,921    | 3,981,741   | 1          | false       | cluster_name=xx.xxxx.xxxxxxxxx.com,prometheus=monitoring/k8s-kube-prom-prometheus,prometheus_replica=prometheus-k8s-kube-prom-prometheus-0 | 0s         | sidecar   |


Anything else we need to know: Using Grafana v5.3.4 and v6.0.1. Could this be a Grafana bug?

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 23 (12 by maintainers)

Most upvoted comments

Thanks for the report. I have seen a similar problem on our prod as well, but fortunately we can always fall back to raw data. I am pretty sure this path is not well tested, so some bug in choosing blocks might be happening.

High priority IMO.

Just rolled out 0.6.0-rc.0 for the querier and it looks good. Both sum and avg are returning what we expect with Max 5m Downsampling.

So yeah, you’ve demonstrated Spike Erosion quite well here. It’s a well-understood side effect when you downsample by averaging, or when your graph display toolkit uses weighted averages to dynamically resize the graph.

Given that we have min, max, sum, count, and (therefore) average for each downsampled data point, I bet that we are using the sum value of the downsampled data point when we use the sum() function, which leads to these inflated results. However, when sum_over_time() is used, that is exactly the value we want.

Since Spike Erosion is usually controlled by choosing the downsampling aggregation function, do we need to expose a way to select the min, max, sum, or count when working with downsampled data?
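For what it’s worth, my understanding (which may be version-dependent, so treat it as an assumption) is that the *_over_time functions already pick the matching downsampled aggregate, so the aggregation can be selected explicitly in PromQL instead of relying on the implicit average:

    max_over_time(dest_bps_in[5m])    # should read the max aggregate
    min_over_time(dest_bps_in[5m])    # should read the min aggregate
    avg_over_time(dest_bps_in[5m])    # sum/count, i.e. the average

That still leaves the question of what plain sum() / avg() across series should read by default.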

Another way to handle Spike Erosion is to aggregate histograms and build a quantile estimate from them. That too is going to require sum() over counter-type data and will probably work best with max as the aggregation function.
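If anyone goes the histogram route, the usual Prometheus pattern is a quantile estimate over summed bucket rates; the metric and label names below are made up for illustration:

    histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket[5m])) by (le))

which is exactly the "sum() over counter-type data" case mentioned above.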

That’s because the thanos-query UI works the same way. See below for examples.

It’s a pretty generic tool that draws values for whatever time series you throw at it.

It doesn’t always make sense to group the values when “zooming out”. For example, if you have a time series for some status, like up, it wouldn’t make sense to display “3” just because the UI step is 3 times the scraping period. The UI has no way of knowing what the value represents, so it goes with the safest route, which is sampling, i.e. dropping values.

In my opinion there should be broader documentation in Thanos about how this works and how it interacts with graphing tools. I think the most surprising things happen when the graphing tool uses a resolution in between downsampling intervals, say 20 minutes in the case of Thanos. If you sum your values, you’ll get a partial sum for that period, which seems odd to me.

The way to look at this is that downsampling loses resolution. Instead of five values (one per minute), you only get one per 5-minute window, and it isn’t equal to any of them. (You actually get several aggregates: min, max, sum, count; see #813. These retain some idea of what the data distribution was.)

I think what’s a bit confusing is that asking for just one sample gives an average, so the value isn’t obviously wrong when compared to raw data (though note that if the raw data is somewhat random, they don’t match!). People probably don’t expect that summing across dimensions would actually sum different values than what’s returned without the sum() for an otherwise identical query.

The graphs below have samples scraped every minute. This is thanos-query UI v0.5.0.


Sampling and missing data: same series, different “zoom”. The first is over the last 12h. Notice the peak at 60. [screenshot]

The second is over the last two days. Notice the maximum barely hits 30. There’s information missing (focus on the graph that’s present; the series was only created yesterday). [screenshot]


Same series, 1 scrape per minute in raw data, downsampled to 5 minutes. See how the shape of the curve changes with what is displayed.

  • 60s resolution, only raw data. Notice the peak above 8000. [screenshot: 60s, raw]

  • 60s resolution, downsampled. The peak is gone, and there are 5m plateaus. This is a smoothed version, because it’s the average for each plateau. [screenshot: 60s, downsampled]

  • 60s resolution, downsampled, max. The peak is back, and the shape is roughly the same. [screenshot: 60s, downsampled, max]

  • 300s, raw. This is where it gets interesting. Notice the missing features. But there are no plateaus. [screenshot: 300s, raw]

  • 300s, downsampled. It’s pretty much the same, just smoothed. Instead of sampling 1 of 5 values, it uses the average. [screenshot: 300s, downsampled]

  • 300s, downsampled, max. The peak is back. [screenshot: 300s, downsampled, max]

Up until this point, all values are roughly the same as the original raw data. The next one is the confusing part. Note there’s just one series, so using sum or avg on the raw data wouldn’t change the values. But on the downsampled data it does, and the result is… 5 times larger! [screenshot: 300s, downsampled, sum]
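To put numbers on that last graph: with one scrape per minute and 5-minute downsampling, each downsampled point covers five raw samples, so reading the sum aggregate instead of the average gives roughly five times the raw value. For a flat series (values assumed for illustration):

    raw samples in one 5m window:  10, 10, 10, 10, 10
    downsampled aggregates:        min = 10, max = 10, count = 5, sum = 50
    plotted from avg (sum/count):  10   (matches the raw data)
    plotted from sum:              50   (5x, as in the graph above)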

The difference is the way those charts are read. The last is read “in the interval between one point and the other, there were this many requests”. That’s an aggregation. All the others read “at some time between the last two scrapes there were this many requests per unit of time”.

You’ll have to check with the series for this unit, and it’s always the same. If the range is 5 minutes, you probably don’t care. If it’s a day, and you only had 2000 requests, it makes a big difference to know whether those were for the whole day or just during one second.

@vladvasiliu I have the same issue in the thanos-query UI, not just in Grafana.

I have the same issue with v0.5.0

We fixed a nasty bug (: Thanks to this: https://github.com/improbable-eng/thanos/pull/1146

Thanks to everyone involved. ~Closing, we can reopen if anyone can repro it with v0.5.0 or the newest master.~