VictoriaMetrics: query results may incorrectly overlap time series

Describe the bug

If the query step is much greater than the data interval (e.g. more than 4x), and two series are adjacent in time but do not overlap, the query output or an aggregation over it may incorrectly show the series as overlapping.

A typical example is build_info{version="..."}. When a new version is deployed, instances stop updating build_info{version="1.0"} and start updating build_info{version="1.1"}. Due to this bug, at low zoom resolution there is a point in time where count(build_info{instance="..."}) returns 2, even though the raw data points do not overlap.

The equivalent query on Prometheus (Thanos) does not exhibit the problem.

To Reproduce

Raw datapoints:

build_info{instance="foo"}[20m]

time                 version
2020-07-22 00:45:10  20.05.2
2020-07-22 00:46:10  20.05.2
2020-07-22 00:47:10  20.05.2
2020-07-22 00:48:10  20.05.2
2020-07-22 00:51:56  20.05.3
2020-07-22 00:52:56  20.05.3
2020-07-22 00:58:56  20.05.3
2020-07-22 01:02:11  20.05.3

query, step 60 (no overlap)

build_info{instance="foo"}

2020-07-22 00:47:00  20.05.2
2020-07-22 00:48:00  20.05.2
2020-07-22 00:49:00  20.05.2
2020-07-22 00:52:00  20.05.3
2020-07-22 00:53:00  20.05.3
2020-07-22 00:54:00  20.05.3

query, step 240

build_info{instance="foo"}

2020-07-22 00:48:00  20.05.2
2020-07-22 00:52:00  20.05.2    ** overlap **
2020-07-22 00:52:00  20.05.3    ** overlap **
2020-07-22 00:56:00  20.05.3
2020-07-22 01:00:00  20.05.3
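
The overlap at 00:52:00 can be illustrated with a small model. This is only a sketch: it assumes each output point includes every series that has a raw sample within one step before the evaluation time. That lookback rule is a simplification chosen for illustration and is not taken from the VictoriaMetrics source; `visible_series` and `parse` are hypothetical helper names.

```python
from datetime import datetime, timedelta

# Raw samples from the report above, keyed by version label.
RAW = {
    "20.05.2": ["00:45:10", "00:46:10", "00:47:10", "00:48:10"],
    "20.05.3": ["00:51:56", "00:52:56", "00:58:56", "01:02:11"],
}

def parse(hms):
    # All samples in the report fall on 2020-07-22.
    return datetime.strptime("2020-07-22 " + hms, "%Y-%m-%d %H:%M:%S")

def visible_series(eval_time, lookback):
    """Return versions with a raw sample in (eval_time - lookback, eval_time].

    ASSUMPTION: the lookback window equals the query step. This is a
    simplified model, not the actual VictoriaMetrics logic.
    """
    out = []
    for version, stamps in RAW.items():
        if any(eval_time - lookback < parse(s) <= eval_time for s in stamps):
            out.append(version)
    return out

t = parse("00:52:00")
print(visible_series(t, timedelta(seconds=60)))   # ['20.05.3']
print(visible_series(t, timedelta(seconds=240)))  # ['20.05.2', '20.05.3']
```

Under this assumption, the last 20.05.2 sample (00:48:10) and the first 20.05.3 sample (00:51:56) both land in the 240-second window ending at 00:52:00, so a count over that point yields 2.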

Expected behavior

If two series do not overlap in time in the raw data, the query should not treat them as overlapping when evaluating a single output interval.

An example implementation would be to treat “start” and “end” points of a series differently when quantizing raw data points into time buckets: include the series in the bucket if 1) raw points are continuous in the bucket range, or 2) the series starts in the bucket range. Therefore, if a series ends in a bucket range, it is not included. It’s similar to the concept of an open-ended range.
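
A minimal sketch of the rule described above, assuming a bucket covers (bucket_end - step, bucket_end] and that timestamps are seconds since midnight. The function names `to_s` and `include_in_bucket` are hypothetical, and this is not how VictoriaMetrics implements it.

```python
def to_s(hms):
    """Convert 'HH:MM:SS' to seconds since midnight."""
    h, m, s = map(int, hms.split(":"))
    return h * 3600 + m * 60 + s

def include_in_bucket(stamps, bucket_end, step):
    """Decide whether a series belongs in the bucket (bucket_end - step, bucket_end].

    ASSUMPTION: `stamps` holds all known sample times of the series, so the
    sketch can tell whether the series starts in or continues past the bucket.
    """
    bucket_start = bucket_end - step
    if not any(bucket_start < t <= bucket_end for t in stamps):
        return False                          # no raw points in this bucket
    starts_here = min(stamps) > bucket_start  # first sample lies in this bucket
    continues = max(stamps) > bucket_end      # series has samples after the bucket
    # A series that only ends in the bucket satisfies neither condition.
    return starts_here or continues

old = [to_s(s) for s in ["00:45:10", "00:46:10", "00:47:10", "00:48:10"]]
new = [to_s(s) for s in ["00:51:56", "00:52:56", "00:58:56", "01:02:11"]]

end = to_s("00:52:00")
print(include_in_bucket(old, end, 240))  # False: 20.05.2 ends in this bucket
print(include_in_bucket(new, end, 240))  # True: 20.05.3 starts in this bucket
```

With the raw data from this report, only 20.05.3 is counted at 00:52:00 for step 240. Note that this sketch treats any series whose last known sample falls inside the bucket as ended, so a series that is still being written would drop out of the newest bucket; a real implementation would need to handle that case separately.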

Screenshots

Example graph showing artificial spikes in count(build_info) when a deployment causes the version label to change: [screenshot]

Version

victoria-metrics-20200815-125320-tags-v1.40.0-0-ged00eb3f3

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (5 by maintainers)

Most upvoted comments

Thank you. For discrepancies like this, it would be nice for VM to have unit tests against the output of the Prometheus query library.

@belm0 , thanks for the detailed bug report and the proposed solution! The solution looks good. We’ll try implementing it and see how it works.

Now I see the opposite problem, where series unexpectedly disappear before they end (for example at head of the series).

I think it’s related to my comment on the commit about correctness of the 90% heuristic.

[screenshot]