VictoriaMetrics: [vmselect] vmselect gives inconsistent results around the timestamp it was started

Describe the bug

I’m not sure what is causing it, but vmselect returns inconsistent results around the timestamp at which it starts. We run VictoriaMetrics in Kubernetes with multiple pods per service and an autoscaler, which means that vmselect pods come and go. After upgrading to v1.95.1, we noticed that each pod returns inconsistent results around the time it starts. See the GIF below:

[GIF: output]

Those big drops correspond to the time the vmselect pod was booted.

When running vmselect version v1.94.0 things work normally, see screenshot below:

[Screenshot 2023-11-24 at 11 15 24]

I’ve switched back and forth between v1.94.0 and v1.95.1 multiple times and was always able to replicate the issue in v1.95.1.

To Reproduce

I don’t know if this happens with all timeseries, but I was able to replicate it when executing the following query:

sum by(account) (up{app=~"myapp.*"})

Version

/vmselect-prod --version
vmselect-20231116-195457-tags-v1.95.1-cluster-0-g1a15b0f57

Logs

(nothing out of the ordinary shows in the logs)

Screenshots

No response

Used command-line flags

No response

Additional information

No response

About this issue

  • State: closed
  • Created 7 months ago
  • Reactions: 1
  • Comments: 15 (9 by maintainers)

Most upvoted comments

FYI, the fix for this issue has been included in VictoriaMetrics v1.96.0.

Built and ran vmselect as indicated in the last comment, and the weird inconsistencies I experienced before were gone 🥳

Thank you all for the help and guidance in debugging and in gathering the necessary info, and thank you for the fix ❤️

I’ll close the issue as it will be fixed in the next release. Thanks once again 🙇

It’s an old bug in the response caching mechanism. Currently, the cache works incorrectly for requests with step < scrape_interval.

Since the cached and live parts of a request don’t overlap, it’s possible to get NaN values for datapoints that have no samples within such a small step.

Mitigation for this issue: disable the cache.
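One way to check whether the cache is involved is to send the same range query twice, once normally and once with the nocache=1 query arg, and compare the responses. The Go sketch below only builds the two URLs. The host name, port 8481, tenant 0, time range and 15s step are assumptions for illustration, and buildRangeQuery is a hypothetical helper, not part of VictoriaMetrics. To disable the cache for all requests, vmselect also has the -search.disableCache command-line flag.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildRangeQuery is a hypothetical helper that returns a Prometheus-style
// query_range URL. base is assumed to point at a vmselect tenant prefix.
func buildRangeQuery(base, query string, start, end int64, step string, noCache bool) string {
	params := url.Values{}
	params.Set("query", query)
	params.Set("start", fmt.Sprint(start))
	params.Set("end", fmt.Sprint(end))
	params.Set("step", step)
	if noCache {
		// Ask vmselect to bypass the response cache for this request only.
		params.Set("nocache", "1")
	}
	return base + "/api/v1/query_range?" + params.Encode()
}

func main() {
	// Assumed endpoint and time range; adjust to your deployment.
	base := "http://vmselect.example:8481/select/0/prometheus"
	q := `sum by(account) (up{app=~"myapp.*"})`

	fmt.Println(buildRangeQuery(base, q, 1700820000, 1700823600, "15s", false))
	fmt.Println(buildRangeQuery(base, q, 1700820000, 1700823600, "15s", true))
}
```

If the drops disappear in the nocache=1 response but not in the regular one, that points at the response cache rather than at the stored data.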

Possible solution: allow the cached and live parts to overlap, and merge the overlapping part.
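The failure mode and the proposed fix can be illustrated with a toy model in Go. This is a simplified sketch, not the VictoriaMetrics implementation. It assumes that each part of the response is evaluated only from the samples fetched for that part, that a datapoint takes the most recent fetched sample at or before its timestamp, and that cached values win in the overlapping range. All names (sample, evalRange, queryWithCache) are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// sample is one scraped data point (timestamps in seconds).
type sample struct {
	ts  int64
	val float64
}

// evalRange is a toy evaluator: for every step timestamp t in [start, end]
// it returns the most recent sample with fetchFrom <= ts <= t,
// or NaN if no such sample was fetched. samples must be sorted by ts.
func evalRange(samples []sample, fetchFrom, start, end, step int64) []float64 {
	var out []float64
	for t := start; t <= end; t += step {
		v := math.NaN()
		for _, s := range samples {
			if s.ts >= fetchFrom && s.ts <= t {
				v = s.val // the last match wins because samples are sorted
			}
		}
		out = append(out, v)
	}
	return out
}

// queryWithCache stitches a cached part covering [start, cacheEnd] with a
// live part. With overlap == 0 the live part starts right after cacheEnd.
// With overlap > 0 (a multiple of step) it starts earlier, and the
// overlapping live points are dropped in favor of the cached ones.
func queryWithCache(samples []sample, start, cacheEnd, end, step, overlap int64) []float64 {
	cached := evalRange(samples, start, start, cacheEnd, step)
	liveStart := cacheEnd + step - overlap
	live := evalRange(samples, liveStart, liveStart, end, step)
	return append(cached, live[overlap/step:]...)
}

// countNaN returns the number of NaN datapoints in vs.
func countNaN(vs []float64) int {
	n := 0
	for _, v := range vs {
		if math.IsNaN(v) {
			n++
		}
	}
	return n
}

func main() {
	// Assumed numbers: 30s scrape interval, 10s query step.
	const scrapeInterval, step int64 = 30, 10
	var samples []sample
	for ts := int64(0); ts <= 120; ts += scrapeInterval {
		samples = append(samples, sample{ts, 1})
	}

	noOverlap := queryWithCache(samples, 0, 60, 120, step, 0)
	withOverlap := queryWithCache(samples, 0, 60, 120, step, scrapeInterval)

	fmt.Println("NaN points without overlap:", countNaN(noOverlap))  // 2
	fmt.Println("NaN points with overlap:", countNaN(withOverlap)) // 0
}
```

In this model the live part without overlap starts at t=70 and sees no sample until t=90, so the points at t=70 and t=80 come out as NaN. Extending the live part backwards by one scrape interval lets it see the sample at t=60, and the gap disappears.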