VictoriaMetrics: [vmselect] vmselect gives inconsistent results around the timestamp it was started
Describe the bug
I’m not sure what is causing it, but vmselect returns inconsistent results around the time the vmselect process starts.
We run VictoriaMetrics in Kubernetes with multiple pods per service and with an autoscaler, which means that vmselect pods come and go.
After upgrading to v1.95.1, we noticed that around the time each pod starts, the results it returns are inconsistent. See the gif below:
Those big drops correspond to the times the vmselect pod was booted.
When running vmselect v1.94.0, things work normally; see the screenshot below:
I’ve switched back and forth between v1.94.0 and v1.95.1 multiple times and was always able to replicate the issue in v1.95.1.
To Reproduce
I don’t know if this happens with all time series, but I was able to replicate it when executing the following query:
sum by(account) (up{app=~"myapp.*"})
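For reference, a query like this can be replayed against the cluster vmselect HTTP API roughly as follows (a sketch only: the hostname, tenant "0", 1h window and 15s step are placeholders, not taken from the report); comparing responses returned shortly before and after a vmselect pod restart should show the discrepancy:

    # Hypothetical reproduction sketch against the cluster vmselect query API;
    # adjust host, port and tenant for your own setup.
    start=$(date -d '1 hour ago' +%s)   # GNU date
    end=$(date +%s)
    curl -G "http://vmselect:8481/select/0/prometheus/api/v1/query_range" \
      --data-urlencode 'query=sum by(account) (up{app=~"myapp.*"})' \
      --data-urlencode "start=${start}" \
      --data-urlencode "end=${end}" \
      --data-urlencode 'step=15s'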
Version
/vmselect-prod --version vmselect-20231116-195457-tags-v1.95.1-cluster-0-g1a15b0f57
Logs
(nothing out of the ordinary shows in the logs)
Screenshots
No response
Used command-line flags
No response
Additional information
No response
About this issue
- Original URL
- State: closed
- Created 7 months ago
- Reactions: 1
- Comments: 15 (9 by maintainers)
Commits related to this issue
- app/vmselect: properly adjust the lower bound for the time range where raw samples must be selected for default_rollup() function Previously the lower bound could be too small, which could result in ... — committed to VictoriaMetrics/VictoriaMetrics by valyala 7 months ago
FYI, the fix for this issue has been included in VictoriaMetrics v1.96.0.
Built and ran vmselect as indicated in the last comment, and the weird inconsistencies I experienced before were gone 🥳 Thank you all for the help and guidance in debugging and providing the necessary info, and thank you for the fix ❤️
I’ll close the issue as it will be fixed in the next release. Thanks once again 🙇
It’s an old bug in the response caching mechanism. Currently the cache works incorrectly for requests with step < scrape_interval. Since the cached and live parts of the request don’t overlap, it’s possible to get a NaN value for datapoints that have no raw samples within such a small step (for example, with scrape_interval=30s and step=15s, a step at the boundary between the cached and live parts may contain no raw sample at all).
A mitigation for this issue is to disable the cache.
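For example, as far as I know the cache can be switched off on vmselect with the -search.disableCache command-line flag (verify the flag name against the docs for your release); a nocache=1 query arg should also skip the cache for an individual query:

    # Workaround sketch: start vmselect with response caching disabled
    /vmselect-prod -search.disableCache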
A possible solution is to allow the cached and live parts to overlap and merge the overlapping part.
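A rough illustration of that idea (this is not the actual VictoriaMetrics code; the point struct, the mergeCachedAndLive helper and the "live value wins unless it is NaN" rule are assumptions made for this sketch):

    // Sketch of merging a cached response part with a freshly computed ("live")
    // part when their time ranges overlap: keep cached points before the live
    // range, prefer live values inside the overlap, and fall back to the cached
    // value only when the live value is NaN.
    package main

    import (
        "fmt"
        "math"
    )

    // point is one (timestamp, value) pair of a series aligned to the query step.
    type point struct {
        ts  int64   // unix timestamp in milliseconds
        val float64 // math.NaN() means "no data at this step"
    }

    // mergeCachedAndLive combines a cached prefix with a live suffix.
    // Both slices are assumed to be sorted by ts and aligned to the same step.
    func mergeCachedAndLive(cached, live []point) []point {
        if len(live) == 0 {
            return cached
        }
        liveStart := live[0].ts
        merged := make([]point, 0, len(cached)+len(live))
        // Keep cached points that end strictly before the live part begins.
        for _, p := range cached {
            if p.ts < liveStart {
                merged = append(merged, p)
            }
        }
        // Index cached points inside the overlap for NaN back-filling.
        overlap := make(map[int64]float64)
        for _, p := range cached {
            if p.ts >= liveStart {
                overlap[p.ts] = p.val
            }
        }
        for _, p := range live {
            if math.IsNaN(p.val) {
                if v, ok := overlap[p.ts]; ok {
                    p.val = v // fall back to the cached value for this step
                }
            }
            merged = append(merged, p)
        }
        return merged
    }

    func main() {
        cached := []point{{1000, 1}, {2000, 2}, {3000, 3}}
        live := []point{{2000, math.NaN()}, {3000, 3.5}, {4000, 4}}
        fmt.Println(mergeCachedAndLive(cached, live)) // [{1000 1} {2000 2} {3000 3.5} {4000 4}]
    }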