VictoriaMetrics: Some metrics are lost and I don't know how to debug it
Is your question request related to a specific component?
I don’t know where the data is lost; I can’t find any error and would like guidance on how to proceed with my debugging.
Describe the question in detail
I have a central single-node VictoriaMetrics instance and a few decentralized vmagents.
Some values are scraped correctly but cannot be found in the database with a query.
```mermaid
graph TD;
    vmagent1 --> vmauth;
    vmagent2 --> vmauth;
    vmagent3 --> vmauth;
    vmauth --> victoria-metrics;
```
This happens across different metrics in different scrapes; I will take one just as an example, but I have found no common pattern among the “lost” metrics. They are all over the place (and rare enough, but still bothersome/alarming).
On http://vmagent1:8429/targets I can see:
| Endpoint | State | Labels | Debug relabeling | Scrapes | Errors | Last Scrape | Duration | Samples | Last error |
|---|---|---|---|---|---|---|---|---|---|
| http://vmagent1:9100/metrics (response) | UP | {instance="vmagent1:9100", job="node_exporter"} | target metrics | 887 | 0 | 12372ms ago | 4ms | 185 | |
And the number of values seems to be correct:
```console
% xh http://localhost:9100/metrics | egrep -cv '^#'
185
```
The server’s `vmagent.log` reports no errors in writing, and I get updated values for most of those 185 series. But… `count({datacenter="DC1",job="node_exporter"})` returns 122 (in the built-in /vmui and also in Grafana Explore).
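For completeness, the two numbers can be compared directly against the HTTP APIs; a minimal sketch, assuming the single-node instance is reachable as `victoria-metrics:8428` (the hostnames and ports here are placeholders for my setup):

```sh
# Series exposed by node_exporter itself (comment lines excluded):
curl -s http://vmagent1:9100/metrics | grep -cv '^#'

# Series the storage returns for the same job:
curl -s 'http://victoria-metrics:8428/api/v1/query' \
  --data-urlencode 'query=count({datacenter="DC1",job="node_exporter"})'

# Metric names the storage knows for the job, to diff against the scrape output:
curl -sG 'http://victoria-metrics:8428/api/v1/label/__name__/values' \
  --data-urlencode 'match[]={datacenter="DC1",job="node_exporter"}'
```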
The job label is added by vmagent scraping rules:
```yaml
- job_name: 'node_exporter'
  static_configs:
    - targets:
        - vmagent1:9100
```
The datacenter label is forced by vmauth:
```yaml
users:
  - username: "user1"
    password: "pass1"
    url_map:
      - src_paths: ["/api/v1/write"]
        url_prefix: "http://localhost:8428?extra_label=datacenter=DC1"
```
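To check whether the `extra_label` query arg itself could be hiding series, a sample can be written past vmauth straight to the storage; a small sketch, where the metric name `debug_extra_label_test` and the hostname are made up for this test:

```sh
# Write one sample straight to the single-node instance, using the same
# extra_label that vmauth appends on /api/v1/write:
echo 'debug_extra_label_test{source="manual"} 1' | curl -s --data-binary @- \
  'http://victoria-metrics:8428/api/v1/import/prometheus?extra_label=datacenter=DC1'

# The sample should then be searchable with the datacenter label attached:
curl -s 'http://victoria-metrics:8428/api/v1/query' \
  --data-urlencode 'query=debug_extra_label_test{datacenter="DC1"}'
```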
One of the missing metrics, which I noticed because it is used in Grafana to populate the pop-ups in the upper-left corner, is `node_uname_info`: it just doesn’t show up, even though it is among the 185 series returned by node_exporter’s :9100 output.
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Comments: 26 (12 by maintainers)
Commits related to this issue
- app/vmagent/remotewrite: fix data race when extra labels are added to samples before sending them to multiple remote storage systems See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4972 — committed to VictoriaMetrics/VictoriaMetrics by valyala 10 months ago
- lib/storage: log fatal error inside searchMetricName() instead of propagating it to the caller This simplifies the code a bit at searchMetricName() and searchMetricNameWithCache() call sites This is... — committed to VictoriaMetrics/VictoriaMetrics by valyala 9 months ago
- lib/storage: log fatal error inside searchMetricName() instead of propagating it to the caller This simplifies the code a bit at searchMetricName() and searchMetricNameWithCache() call sites This is... — committed to AndrewChubatiuk/VictoriaMetrics by valyala 9 months ago
@lapo-luchini, thanks for the update! Then closing this issue as resolved.
The missing metric is back! 😃
count() of metrics has increased, and I still see some batches of new metricIDs:

The fact that they are all sequential leads me to think that maybe what is happening is what was described:

I’m now switching back to the standard release.
This metric is incremented only during queries which select time series with missing `metricID->TSID` entries in the `indexdb`.

@lapo-luchini and @salarali, you can build single-node VictoriaMetrics from the latest commit in the branch https://github.com/VictoriaMetrics/VictoriaMetrics/tree/issue-4972-add-missing-tsids (currently this is 641db141899e7b266b3d02c48ad58c12ae86cc8c) according to these docs and try running it for a while - it should re-create missing `metricID->TSID` entries in the `indexdb`. It should log the following message per each missing entry:

Note that it may create duplicate `MetricID->TSID` entries for newly registered time series. This should be OK.
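For anyone following along, building that branch looks roughly like this; a sketch based on the general build docs, assuming Go and make are installed:

```sh
# Clone and build the single-node binary from the branch with the fix candidate:
git clone https://github.com/VictoriaMetrics/VictoriaMetrics.git
cd VictoriaMetrics
git checkout issue-4972-add-missing-tsids   # 641db141899e7b266b3d02c48ad58c12ae86cc8c at the time of writing
make victoria-metrics                       # the resulting binary is placed under ./bin/
```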
Ohh, thanks for the analysis! This might be useful as a reference if anybody else had done that downgrade by mistake too, I hope.
That’d be nice! Just a few minutes should be enough for all scrapes to arrive. If you have a small patch (or a fork / branch) I can compile it myself with no problem. (VM 1.93 is not yet officially on FreeBSD and I’m trying to help with that, so I’m compiling it myself just to use it anyway.)
The provided traces show that VictoriaMetrics cannot find the `TSID` entry for the `metricID` associated with the `node_uname_info` time series, while it successfully finds the `TSID` entry for the `metricID` associated with the `node_time_seconds` time series.

`metricID` is a 64-bit number which uniquely identifies every time series stored in VictoriaMetrics. `TSID` is a data structure which is used as a sorting and search key for data blocks stored on disk. It contains multiple other fields such as MetricGroupID, JobID and InstanceID in addition to MetricID. VictoriaMetrics sorts time series blocks by `TSID` in order to reduce the number of disk read operations needed to read time series data for the same metric name, the same `job` label and/or the same `instance` label.

VictoriaMetrics increments the `vm_missing_tsids_for_metric_id_total` metric every time it cannot find a `TSID` by the given `metricID`, since this is expected for a few seconds after registering a new time series, while the created `metricID->TSID` entry isn’t yet available for search. It looks like in your case the `metricID->TSID` entry is permanently missing from the `indexdb`. This case is unexpected and may indicate partial loss of `indexdb` data. The data may be lost there if some sub-directories under the `<-storageDataPath>/indexdb/<indexDBGeneration>` directory were unexpectedly deleted. These sub-directories are also known as indexdb parts. VictoriaMetrics itself shouldn’t delete these sub-directories unexpectedly - it deletes them only during background merges into bigger parts, and it makes sure that the source directories are deleted only after the resulting directory is created and completely saved to persistent storage. It is likely the sub-directories were deleted during the accidental downgrade from v1.91 to v1.87 mentioned here.

How to restore the deleted `metricID->TSID` entries? We can provide a custom VictoriaMetrics binary which re-creates missing `metricID->TSID` entries during data ingestion. The needed `TSID` entry is located during data ingestion, when VictoriaMetrics searches for the `TSID` by the metric name plus all the labels of the ingested sample (this is known as `MetricName` in the VictoriaMetrics source code).

You can run this VictoriaMetrics binary for a few scrape intervals, in order to make sure `metricID->TSID` entries are re-created for all the active time series, and then return to the official v1.93.4 VictoriaMetrics binary. It is unsafe to run the custom binary for a long period of time, since it may create duplicate `metricID->TSID` entries for newly ingested time series whose entries aren’t yet visible for search.

There are chances that some other information is missing from the `indexdb` after the unexpected deletion of its parts, so other issues may arise in the future.
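For reference, the counter and the directories mentioned above can be inspected as follows; a sketch, where the port and the `-storageDataPath` value are assumptions for a default single-node setup:

```sh
# Watch the missing-TSID counter on the single-node instance:
curl -s http://victoria-metrics:8428/metrics | grep vm_missing_tsids_for_metric_id_total

# Inspect the indexdb parts on disk (substitute your actual -storageDataPath):
ls -l /path/to/victoria-metrics-data/indexdb/*/
```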
Ah, ok, that 191 count is indeed right: those are the 185 from node_exporter port :9100 plus the 6 generated by the scrape itself (which have the very same labels):

Happy hunting!
If more tests are useful on my side, just let me know.
Upgraded `vmagent` to v1.93.4: changed nothing (of course).

Added a local `victoria-metrics` on the vmagent1 server as a second remoteWrite; it does receive the metric. Strangely, `count({job="node_exporter"})` reports 191 instead of 185, but `node_uname_info` does show up correctly.

I then added `-remoteWrite.url=http://localhost:8428/api/v1/write?extra_label=datacenter=DC1` and it still works correctly (I also tried using a different value).

So it seems that reproducing the bug needs historical data, and I guess that `vmauth`’s role in the data is small enough that it can be ignored. (?)
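For reference, the second-remoteWrite setup being discussed boils down to giving vmagent an additional `-remoteWrite.url`; a minimal sketch, where the binary name, config path, hosts and ports are placeholders for this setup:

```sh
# Keep writing to the existing chain through vmauth, and additionally replicate
# the same scraped data to a fresh local single-node instance for comparison:
./vmagent-prod \
  -promscrape.config=/etc/vmagent/scrape.yml \
  -remoteWrite.url=http://vmauth:8427/api/v1/write \
  -remoteWrite.url='http://localhost:8428/api/v1/write?extra_label=datacenter=DC1'
```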
It looks like you hit another issue then. There are high chances the issue is in the `victoria-metrics` source code.

Could you start replicating the scraped data to a new `victoria-metrics` instance, with the `vmagent` built from https://github.com/VictoriaMetrics/VictoriaMetrics/commit/0bbc6a5b43209cb3b9c64bc2fbc9b33ef46b26df ? E.g. you need to specify an additional `-remoteWrite.url` command-line flag for the `vmagent`, so it replicates the scraped data to the new `victoria-metrics` instance. Then verify whether the new `victoria-metrics` instance v1.93.3 returns the correct number from `count({job="node_exporter"})` and whether it properly locates the `node_uname_info` metric. This should help determine whether the issue is reproducible on a fresh installation of `victoria-metrics` or whether it needs historical data.

If everything is OK, then try adding `?extra_label=datacenter=DC1` to the `-remoteWrite.url` of the new `victoria-metrics` and verify again whether the new `victoria-metrics` returns the expected results. This should help determine whether the issue is related to the addition of extra labels via the `extra_label` query arg during data ingestion.

Sure, I’ll build and upgrade that vmagent ASAP (probably on Monday).
But I have two doubts: