VictoriaMetrics: Some metrics are lost and I don't know how to debug it

Is your question request related to a specific component?

I don’t know where the data is lost, I can’t find any error, and I would like guidance on how to proceed with debugging.

Describe the question in detail

I have a central single-node VictoriaMetrics instance and a few decentralized vmagents. Some values are scraped correctly but cannot be found in the database with a query.

graph TD;
  vmagent1 --> vmauth;
  vmagent2 --> vmauth;
  vmagent3 --> vmauth;
  vmauth --> victoria-metrics;

This happens across different metrics in different scrape jobs; I will take one just as an example, but I have found no common pattern among the “lost” metrics: they are scattered around (and rare, but still bothersome/alarming).

On http://vmagent1:8429/targets I can see:

Endpoint:         http://vmagent1:9100/metrics (response)
State:            UP
Labels:           {instance="vmagent1:9100", job="node_exporter"}
Debug relabeling: target | metrics
Scrapes:          887
Errors:           0
Last Scrape:      12372ms ago
Duration:         4ms
Samples:          185
Last error:       (none)

And the number of values seems to be correct:

% xh http://localhost:9100/metrics | egrep -cv '^#'
185

The vmagent.log on that server reports no write errors, and I get updated values for most of those 185 series. But… count({datacenter="DC1",job="node_exporter"}) returns 122 (in the built-in /vmui and also in Grafana Explore).
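For reference, this is roughly how the two counts can be compared from the command line (the victoria-metrics hostname and the default :8428 port are placeholders for the central single-node instance):

# series count as seen by the central instance, with the label added by vmauth
curl -s 'http://victoria-metrics:8428/api/v1/query' \
  --data-urlencode 'query=count({datacenter="DC1",job="node_exporter"})'
# same target, selected by instance label instead of the datacenter filter
curl -s 'http://victoria-metrics:8428/api/v1/query' \
  --data-urlencode 'query=count({job="node_exporter",instance="vmagent1:9100"})'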

The job label is added by vmagent scraping rules:

- job_name: 'node_exporter'
  static_configs:
  - targets:
    - vmagent1:9100

The datacenter label is forced by vmauth:

users:
- username: "user1"
  password: "pass1"
  url_map:
  - src_paths: ["/api/v1/write"]
    url_prefix: "http://localhost:8428?extra_label=datacenter=DC1"

One missing metric I noticed, because it is used in Grafana to populate the pop-ups in the upper-left corner, is node_uname_info: it just doesn’t show up (even though it is among the 185 series returned by node_exporter on :9100).

Troubleshooting docs

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 26 (12 by maintainers)

Most upvoted comments

@lapo-luchini, thanks for the update! Closing this issue as resolved, then.

The missing metric is back! 😃

The count() of metrics has increased; I still see some batches of new MetricIDs being created:

2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905606
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905607
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905608
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905609
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905610
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905611
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905612
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905613
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905614
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905615
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905616
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905617
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905618
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905619
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905620
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905621
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905622
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905623
2023-09-22T12:42:32.897Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905624
2023-09-22T12:43:11.491Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905625
2023-09-22T12:43:18.640Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905626
2023-09-22T12:43:18.640Z        info    lib/storage/storage.go:1914     Creating missing MetricID->TSID entry for MetricID=1695385736381905627

The fact that they are all sequential leads me to think that maybe what is happening is this, as described:

You can run this VictoriaMetrics binary during a few scrape intervals in order to make sure metricID->TSID entries are re-created for all the active time series, and then return back to the official v1.93.4 VictoriaMetrics binary. It is unsafe to run the custom binary for long period of time, since it may create duplicate metricID->TSID entries for newly ingested time series when these entries aren’t visible for search yet.

I’m now switching back to the standard release.

I don’t see the vm_missing_tsids_for_metric_id_total metric being incremented at all.

This metric is incremented only during queries that select time series with missing metricID->TSID entries in the indexdb.
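For reference, the counter can be checked directly on the single-node instance’s own /metrics page (hostname and port are placeholders for the central instance):

curl -s http://victoria-metrics:8428/metrics | grep vm_missing_tsids_for_metric_id_total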

@lapo-luchini and @salarali, you can build single-node VictoriaMetrics from the latest commit in the branch https://github.com/VictoriaMetrics/VictoriaMetrics/tree/issue-4972-add-missing-tsids (currently this is 641db141899e7b266b3d02c48ad58c12ae86cc8c) according to these docs and try running it for a while: it should re-create missing metricID->TSID entries in the indexdb. It should log the following message for each missing entry:

Creating missing MetricID->TSID entry for MetricID=...

Note that it may create duplicate MetricID->TSID entries for newly registered time series. This should be OK.
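A rough sketch of the build steps, assuming a local Go toolchain and that the branch/commit is still available; the storage path is a placeholder for the existing data directory:

git clone https://github.com/VictoriaMetrics/VictoriaMetrics.git
cd VictoriaMetrics
git checkout 641db141899e7b266b3d02c48ad58c12ae86cc8c
make victoria-metrics   # builds bin/victoria-metrics with the local Go toolchain
# stop the official binary, then start the patched one against the existing data:
./bin/victoria-metrics -storageDataPath=/path/to/victoria-metrics-data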

It is likely the subdirectories were deleted during the accidental downgrade from v1.91 to v1.87 mentioned here.

Ohh, thanks for the analysis! I hope this might be useful as a reference in case anybody else has done that downgrade by mistake too.

How to restore the deleted metricID->TSID entries? We can provide a custom VictoriaMetrics binary, which will re-create missing metricID->TSID entries during data ingestion.

That’d be nice! Just a few minutes should be enough for all scrapes to arrive. If you have a small patch (or a fork / branch) I can compile it myself with no problem. (VM 1.93 is not yet officially on FreeBSD and I’m trying to help with that, so I’m compiling it myself just to use it anyway.)

The provided traces show that VictoriaMetrics cannot find the TSID entry for the metricID associated with the node_uname_info time series, while it successfully finds the TSID entry for the metricID associated with the node_time_seconds time series.

  • The metricID is a 64-bit number that uniquely identifies every time series stored in VictoriaMetrics.
  • The TSID is a data structure used as a sorting and search key for data blocks stored on disk. It contains multiple other fields, such as MetricGroupID, JobID and InstanceID, in addition to MetricID. VictoriaMetrics sorts time series blocks by TSID in order to reduce the number of disk read operations needed to read time series data for the same metric name, the same job label and/or the same instance label.

VictoriaMetrics increments the vm_missing_tsids_for_metric_id_total metric every time it cannot find the TSID for a given metricID, since this is expected when a newly created metricID->TSID entry isn’t yet visible to search during the few seconds after a new time series is registered. It looks like in your case the metricID->TSID entry is permanently missing from the indexdb. This case is unexpected and may indicate partial loss of indexdb data. The data may be lost there if some sub-directories under the <-storageDataPath>/indexdb/<indexDBGeneration> directory were unexpectedly deleted. These sub-directories are also known as indexdb parts. VictoriaMetrics itself shouldn’t delete these sub-directories unexpectedly: it deletes them only during background merges into bigger parts, and it makes sure the source directories are deleted only after the resulting directory has been created and completely saved to persistent storage. It is likely the subdirectories were deleted during the accidental downgrade from v1.91 to v1.87 mentioned here.
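A quick way to eyeball those part sub-directories, assuming the default on-disk layout described above; the storage path is a placeholder:

# one directory per indexdb generation
ls /path/to/victoria-metrics-data/indexdb/
# the parts inside each generation
ls /path/to/victoria-metrics-data/indexdb/*/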

How to restore the deleted metricID->TSID entries? We can provide a custom VictoriaMetrics binary, which will re-create missing metricID->TSID entries during data ingestion. The needed TSID entry is located during data ingestion when VictoriaMetrics searches for TSID by the metric name plus all the labels for the ingested sample (this is known as MetricName in VictoriaMetrics source code).

You can run this VictoriaMetrics binary during a few scrape intervals in order to make sure metricID->TSID entries are re-created for all the active time series, and then return back to the official v1.93.4 VictoriaMetrics binary. It is unsafe to run the custom binary for long period of time, since it may create duplicate metricID->TSID entries for newly ingested time series when these entries aren’t visible for search yet.

There are chances that some other information is missing in the indexdb after the unexpected deletion of its parts. So other issues may arise in the future.

Could you try exporting metrics for this target via the /federate API?
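Something along these lines should work, assuming the central instance listens on the default :8428; the selector mirrors the target labels above:

curl -s -G 'http://victoria-metrics:8428/federate' \
  --data-urlencode 'match[]={job="node_exporter",instance="vmagent1:9100"}' \
  | grep -c '^node_uname_info'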

Ah, ok, that 191 count is indeed right: those are the 185 from node_exporter on port :9100 plus the 6 series generated by the scrape itself, which carry the very same labels (a quick sanity check follows the list):

scrape_duration_seconds
scrape_samples_post_metric_relabeling
scrape_samples_scraped
scrape_series_added
scrape_timeout_seconds
up
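A quick sanity check against the instance that reports 191: exclude the auto-generated series from the count and it should drop back to the node_exporter total (host/port are placeholders for that instance):

curl -s 'http://localhost:8428/api/v1/query' \
  --data-urlencode 'query=count({job="node_exporter", __name__!~"scrape_.*|up"})'
# expected result: 185 (191 minus the 6 series listed above)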

I suspect the issue may be related to the refactoring

Happy hunting!

If more tests are useful on my side, just let me know.

Upgraded vmagent to v1.93.4; it changed nothing (of course).

Added a local victoria-metrics instance on the vmagent1 server and added it as a second remoteWrite target: it does receive the metric. Strangely, count({job="node_exporter"}) reports 191 instead of 185, but node_uname_info does show up correctly.

I then added -remoteWrite.url=http://localhost:8428/api/v1/write?extra_label=datacenter=DC1 and it still works correctly (I also tried using a different value).

So it seems that reproducing the bug needs historical data, and I guess that vmauth’s role in the data path is so small that it can be ignored. (?)

I have only one remoteWrite URL set. I’ve seen the problem starting with 1.93.0, which should be before that issue. Unfortunately count({datacenter="DC1",job="node_exporter"}) is still 122 instead of 185 (and that uname metric is still missing).

It looks like you hit another issue then. There is a high chance the issue is in the victoria-metrics source code.

Any way I can help with debug?

Could you start replicating the scraped data to a new victoria-metrics instance, with the vmagent built from https://github.com/VictoriaMetrics/VictoriaMetrics/commit/0bbc6a5b43209cb3b9c64bc2fbc9b33ef46b26df ? E.g. you need to specify an additional -remoteWrite.url command-line flag for the vmagent, so it replicates the scraped data to the new victoria-metrics instance. Then verify whether the new victoria-metrics instance (v1.93.3) returns the correct number for count({job="node_exporter"}) and whether it properly locates the node_uname_info metric. This should help determine whether the issue reproduces on a fresh installation of victoria-metrics or whether it needs historical data.

If everything is OK, then try adding ?extra_label=datacenter=DC1 to the -remoteWrite.url of the new victoria-metrics instance and verify again whether it returns the expected results. This should help determine whether the issue is related to the addition of extra labels via the extra_label query arg during data ingestion.
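A sketch of the suggested vmagent setup, with placeholder hosts and paths and default ports assumed; auth flags for the vmauth URL are omitted:

vmagent \
  -promscrape.config=/etc/vmagent/scrape.yml \
  -remoteWrite.url=https://vmauth.example.org/api/v1/write \
  -remoteWrite.url='http://new-victoria-metrics:8428/api/v1/write?extra_label=datacenter=DC1'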

Sure, I’ll build and upgrade that vmagent ASAP (probably on Monday).

But I have two doubts:

  1. I have only one remoteWrite URL set
  2. I’ve seen the problem starting with 1.93.0, which should be before that issue