VictoriaMetrics: vmctl: last request takes a long time and seems to do nothing
Is your question request related to a specific component?
vmctl native migration between victoria-metrics-single-server instances
Describe the question in detail
Sometimes when running vmctl native, I notice that there may be a final request that takes a long time (e.g. 30 minutes when migrating 1 week of data with --vm-concurrency 25). During this time there is no ingestion at the destination, according to grafana metrics.
For example:
$ ./vmctl-prod vm-native \
--vm-native-src-addr=http://gx-vmetrics-victoria-metrics-single-server:8428 \
--vm-native-dst-addr=http://gx-vmetrics2-victoria-metrics-single-server:8428 \
--vm-native-filter-time-start='2023-10-25T00:00:00Z' \
--vm-native-filter-time-end='2023-11-01T00:00:00Z' \
--vm-concurrency 25
2023/11/21 16:29:07 Initing import process from "http://gx-vmetrics-victoria-metrics-single-server:8428/api/v1/export/native" to "http://gx-vmetrics2-victoria-metrics-single-server:8428/api/v1/import/native" with filter
filter: match[]={__name__!=""}
start: 2023-10-25T00:00:00Z
end: 2023-11-01T00:00:00Z
2023/11/21 16:29:07 Exploring metrics...
Found 1381 metrics to import. Continue? [Y/n]
2023/11/21 16:29:27 Selected time range will be split into 2 ranges according to "month" step. Requests to make: 2762
2023/11/21 20:38:34 Import finished!
2023/11/21 20:38:34 VictoriaMetrics importer stats:
time spent while importing: 4h9m27.34464171s;
total bytes: 371.2 GB;
bytes/s: 24.8 MB;
requests: 2762;
requests retries: 0;
2023/11/21 20:38:34 Total time: 4h9m27.345888146s
During the run, I noticed that the status was at the final request for about 30 minutes.
According to grafana, ingestion at the destination actually stopped at about 20:00.
Is the final request hanging somehow? Is there some bug in vmctl?
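For anyone who wants to confirm the stall outside of Grafana, here is a minimal Go sketch that polls the destination's /metrics endpoint and prints its vm_rows_inserted_total counters, so you can see whether ingestion is still progressing during the long final request. The metric name and the destination address are assumptions based on single-node VictoriaMetrics; adjust them for your setup.

```go
// Hedged sketch: periodically scrape the destination's /metrics endpoint and
// print the vm_rows_inserted_total counter lines. If the counters stop
// increasing, ingestion has stalled. Metric name and URL are assumptions.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	const metricsURL = "http://gx-vmetrics2-victoria-metrics-single-server:8428/metrics"

	for {
		resp, err := http.Get(metricsURL)
		if err != nil {
			fmt.Println("scrape failed:", err)
		} else {
			sc := bufio.NewScanner(resp.Body)
			for sc.Scan() {
				line := sc.Text()
				if strings.HasPrefix(line, "vm_rows_inserted_total") {
					fmt.Printf("%s %s\n", time.Now().Format(time.RFC3339), line)
				}
			}
			resp.Body.Close()
		}
		time.Sleep(30 * time.Second)
	}
}
```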
Troubleshooting docs
About this issue
- State: open
- Created 7 months ago
- Reactions: 1
- Comments: 75 (61 by maintainers)
Just finished migration of 1y worth of data in the following setup:

./vmctl-prod vm-native --vm-native-src-addr=http://source-cluster --vm-native-dst-addr=http://destination-vm-single:8428 --vm-concurrency=16 --vm-native-filter-time-start='-1y'

vmctl was running with ResponseHeaderTimeout: time.Minute and with retries if ImportPipe fails; no reader closing. The migration took 4h, and it reported 2 request retries.
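For reference, ResponseHeaderTimeout is a standard net/http Transport setting in Go; a minimal sketch of a client configured that way is below. The surrounding wiring is only an illustration, not the actual vmctl code.

```go
package main

import (
	"net/http"
	"time"
)

// newImportClient returns an HTTP client that gives up on a request if the
// server does not start responding within a minute, mirroring the
// ResponseHeaderTimeout: time.Minute patch mentioned above.
// This is a sketch, not the actual vmctl code.
func newImportClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			ResponseHeaderTimeout: time.Minute,
		},
	}
}

func main() {
	_ = newImportClient()
}
```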
The destination instance resource usage looks like the following:
CPU usage is flat, with some periodic gaps. I think the gaps are caused by unevenness of the imported metrics (some are heavier than others).
Here’s what the destination vm-single looks like:
The ingestion speed was on average > 20 million samples/s. In total, it accepted 256 billion data samples across 16 million unique time series.
I haven’t observed delays in ingestion like those described here. Once vmctl reached 100%, it immediately exited. Actually, it printed a journal report which crashed the ssh tab, so I’m going to re-run this once again.
The source instance didn’t look busy at all
I’m going to make another test to run source, destination and vmctl on separate instances to involve network even more (and properly save vmctl journal log).
Looks good.
To summarize, there are two significant vmctl patches: the ResponseHeaderTimeout: time.Minute change and the retries when ImportPipe fails.

Hello @belm0!
My bad! I didn’t check it properly. Apologies.
While you bear with us on this, could you try using the following patch?
This should ensure retries if you get an error at p.dst.ImportPipe. I tested it locally and it worked well.
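The patch itself isn't reproduced in this thread, so here is only a rough sketch of the idea (retrying when the import call returns an error). importWithRetries and its signature are hypothetical stand-ins; the real vmctl internals around p.dst.ImportPipe are structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// importWithRetries retries a failing import call a few times with a short
// linear backoff. importPipe is a hypothetical stand-in for p.dst.ImportPipe.
func importWithRetries(importPipe func() error, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = importPipe(); err == nil {
			return nil
		}
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
	}
	return fmt.Errorf("import failed after %d attempts: %w", attempts, err)
}

func main() {
	// Example: a call that always fails, just to exercise the retry loop.
	err := importWithRetries(func() error { return errors.New("broken pipe") }, 3)
	fmt.Println(err)
}
```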
It’s a case with two metrics having a long GotFirstResponseByte delay (4m, 10m).
vmctl_trace_out_3.txt
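For context, GotFirstResponseByte is a hook from Go's net/http/httptrace package, so a 4m or 10m delay means the server took that long to send the first byte of its response. Below is a minimal sketch of measuring such a delay; it is illustrative only, not the actual vmctl trace code, and the URL is a placeholder.

```go
// Hedged sketch: measure how long a request waits for the first response byte
// using net/http/httptrace. Illustrative only; the URL is a placeholder.
package main

import (
	"log"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	req, err := http.NewRequest(http.MethodGet, "http://localhost:8428/health", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Measured from just before the request is sent.
	start := time.Now()
	trace := &httptrace.ClientTrace{
		GotFirstResponseByte: func() {
			log.Printf("first response byte after %s", time.Since(start))
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
}
```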
My point, however, is that the variance in request durations may be what induces the problem. If we’re trying to reproduce the problem in my environment, I think the shape of the data should match (or we should actually use our data).
(I’m open to using Slack if it’s better for the rapid interaction.)
I don’t think it’s that specific bug, since vmctl and the VictoriaMetrics server are running on kernel 5.15 (kernel < 5.1 is required for the bug). By the way, we’re using VictoriaMetrics server version 1.93.5.
I added instrumentation to ~~ExportPipe()~~ ImportPipe(), and have found a few strange things. Here is the log from a run over a 2-day range, which had a 13 min. no-ingestion period at the end.

observations:

1. … (runSingle())
2. waiting for the ImportPipe() calls to finish is what causes the no-ingestion period, even though the corresponding requests have already completed (runSingle() reached the <-done statement)
3. some ImportPipe() durations greatly exceed the runSingle() duration. For example, container_open_sockets (22 min. vs 40 min.)
4. the ImportPipe() calls causing the exit delay correspond to relatively small requests. For example, the request for node_hwmon_pwm_target_temp was only 128 MB and 1 min. in duration (yet ImportPipe() ran for 16 min.)

Especially (3) and (4) seem odd, do you have any explanation?
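The instrumentation mentioned above isn't shown in the thread; for what it's worth, a hedged sketch of how such per-stage durations could be logged is below. The helper name and the stage labels are hypothetical, not vmctl's actual code.

```go
package main

import (
	"log"
	"time"
)

// timeStage logs how long fn took, tagged with a metric name and a stage
// label ("runSingle", "ImportPipe", ...). Hypothetical stand-in only.
func timeStage(metric, stage string, fn func() error) error {
	start := time.Now()
	err := fn()
	log.Printf("metric=%s stage=%s duration=%s err=%v", metric, stage, time.Since(start), err)
	return err
}

func main() {
	_ = timeStage("node_hwmon_pwm_target_temp", "ImportPipe", func() error {
		time.Sleep(10 * time.Millisecond) // placeholder for the real work
		return nil
	})
}
```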
Sorry, I missed the times in the graph and in the vmctl logs. I will try to check on our installation and see whether it is a bug or not.