VictoriaMetrics: vmctl: last request takes a long time and seems to do nothing

Is your question request related to a specific component?

vmctl native migration between victoria-metrics-single-server instances

Describe the question in detail

Sometimes when running vmctl vm-native, I notice that the final request can take a long time (e.g. 30 minutes when migrating 1 week of data with --vm-concurrency 25). During this time there is no ingestion at the destination, according to Grafana metrics.

For example:

$ ./vmctl-prod vm-native \
   --vm-native-src-addr=http://gx-vmetrics-victoria-metrics-single-server:8428 \
   --vm-native-dst-addr=http://gx-vmetrics2-victoria-metrics-single-server:8428 \
   --vm-native-filter-time-start='2023-10-25T00:00:00Z' \
   --vm-native-filter-time-end='2023-11-01T00:00:00Z' \
   --vm-concurrency 25
2023/11/21 16:29:07 Initing import process from "http://gx-vmetrics-victoria-metrics-single-server:8428/api/v1/export/native" to "http://gx-vmetrics2-victoria-metrics-single-server:8428/api/v1/import/native" with filter
	filter: match[]={__name__!=""}
	start: 2023-10-25T00:00:00Z
	end: 2023-11-01T00:00:00Z
2023/11/21 16:29:07 Exploring metrics...
Found 1381 metrics to import. Continue? [Y/n]
2023/11/21 16:29:27 Selected time range will be split into 2 ranges according to "month" step. Requests to make: 2762
2023/11/21 20:38:34 Import finished!
2023/11/21 20:38:34 VictoriaMetrics importer stats:
  time spent while importing: 4h9m27.34464171s;
  total bytes: 371.2 GB;
  bytes/s: 24.8 MB;
  requests: 2762;
  requests retries: 0;
2023/11/21 20:38:34 Total time: 4h9m27.345888146s

During the run, I noticed that the progress status stayed at the final request for about 30 minutes.

According to Grafana, ingestion at the destination actually stopped at about 20:00.

[screenshot: Grafana graph of destination ingestion, 2023-11-21]

Is the final request hanging somehow? Is there some bug in vmctl?


About this issue

  • State: open
  • Created 7 months ago
  • Reactions: 1
  • Comments: 75 (61 by maintainers)

Most upvoted comments

Just finished migration of a 1y worth of data in the following setup:

  1. One 4vCPU 16GiB mem instance that runs x1 vmselect, x1 vmstorage, x1 vmctl. The storage is restored from one of our backups.
  2. One 8vCPU 32GiB mem instance that runs x1 vm-single.
  3. vmctl runs with the following command: ./vmctl-prod vm-native --vm-native-src-addr=http://source-cluster --vm-native-dst-addr=http://destination-vm-single:8428 --vm-concurrency=16 --vm-native-filter-time-start='-1y'
  4. It is a patched version of vmctl with ResponseHeaderTimeout: time.Minute and retries if ImportPipe fails. No reader closing. (A rough sketch of these two tweaks follows this list.)
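Roughly, the two tweaks look like this (a simplified sketch with illustrative names, not the actual vmctl patch; ResponseHeaderTimeout is the standard net/http Transport field):

package patchsketch

import (
	"fmt"
	"net/http"
	"time"
)

// newImportClient builds an HTTP client that gives up when the destination
// accepts the import request but never starts replying, which is the symptom
// reported in this issue.
func newImportClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			ResponseHeaderTimeout: time.Minute,
		},
	}
}

// retryImport retries doImport (a stand-in for vmctl's ImportPipe call) a few
// times with a fixed two-second pause between attempts.
func retryImport(doImport func() error, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = doImport(); err == nil {
			return nil
		}
		fmt.Printf("import attempt %d failed: %v; retrying\n", i+1, err)
		time.Sleep(2 * time.Second)
	}
	return err
}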

The migration took 4h, it had 2 request retries reported.

The destination instance resource usage looks like the following: [screenshot: destination instance resource usage]

CPU usage is flat, with some periodic gaps. I think the gaps are caused by unevenness in the imported metrics (some are heavier than others).

Here’s what the destination vm-single looks like: [screenshot: destination vm-single metrics]

The ingestion speed was on average > 20 million samples/s. In total, it accepted 256 billion data samples across 16 million unique time series.

I haven’t observed delays in ingestion like those described here. Once vmctl reached 100% it exited immediately. Actually, it printed a journal report which crashed the ssh tab, so I’m going to re-run this once again.

The source instance didn’t look busy at all: [screenshot: source instance resource usage]


I’m going to run another test with the source, destination and vmctl on separate instances, to involve the network even more (and to properly save the vmctl journal log).

Looks good.

To summarize, there are two significant vmctl patches:

  • close the export reader promptly - this has a large impact on concurrency effectiveness: with this patch we need less than half the concurrency value for the same ingestion throughput (a rough sketch of this follows the log line below)
  • add an import timeout for the header response (first bytes received), and fix retrying of the import init

2023-12-11T19:05:45.117Z error VictoriaMetrics/app/vmctl/backoff/backoff.go:56 got error: import request failed: unexpected error when performing request: Post "http://gx-vmetrics2-victoria-metrics-single-server:8428/api/v1/import/native": net/http: timeout awaiting response headers on attempt: 1; will retry in 2s
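The first change, reduced to its core, looks roughly like the sketch below (my own simplified illustration, not the actual vmctl code): close the export response body as soon as its bytes have been copied into the import pipe, instead of holding the export connection open while the destination is still flushing the import.

package patchsketch

import "io"

// copyExportToImport mirrors the "close export reader promptly" idea: release
// the export response body as soon as io.Copy is done.
func copyExportToImport(exportBody io.ReadCloser, pw *io.PipeWriter) error {
	_, copyErr := io.Copy(pw, exportBody)

	// Free the export-side connection right away so another concurrent
	// worker can reuse it.
	if cerr := exportBody.Close(); copyErr == nil {
		copyErr = cerr
	}

	// Propagate EOF (or the copy error) to the import request that reads
	// from the other end of the pipe.
	_ = pw.CloseWithError(copyErr)
	return copyErr
}

The second change (the response-header timeout) is what produced the retry in the log line above.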

Hello @belm0!

But retries aren’t reported-- is something wrong with the retry logic?

My bad! I didn’t check it properly. Apologies.

While you bear with us on this, could you try using the following patch?

--- a/app/vmctl/vm_native.go
+++ b/app/vmctl/vm_native.go
@@ -125,13 +125,9 @@ func (p *vmNativeProcessor) runSingle(ctx context.Context, f native.Filter, srcU
        }
 
        pr, pw := io.Pipe()
-       done := make(chan struct{})
+       importCh := make(chan error)
        go func() {
-               defer func() { close(done) }()
-               if err := p.dst.ImportPipe(ctx, dstURL, pr); err != nil {
-                       logger.Errorf("error initialize import pipe: %s", err)
-                       return
-               }
+               importCh <- p.dst.ImportPipe(ctx, dstURL, pr)
        }()
 
        w := io.Writer(pw)
@@ -153,9 +149,8 @@ func (p *vmNativeProcessor) runSingle(ctx context.Context, f native.Filter, srcU
        if err := pw.Close(); err != nil {
                return err
        }
-       <-done
-
-       return nil
+       // after writer has been closed, we wait for ImportPipe to finish
+       return <-importCh
 }

This should ensure retries if you get an error from p.dst.ImportPipe. I tested it locally and it worked well.
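For reference, here is the same pattern in isolation (a generic, runnable sketch rather than vmctl's actual code): the consumer of an io.Pipe runs in a goroutine and reports its error over a channel, which is read only after the writer side has been closed, so the error can be surfaced to the caller and retried.

package main

import (
	"fmt"
	"io"
	"strings"
)

// pump copies src into an io.Pipe whose reader is consumed by consume (a
// stand-in for the import HTTP request) running in a goroutine.
func pump(src io.Reader, consume func(io.Reader) error) error {
	pr, pw := io.Pipe()

	importCh := make(chan error, 1)
	go func() {
		// consume returns only after pw is closed or an error occurs.
		importCh <- consume(pr)
	}()

	if _, err := io.Copy(pw, src); err != nil {
		pw.CloseWithError(err)
		<-importCh
		return err
	}
	if err := pw.Close(); err != nil {
		return err
	}
	// Only after the writer has been closed do we wait for the consumer
	// and return its error, so the caller can retry on failure.
	return <-importCh
}

func main() {
	err := pump(strings.NewReader("native export payload"), func(r io.Reader) error {
		_, err := io.Copy(io.Discard, r) // stand-in for the import request
		return err
	})
	fmt.Println("pump finished, err =", err)
}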

Here’s a case where two metrics had a long GotFirstResponseByte delay (4m and 10m).

vmctl_trace_out_3.txt

519 : 2023-12-05 21:42:44	"{__name__=\"r1_sit_tilt_check_fail_count_total\"}": starting
519 : 2023-12-05 21:42:45		"{__name__=\"r1_sit_tilt_check_fail_count_total\"}"==>GetConn
519 : 2023-12-05 21:42:45		"{__name__=\"r1_sit_tilt_check_fail_count_total\"}"==>GotConn: {Conn:0xc00011e690 Reused:true WasIdle:true IdleTime:186.651124ms}
519 : 2023-12-05 21:42:45		"{__name__=\"r1_sit_tilt_check_fail_count_total\"}"==>WroteHeaders
519 : 2023-12-05 21:42:45		"{__name__=\"r1_sit_tilt_check_fail_count_total\"}"==>WroteRequest; err: {Err:<nil>}
519 : 2023-12-05 21:52:54		"{__name__=\"r1_sit_tilt_check_fail_count_total\"}"==>GotFirstResponseByte
519 : 2023-12-05 21:52:54	"{__name__=\"r1_sit_tilt_check_fail_count_total\"}": export took 351.002125ms; import took 10m9.65208981s; delay: 10m9.301087685s; bytesWritten: 5.3 MB
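(For context, a trace like this can be produced with Go's net/http/httptrace package; the sketch below is a generic illustration of that kind of instrumentation, not the exact code used to generate the file above.)

package tracesketch

import (
	"log"
	"net/http"
	"net/http/httptrace"
	"strings"
	"time"
)

// traceImportRequest attaches an httptrace.ClientTrace to a POST request so
// the connection events seen above (GetConn, GotConn, WroteRequest,
// GotFirstResponseByte) get logged with timestamps.
func traceImportRequest(url, body string) error {
	start := time.Now()
	trace := &httptrace.ClientTrace{
		GetConn: func(hostPort string) { log.Printf("==>GetConn %s", hostPort) },
		GotConn: func(info httptrace.GotConnInfo) { log.Printf("==>GotConn: %+v", info) },
		WroteRequest: func(info httptrace.WroteRequestInfo) {
			log.Printf("==>WroteRequest; err: %v", info.Err)
		},
		GotFirstResponseByte: func() {
			// A long gap between WroteRequest and this event is the
			// "delay" reported in the trace above.
			log.Printf("==>GotFirstResponseByte after %v", time.Since(start))
		},
	}

	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(body))
	if err != nil {
		return err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}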

So it could be that one metric contains around 10 series while another has 1e6 series; the time taken to migrate them will be very different. But the expectation is that there is no huge imbalance in the source TSDB, and that there are enough heavy metric names to keep all of vmctl's concurrent workers busy.

My point, however, is that the variance in request durations may somehow induce the problem. If we’re talking about trying to reproduce the problem from my environment, I think the shape of the data should match (or you could actually use our data).

(I’m open to using Slack if it’s better for the rapid interaction.)

FYI, this may be related to an issue in Go: golang/go#63795. That issue has been fixed in Go 1.21.5. The next release of VictoriaMetrics components will use this Go version.

I don’t think it’s that specific bug, since vmctl and the VictoriaMetrics server are running on kernel 5.15 (kernel < 5.1 is required to trigger that bug).

By the way, we’re using VictoriaMetrics server version 1.93.5.

I added instrumentation to ImportPipe() and found a few strange things.

Here is the log from a run over a 2-day range, which had a 13-minute no-ingestion period at the end.

observations:

  1. ingestion stops at the final “Request complete” message (logged from runSingle())
  2. waiting for all ImportPipe() calls to finish is what causes the no-ingestion period, even though the corresponding requests have already completed (runSingle() reached the <-done statement)
  3. some of the ImportPipe() durations greatly exceed the corresponding runSingle() duration. For example, container_open_sockets (request: 22 min., ImportPipe(): 40 min.)
  4. some of the ImportPipe() calls causing the exit delay correspond to relatively small requests. For example, the request for node_hwmon_pwm_target_temp was only 128 MB and took about 1 minute (yet ImportPipe() ran for 16 min.)

Especially (3) and (4) seem odd; do you have any explanation?

...
2023/11/30 05:59:02 Request complete ({__name__="container_open_sockets"}), 1.9 GB bytes, duration 22m23.584451459s
...
2023/11/30 06:05:53 Request complete ({__name__="node_hwmon_pwm_target_temp"}), 128.7 MB bytes, duration 1m10.736706706s
...
2023/11/30 06:10:05 Request complete ({__name__="node_network_receive_drop_total"}), 165.7 MB bytes, duration 1m23.876080877s
...
2023/11/30 06:15:22 Request complete ({__name__="node_scrape_collector_success"}), 533.0 MB bytes, duration 4m54.504617736s
...
2023/11/30 06:16:39 Request complete ({__name__="node_systemd_service_restart_total"}), 1.3 GB bytes, duration 6m2.586585396s
2023/11/30 06:16:51 Request complete ({__name__="node_systemd_unit_start_time_seconds"}), 1.6 GB bytes, duration 6m10.991207637s
2023/11/30 06:16:55 ImportPipe({__name__="node_systemd_service_restart_total"}) complete in 6m13.171096283s
2023/11/30 06:17:05 ImportPipe({__name__="container_open_sockets"}) complete in 40m15.393500968s
2023/11/30 06:17:12 ImportPipe({__name__="node_network_receive_drop_total"}) complete in 8m30.450047972s
2023/11/30 06:17:29 ImportPipe({__name__="node_systemd_unit_start_time_seconds"}) complete in 6m41.804505092s
2023/11/30 06:20:23 ImportPipe({__name__="node_hwmon_pwm_target_temp"}) complete in 15m40.548279139s
2023/11/30 06:29:22 ImportPipe({__name__="node_scrape_collector_success"}) complete in 18m52.947625871s
2023/11/30 06:29:22 Import finished!
2023/11/30 06:29:22 VictoriaMetrics importer stats:
  time spent while importing: 52m50.101286323s;
  total bytes: 115.2 GB;
  bytes/s: 36.3 MB;
  requests: 1186;
  requests retries: 0;
2023/11/30 06:29:22 Total time: 52m50.102963452s
diff --git a/app/vmctl/native/client.go b/app/vmctl/native/client.go
index bd22bc6b7..d4d1fb8f9 100644
--- a/app/vmctl/native/client.go
+++ b/app/vmctl/native/client.go
@@ -5,7 +5,9 @@ import (
 	"encoding/json"
 	"fmt"
 	"io"
+	"log"
 	"net/http"
+	"time"
 
 	"github.com/VictoriaMetrics/VictoriaMetrics/app/vmctl/auth"
 )
@@ -71,12 +73,12 @@ func (c *Client) Explore(ctx context.Context, f Filter, tenantID string) ([]stri
 }
 
 // ImportPipe uses pipe reader in request to process data
-func (c *Client) ImportPipe(ctx context.Context, dstURL string, pr *io.PipeReader) error {
+func (c *Client) ImportPipe(ctx context.Context, dstURL string, filterMatch string, pr *io.PipeReader) error {
 	req, err := http.NewRequestWithContext(ctx, http.MethodPost, dstURL, pr)
 	if err != nil {
 		return fmt.Errorf("cannot create import request to %q: %s", c.Addr, err)
 	}
+	startTime := time.Now()
 	importResp, err := c.do(req, http.StatusNoContent)
 	if err != nil {
 		return fmt.Errorf("import request failed: %s", err)
@@ -84,6 +86,7 @@ func (c *Client) ImportPipe(ctx context.Context, dstURL string, pr *io.PipeReade
 	if err := importResp.Body.Close(); err != nil {
 		return fmt.Errorf("cannot close import response body: %s", err)
 	}
+	log.Printf("ImportPipe(%s) complete in %v", filterMatch, time.Since(startTime))
 	return nil
 }
 
diff --git a/app/vmctl/vm_native.go b/app/vmctl/vm_native.go
index 31bdb6b38..97e7cd021 100644
--- a/app/vmctl/vm_native.go
+++ b/app/vmctl/vm_native.go
@@ -114,6 +114,7 @@ func (p *vmNativeProcessor) do(ctx context.Context, f native.Filter, srcURL, dst
 }
 
 func (p *vmNativeProcessor) runSingle(ctx context.Context, f native.Filter, srcURL, dstURL string, bar *pb.ProgressBar) error {
+	startTime := time.Now()
 	reader, err := p.src.ExportPipe(ctx, srcURL, f)
 	if err != nil {
 		return fmt.Errorf("failed to init export pipe: %w", err)
@@ -128,7 +129,7 @@ func (p *vmNativeProcessor) runSingle(ctx context.Context, f native.Filter, srcU
 	done := make(chan struct{})
 	go func() {
 		defer func() { close(done) }()
-		if err := p.dst.ImportPipe(ctx, dstURL, pr); err != nil {
+		if err := p.dst.ImportPipe(ctx, dstURL, f.Match, pr); err != nil {
 			logger.Errorf("error initialize import pipe: %s", err)
 			return
 		}
@@ -153,6 +154,10 @@ func (p *vmNativeProcessor) runSingle(ctx context.Context, f native.Filter, srcU
 	if err := pw.Close(); err != nil {
 		return err
 	}
+	log.Printf("Request complete (%s), %s bytes, duration %v",
+						 f.Match,
+						 byteCountSI(int64(written)),
+						 time.Since(startTime))
 	<-done
 
 	return nil
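(byteCountSI, referenced in the log statement above, is just an SI-unit byte formatter, roughly along these lines; this is the commonly used Go wiki helper and may not be the exact implementation behind the patch.)

package main

import "fmt"

// byteCountSI formats a byte count with SI units (kB, MB, GB, ...), matching
// the "1.9 GB" / "128.7 MB" style in the log lines above.
func byteCountSI(b int64) string {
	const unit = 1000
	if b < unit {
		return fmt.Sprintf("%d B", b)
	}
	div, exp := int64(unit), 0
	for n := b / unit; n >= unit; n /= unit {
		div *= unit
		exp++
	}
	return fmt.Sprintf("%.1f %cB", float64(b)/float64(div), "kMGTPE"[exp])
}

func main() {
	fmt.Println(byteCountSI(1_900_000_000)) // prints "1.9 GB"
}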

Can you check whether the import and export requests actually finished? I think the goroutines finish their work.

Would you clarify what I should check?

As I mentioned, the vmctl progress display stayed at request 2762 of 2762 for over 30 minutes, during which time no points were ingested.

And it doesn’t happen on every run, only sometimes.

Sorry, I missed the times in the graph and in the vmctl logs. I will try to check this on our installation and see whether it is a bug or not.