VictoriaMetrics: vminsert: issues with the reroute mechanism when a storage node is temporarily unavailable

Describe the bug

Once one or more of the storage nodes gets killed, our VM cluster loses the ability to ingest data into storage, and it is hard to recover without manual intervention. The behaviour is:

  1. High rerouted_from rate, not only from the down nodes but also from other healthy nodes (basically all the nodes).

  2. Slow data ingestion rate. The normal rate is about 25k rows/sec per node, but the minimum observed rate drops under 1k per node. The graph is from one node; the others look the same.

  3. High index_write rate, more than 50x the normal rate.

    This is not an official panel from the Grafana vm-cluster dashboard; the MetricsQL is: sum(rate(vm_rows{type="indexdb"}[10m]))


  4. High IO and CPU usage. Not sure whether this is a cause or a symptom.

PS: We have over 100 vmstorage nodes, and a keep-alive process relaunches a storage node immediately (within 1 minute) after it gets killed.


Given the high rerouted_from and index_write rates, we suspect the problem is caused by rerouting in vminsert. Here is our hypothesis, based on our cases. There are two main reasons a vmstorage node gets killed in our environment:

  1. vmstorage is killed manually, or high memory usage from other processes gets vmstorage OOM-killed
  2. slow queries increase vmstorage's memory usage until it gets OOM-killed

After a storage node goes down, vminsert reroutes its data to the other healthy nodes. The extra data increases their resource usage (IO, CPU), their data ingestion slows down, so the reroute mechanism starts rerouting their data to yet other healthy nodes as well, and boom: an avalanche. A toy model of this feedback loop is sketched below.
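To make the feedback loop concrete, here is a toy simulation of the hypothesis (illustrative only, nothing like the real vminsert code; the 25k rows/sec figure is from our graphs above, and the per-node capacity is a made-up number):

```go
package main

import "fmt"

func main() {
	const (
		nodes       = 10
		perNodeLoad = 25_000.0 // normal ingestion rate per node (from the graphs above)
		capacity    = 27_000.0 // hypothetical max sustainable rate per node
	)
	healthy := nodes - 1        // one node gets killed
	reroutedLoad := perNodeLoad // its traffic has to go somewhere
	for round := 1; healthy > 0; round++ {
		perNode := perNodeLoad + reroutedLoad/float64(healthy)
		fmt.Printf("round %d: %d healthy nodes, %.0f rows/sec each\n", round, healthy, perNode)
		if perNode <= capacity {
			fmt.Println("the cluster absorbs the extra load")
			return
		}
		// Overloaded nodes start lagging, so their traffic gets rerouted
		// too; in this toy model one more node drops out of the healthy
		// set each round.
		reroutedLoad += perNodeLoad
		healthy--
	}
	fmt.Println("no node can keep up any more - avalanche")
}
```

With these made-up numbers, every round the rerouted traffic pushes one more node past its sustainable rate, so its data gets rerouted as well and the whole cluster grinds down.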


To reproduce the situation, we built a cluster and used other means to keep IO and CPU usage high, while scraping part of our production data into the cluster with vmagent. Everything was fine until we shut down one of the nodes, at which point the situation described above showed up.

To test the hypothesis, we patched vminsert so that it stops rerouting data destined for storage-x while still rerouting data for other nodes (we simply drop the data instead of rerouting it); a rough sketch of the change follows. We performed two operations: at 18:00 on storage-6 and at 18:07 on storage-5. As the graphs show, after shutting down and restarting storage-6 at 18:00 everything looks fine, because rerouting from storage-6 is disabled. But the same situation comes up when we shut down and restart storage-5 at 18:07.
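For reference, this is roughly the shape of the change we made (hypothetical names and a simplified structure, not the actual patch against the vminsert source):

```go
package main

import "fmt"

// noRerouteNodes is the set of nodes for which rerouting is disabled during
// the experiment (first storage-6, later storage-5). Names are hypothetical.
var noRerouteNodes = map[string]bool{
	"storage-6:8400": true,
}

// droppedRows counts rows thrown away instead of rerouted; in the real patch
// this would be an exposed metric rather than a plain variable.
var droppedRows uint64

// routeRow decides what happens to a row whose primary target node is down.
func routeRow(targetAddr string, row []byte, targetReachable bool, reroute func([]byte)) {
	if targetReachable {
		// Normal path: send the row to its target node (omitted here).
		return
	}
	if noRerouteNodes[targetAddr] {
		// Experiment: drop the row instead of spreading it across the
		// remaining healthy nodes.
		droppedRows++
		return
	}
	// Default behaviour: reroute the row to some other healthy node.
	reroute(row)
}

func main() {
	reroute := func(row []byte) { fmt.Println("rerouted:", string(row)) }
	routeRow("storage-6:8400", []byte(`cpu{host="a"} 1`), false, reroute) // dropped
	routeRow("storage-5:8400", []byte(`cpu{host="b"} 2`), false, reroute) // rerouted
	fmt.Println("dropped rows:", droppedRows)
}
```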

Version v1.39.4-cluster

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

Reasonable 😃 Still, the reroute algorithm performs rather poorly when we hit the worst case (which is every time in our environment). Anyway, it's a good solution in our case; we'll keep looking for a more efficient one, though. Thanks for the reply.

Maybe vminsert should recalculate the hash to determine which vmstorage node the request should be rerouted to, so that all the other vmstorage nodes share the burden of the one offline vmstorage node?

This is how it works right now: if a certain vmstorage node is temporarily unavailable, then all the incoming data for this node is spread across the remaining vmstorage nodes. See https://github.com/VictoriaMetrics/VictoriaMetrics/blob/1ee5a234dcfb83f8457f1ede1cbe5197db4a7c42/app/vminsert/netstorage/netstorage.go#L607-L612 for details.
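In other words, a minimal sketch of that behaviour (simplified; the real code uses its own hashing and buffering, and the node names here are made up):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type storageNode struct {
	addr      string
	reachable bool
}

// pickNode hashes the metric name+labels to choose a target vmstorage node.
// If that node is currently unreachable, the row is rerouted to one of the
// remaining reachable nodes, so the missing node's traffic is spread over
// all the healthy ones.
func pickNode(nodes []*storageNode, metric string) *storageNode {
	h := fnv.New64a()
	h.Write([]byte(metric))
	sum := h.Sum64()
	if sn := nodes[sum%uint64(len(nodes))]; sn.reachable {
		return sn
	}
	var healthy []*storageNode
	for _, sn := range nodes {
		if sn.reachable {
			healthy = append(healthy, sn)
		}
	}
	if len(healthy) == 0 {
		return nil // nothing reachable; the caller has to retry or buffer
	}
	return healthy[sum%uint64(len(healthy))]
}

func main() {
	nodes := []*storageNode{
		{addr: "storage-1:8400", reachable: true},
		{addr: "storage-2:8400", reachable: false}, // temporarily down
		{addr: "storage-3:8400", reachable: true},
	}
	for _, m := range []string{`cpu{host="a"}`, `cpu{host="b"}`, `mem{host="c"}`} {
		fmt.Println(m, "->", pickNode(nodes, m).addr)
	}
}
```

In this toy model a rerouted series may land on a node that has never seen it before, so it has to be registered in that node's indexdb first; that would be consistent with the elevated index_write rate reported above.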

What I am trying to say in this issue is that the rerouting mechanism in vminsert may not perform as well as expected in an actual production environment under heavy pressure. And when this situation happens, it is hard to recover without manual intervention.
What we do to recover:

  1. stop vminsert and let the data queue up at vmagent.
  2. start the vminsert instances one by one after IO/CPU usage, index_write rate and ingestion rate are back to normal.
  3. sometimes we also decrease the number of vmselect instances

This, of course, makes the VM service unavailable for a while.

Any ideas about this issue? @valyala

If you need any details, I'm available on Slack. I'll also keep updating this issue with new info.