VictoriaMetrics: vminsert: issues with the reroute mechanism when a storage node is temporarily unavailable

Describe the bug

Once one or more of the storage nodes gets killed, our VM cluster loses the ability to ingest data into storage, and it is hard to recover without manual intervention. The behaviour is:

  1. High rerouted_from rate, not only from the down nodes but also from other healthy nodes (basically all the nodes).

  2. Slow data ingestion rate. The normal rate is about 25k rows/sec per node, but the minimum observed rate drops under 1k per node. The graph is from one node; the others look the same.

  3. High index_write rate, more than 50x the normal rate.

    This is not an official panel from the Grafana vm-cluster dashboard; the MetricsQL is: sum(rate(vm_rows{type="indexdb"}[10m]))


  4. High IO and CPU usage. Not sure whether this is a cause or a symptom.

PS: We have over 100 vmstorage nodes, and a keep-alive process relaunches a storage node immediately (within 1 minute) after it gets killed.


Given the high rerouted_from and index_write rates, we suspect the problem is caused by rerouting in vminsert. Here is our hypothesis, based on our cases. There are two main reasons a vmstorage node gets killed in our environment:

  1. vmstorage is killed manually, or high memory usage from other processes gets vmstorage OOM-killed
  2. slow queries increase vmstorage's memory usage until it gets OOM-killed

After a storage node goes down, vminsert reroutes its data to the other healthy nodes. The extra data increases their resource usage (IO, CPU), their data ingestion slows down, so the reroute mechanism starts rerouting their data to yet other healthy nodes as well, and boom: an avalanche. A toy model of this feedback loop is sketched below.
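To make the feedback loop concrete, here is a toy simulation of the hypothesis (illustrative only, nothing like the real vminsert code; the 25k rows/sec figure is from our graphs above, and the per-node capacity is a made-up number):

```go
package main

import "fmt"

func main() {
	const (
		nodes       = 10
		perNodeLoad = 25_000.0 // normal ingestion rate per node (from the graphs above)
		capacity    = 27_000.0 // hypothetical max sustainable rate per node
	)
	healthy := nodes - 1        // one node gets killed
	reroutedLoad := perNodeLoad // its traffic has to go somewhere
	for round := 1; healthy > 0; round++ {
		perNode := perNodeLoad + reroutedLoad/float64(healthy)
		fmt.Printf("round %d: %d healthy nodes, %.0f rows/sec each\n", round, healthy, perNode)
		if perNode <= capacity {
			fmt.Println("the cluster absorbs the extra load")
			return
		}
		// Overloaded nodes start lagging, so their traffic gets rerouted
		// too; in this toy model one more node drops out of the healthy
		// set each round.
		reroutedLoad += perNodeLoad
		healthy--
	}
	fmt.Println("no node can keep up any more - avalanche")
}
```

With these made-up numbers, every round the rerouted traffic pushes one more node past its sustainable rate, so its data gets rerouted as well and the whole cluster grinds down.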


To reproduce the situation, we built a cluster and used other means to keep IO and CPU usage high, while scraping part of our production data into the cluster with vmagent. Everything was fine until we shut down one of the nodes, at which point the situation described above showed up.

To test the hypothesis, we patched vminsert so that it stops rerouting data destined for storage-x while still rerouting data for other nodes (we simply drop the data instead of rerouting it); a rough sketch of the change follows. We performed two operations: at 18:00 on storage-6 and at 18:07 on storage-5. As the graphs show, after shutting down and restarting storage-6 at 18:00 everything looks fine, because rerouting from storage-6 is disabled. But the same situation comes up when we shut down and restart storage-5 at 18:07.
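For reference, this is roughly the shape of the change we made (hypothetical names and a simplified structure, not the actual patch against the vminsert source):

```go
package main

import "fmt"

// noRerouteNodes is the set of nodes for which rerouting is disabled during
// the experiment (first storage-6, later storage-5). Names are hypothetical.
var noRerouteNodes = map[string]bool{
	"storage-6:8400": true,
}

// droppedRows counts rows thrown away instead of rerouted; in the real patch
// this would be an exposed metric rather than a plain variable.
var droppedRows uint64

// routeRow decides what happens to a row whose primary target node is down.
func routeRow(targetAddr string, row []byte, targetReachable bool, reroute func([]byte)) {
	if targetReachable {
		// Normal path: send the row to its target node (omitted here).
		return
	}
	if noRerouteNodes[targetAddr] {
		// Experiment: drop the row instead of spreading it across the
		// remaining healthy nodes.
		droppedRows++
		return
	}
	// Default behaviour: reroute the row to some other healthy node.
	reroute(row)
}

func main() {
	reroute := func(row []byte) { fmt.Println("rerouted:", string(row)) }
	routeRow("storage-6:8400", []byte(`cpu{host="a"} 1`), false, reroute) // dropped
	routeRow("storage-5:8400", []byte(`cpu{host="b"} 2`), false, reroute) // rerouted
	fmt.Println("dropped rows:", droppedRows)
}
```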

Version v1.39.4-cluster

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

Reasonable 😃 Still, the reroute algorithm performs rather poorly when we hit the worst case (which is every time in our environment). Anyway, it's a good solution in our case; we'll keep looking for a more efficient one, though. Thanks for the reply.

Maybe vminsert should recalculate the hash to determine which vmstorage node the request should be rerouted to, so that all the other vmstorage nodes share the burden of the one offline vmstorage node?

This is how it works right now: if a certain vmstorage node is temporarily unavailable, then all the incoming data for this node is spread across the remaining vmstorage nodes. See https://github.com/VictoriaMetrics/VictoriaMetrics/blob/1ee5a234dcfb83f8457f1ede1cbe5197db4a7c42/app/vminsert/netstorage/netstorage.go#L607-L612 for details.
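In other words, a minimal sketch of that behaviour (simplified; the real code uses its own hashing and buffering, and the node names here are made up):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type storageNode struct {
	addr      string
	reachable bool
}

// pickNode hashes the metric name+labels to choose a target vmstorage node.
// If that node is currently unreachable, the row is rerouted to one of the
// remaining reachable nodes, so the missing node's traffic is spread over
// all the healthy ones.
func pickNode(nodes []*storageNode, metric string) *storageNode {
	h := fnv.New64a()
	h.Write([]byte(metric))
	sum := h.Sum64()
	if sn := nodes[sum%uint64(len(nodes))]; sn.reachable {
		return sn
	}
	var healthy []*storageNode
	for _, sn := range nodes {
		if sn.reachable {
			healthy = append(healthy, sn)
		}
	}
	if len(healthy) == 0 {
		return nil // nothing reachable; the caller has to retry or buffer
	}
	return healthy[sum%uint64(len(healthy))]
}

func main() {
	nodes := []*storageNode{
		{addr: "storage-1:8400", reachable: true},
		{addr: "storage-2:8400", reachable: false}, // temporarily down
		{addr: "storage-3:8400", reachable: true},
	}
	for _, m := range []string{`cpu{host="a"}`, `cpu{host="b"}`, `mem{host="c"}`} {
		fmt.Println(m, "->", pickNode(nodes, m).addr)
	}
}
```

In this toy model a rerouted series may land on a node that has never seen it before, so it has to be registered in that node's indexdb first; that would be consistent with the elevated index_write rate reported above.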

What I am trying to say in this issue is that the rerouting mechanism in vminsert may not perform as well as expected in an actual production environment under heavy pressure. And when this situation happens, it is hard to recover without manual intervention.
What we do to recover:

  1. stop vminsert and let the data queue up at vmagent.
  2. start the vminsert instances one by one after IO/CPU usage, index_write rate and ingestion rate are back to normal.
  3. sometimes we also decrease the number of vmselect instances

This, of course, makes the VM service unavailable for a while.

Any ideas about this issue? @valyala

If you need any details, I'm available on Slack. I'll also keep updating this issue with new info.