milvus: [Bug]: Search fails when inserting data into Milvus; both datanode and querynode keep crashing

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.3
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  kafka  
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.2
- OS(Ubuntu or CentOS): AWS Linux
- CPU/Memory: 12 * 8C64G for query nodes, 1 * 4C8G for data node.
- GPU: No GPU
- Others:

Current Behavior

We started a 2.3.3 cluster with 12 * 8C64GB query nodes to store a large vector dataset, but we found the cluster is not stable.

  • Searches fail while data is being written into Milvus, with errors like `failed to search: leader not available: lastHB=2023-11-21 01:55:07.499389221 +0000 UTC: node=989: node offline: channel=by-dev-rootcoord-dml_9_445714102417550498v0: channel not available`.
  • Most of the query nodes keep crashing, the data node also crashes frequently, and the memory usage of the nodes fluctuates rapidly; see the screenshots and the search sketch below.
[Screenshots: query node and data node memory usage]
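For reference, a minimal sketch of the kind of search call that fails, using milvus-sdk-go v2; the collection name, vector field, dimension, and index parameters are assumptions, not taken from the report:

```go
package main

import (
	"context"
	"log"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
	"github.com/milvus-io/milvus-sdk-go/v2/entity"
)

func main() {
	ctx := context.Background()
	c, err := client.NewGrpcClient(ctx, "localhost:19530")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// nprobe=16 for an IVF_FLAT index; the index type and params are assumptions.
	sp, _ := entity.NewIndexIvfFlatSearchParam(16)
	vec := make([]float32, 128) // query vector; dimension is hypothetical

	// While bulk inserts are running, this call intermittently fails with
	// "leader not available ... channel not available" once querynodes go offline.
	_, err = c.Search(ctx, "my_collection", nil, "", []string{"id"},
		[]entity.Vector{entity.FloatVector(vec)}, "embedding",
		entity.L2, 10, sp)
	if err != nil {
		log.Println("search failed:", err)
	}
}
```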

Expected Behavior

We expect searches to work normally while we are inserting data.

Steps To Reproduce

No response

Milvus Log

milvus-log.tar.gz

Anything else?

No response

About this issue

  • Original URL
  • State: open
  • Created 7 months ago
  • Comments: 42 (20 by maintainers)

Most upvoted comments

@congguosn Please wait for the 2.2.17 or 2.3.4 releases, which will be available in the near future.

@yanliang567 @cndpzc `socket.connection.setup.timeout.ms` was added in this PR: https://github.com/milvus-io/milvus/pull/26686/files#diff-708136128f96993332a36f4ae9725c1bd66b50c092c7f5fe39b61be225e5ee24

The issue is that we set the value of this config key (the timeout) as an int64, but the Kafka client requires the type to be string, bool, int, or ConfigMap. We can disable the config for now and change the value type to string or int later.

@jiaoew1991 any ideas? `[2023/11/21 10:38:11.359 +00:00] [ERROR] [kafka/kafka_consumer.go:103] ["create kafka consumer failed"] [topic=by-dev-rootcoord-dml_10] [error="Invalid value type int64 for key socket.connection.setup.timeout.ms (expected string,bool,int,ConfigMap)"]`
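To make the type mismatch concrete, here is a minimal confluent-kafka-go sketch (the Go client that Milvus's Kafka consumer wraps) that reproduces the same error; the broker address and group id are placeholders:

```go
package main

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	cfg := &kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092", // placeholder broker
		"group.id":          "demo",           // placeholder group
		// Stored as int64: the ConfigMap accepts any value on assignment ...
		"socket.connection.setup.timeout.ms": int64(3000),
	}

	// ... but the type check runs when the consumer is created, yielding:
	// "Invalid value type int64 for key socket.connection.setup.timeout.ms
	//  (expected string,bool,int,ConfigMap)"
	if _, err := kafka.NewConsumer(cfg); err != nil {
		fmt.Println(err)
	}

	// Passing a plain int (or a string) satisfies the client.
	_ = cfg.SetKey("socket.connection.setup.timeout.ms", 3000)
}
```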

I think the insert throughput is too heavy, which causes two problems: 1) the datanode cannot flush the data in time; 2) the indexnode cannot build indexes for all the segments, so the querynodes have to load growing segments (without indexes), which leads to OOM. My suggestions:

  1. reduce the insert throughput to 20 MB/s (a client-side throttling sketch follows this list)
  2. add 3 more indexnodes (the more CPU cores, the faster index building goes)
  3. if you changed the shard_number in the collection schema, add more datanodes so their count equals the shard_number
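A minimal sketch of client-side throttling along the lines of suggestion 1, assuming a caller-supplied insert function and a rough per-batch size estimate (both hypothetical, not Milvus APIs):

```go
package main

import "time"

// throttledInsert runs doInsert once per batch while capping overall
// throughput near targetMBps (e.g. 20, per suggestion 1 above).
// batchSizeMB is the caller's estimate of one batch's payload size, and
// doInsert wraps the actual SDK insert call; both are assumptions.
func throttledInsert(numBatches int, batchSizeMB, targetMBps float64,
	doInsert func(batch int) error) error {
	minInterval := time.Duration(batchSizeMB / targetMBps * float64(time.Second))
	for i := 0; i < numBatches; i++ {
		start := time.Now()
		if err := doInsert(i); err != nil {
			return err
		}
		if elapsed := time.Since(start); elapsed < minInterval {
			time.Sleep(minInterval - elapsed) // stay under the throughput cap
		}
	}
	return nil
}

func main() {
	// Example: 100 batches of ~5 MB each, capped at 20 MB/s.
	_ = throttledInsert(100, 5, 20, func(i int) error {
		return nil // the real insert call goes here
	})
}
```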

@locustbaby could you quickly investigate this?