milvus: [Bug]: Search fails when inserting data into Milvus; both datanode and querynode keep crashing

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.3
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  kafka  
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.2
- OS(Ubuntu or CentOS): AWS Linux
- CPU/Memory: 12 * 8C64G for query nodes, 1 * 4C8G for data node.
- GPU: No GPU
- Others:

Current Behavior

We started a 2.3.3 cluster with 12 * 8C64GB query nodes to store a large vector dataset, but we found the cluster is not stable.

  • Searches fail while data is being written into Milvus, with errors like `failed to search: leader not available: lastHB=2023-11-21 01:55:07.499389221 +0000 UTC: node=989: node offline: channel=by-dev-rootcoord-dml_9_445714102417550498v0: channel not available`.
  • Most of the query nodes keep crashing, the data node also crashes frequently, and the memory usage of the nodes fluctuates rapidly; see the screenshots and the search sketch below.
[Screenshots: query node and data node memory usage]
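For reference, a minimal sketch of the kind of search call that fails, using milvus-sdk-go v2; the collection name, vector field, dimension, and index parameters are assumptions, not taken from the report:

```go
package main

import (
	"context"
	"log"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
	"github.com/milvus-io/milvus-sdk-go/v2/entity"
)

func main() {
	ctx := context.Background()
	c, err := client.NewGrpcClient(ctx, "localhost:19530")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// nprobe=16 for an IVF_FLAT index; the index type and params are assumptions.
	sp, _ := entity.NewIndexIvfFlatSearchParam(16)
	vec := make([]float32, 128) // query vector; dimension is hypothetical

	// While bulk inserts are running, this call intermittently fails with
	// "leader not available ... channel not available" once querynodes go offline.
	_, err = c.Search(ctx, "my_collection", nil, "", []string{"id"},
		[]entity.Vector{entity.FloatVector(vec)}, "embedding",
		entity.L2, 10, sp)
	if err != nil {
		log.Println("search failed:", err)
	}
}
```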

Expected Behavior

We expect searches to work normally while we are inserting data.

Steps To Reproduce

No response

Milvus Log

milvus-log.tar.gz

Anything else?

No response

About this issue

  • Original URL
  • State: open
  • Created 7 months ago
  • Comments: 42 (20 by maintainers)

Most upvoted comments

@congguosn Please wait for the 2.2.17 or 2.3.4 releases, which will be available in the near future.

@yanliang567 @cndpzc `socket.connection.setup.timeout.ms` was added in this PR: https://github.com/milvus-io/milvus/pull/26686/files#diff-708136128f96993332a36f4ae9725c1bd66b50c092c7f5fe39b61be225e5ee24

The issue is that we set the value of this config key (the timeout) as an int64, but the Kafka client requires the type to be string, bool, int, or ConfigMap. We can disable the config for now and change the value type to string or int later.

@jiaoew1991 any ideas? `[2023/11/21 10:38:11.359 +00:00] [ERROR] [kafka/kafka_consumer.go:103] ["create kafka consumer failed"] [topic=by-dev-rootcoord-dml_10] [error="Invalid value type int64 for key socket.connection.setup.timeout.ms (expected string,bool,int,ConfigMap)"]`
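To make the type mismatch concrete, here is a minimal confluent-kafka-go sketch (the Go client that Milvus's Kafka consumer wraps) that reproduces the same error; the broker address and group id are placeholders:

```go
package main

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	cfg := &kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092", // placeholder broker
		"group.id":          "demo",           // placeholder group
		// Stored as int64: the ConfigMap accepts any value on assignment ...
		"socket.connection.setup.timeout.ms": int64(3000),
	}

	// ... but the type check runs when the consumer is created, yielding:
	// "Invalid value type int64 for key socket.connection.setup.timeout.ms
	//  (expected string,bool,int,ConfigMap)"
	if _, err := kafka.NewConsumer(cfg); err != nil {
		fmt.Println(err)
	}

	// Passing a plain int (or a string) satisfies the client.
	_ = cfg.SetKey("socket.connection.setup.timeout.ms", 3000)
}
```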

I think the insert throughput is too heavy, which causes two problems: 1) the datanode cannot flush the data in time; 2) the indexnode cannot build indexes for all the segments, so the querynodes have to load growing segments (without indexes), which leads to OOM. My suggestions:

  1. reduce the insert throughput to 20 MB/s (a client-side throttling sketch follows this list)
  2. add 3 more indexnodes (the more CPU cores, the faster index building goes)
  3. if you changed the shard_number in the collection schema, add more datanodes so their count equals the shard_number
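A minimal sketch of client-side throttling along the lines of suggestion 1, assuming a caller-supplied insert function and a rough per-batch size estimate (both hypothetical, not Milvus APIs):

```go
package main

import "time"

// throttledInsert runs doInsert once per batch while capping overall
// throughput near targetMBps (e.g. 20, per suggestion 1 above).
// batchSizeMB is the caller's estimate of one batch's payload size, and
// doInsert wraps the actual SDK insert call; both are assumptions.
func throttledInsert(numBatches int, batchSizeMB, targetMBps float64,
	doInsert func(batch int) error) error {
	minInterval := time.Duration(batchSizeMB / targetMBps * float64(time.Second))
	for i := 0; i < numBatches; i++ {
		start := time.Now()
		if err := doInsert(i); err != nil {
			return err
		}
		if elapsed := time.Since(start); elapsed < minInterval {
			time.Sleep(minInterval - elapsed) // stay under the throughput cap
		}
	}
	return nil
}

func main() {
	// Example: 100 batches of ~5 MB each, capped at 20 MB/s.
	_ = throttledInsert(100, 5, 20, func(i int) error {
		return nil // the real insert call goes here
	})
}
```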

@locustbaby could you quickly investigate this?