milvus: [Bug]: Search failed when we insert data into Milvus, both datanode and querynode keep crashing
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: 2.3.3
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.2
- OS(Ubuntu or CentOS): AWS Linux
- CPU/Memory: 12 * 8C64G for query node, 1 * 4C8G for data node.
- GPU: No GPU
- Others:
Current Behavior
We started a 2.3.3 cluster with 12 * 8C64G query nodes to store a large vector set, but we found the cluster is not stable.
- The search fails while data is being written into Milvus. The failure message looks like:
failed to search: leader not available: lastHB=2023-11-21 01:55:07.499389221 +0000 UTC: node=989: node offline: channel=by-dev-rootcoord-dml_9_445714102417550498v0: channel not available.
- Most of the query nodes keep failing, the data node also crashes frequently, and the memory usage of the nodes changes very quickly, see pictures below.
Expected Behavior
We expect the search can work normally when we’re inserting data.
Steps To Reproduce
No response
Milvus Log
Anything else?
No response
About this issue
- State: open
- Created 7 months ago
- Comments: 42 (20 by maintainers)
Commits related to this issue
- fix: Fix kafka config type error (#28642) issue https://github.com/milvus-io/milvus/issues/28588 --------- Signed-off-by: Enwei Jiao <enwei.jiao@zilliz.com> — committed to milvus-io/milvus by jiaoew1991 7 months ago
- fix: disable reset kafka connection timeout (#28681) pr: https://github.com/milvus-io/milvus/pull/28642 issue https://github.com/milvus-io/milvus/issues/28588 Signed-off-by: Enwei Jiao <enwei.jiao@z... — committed to milvus-io/milvus by jiaoew1991 7 months ago
- fix: disable reset kafka connection timeout (#28681) pr: https://github.com/milvus-io/milvus/pull/28642 issue https://github.com/milvus-io/milvus/issues/28588 Signed-off-by: Enwei Jiao <enwei.jiao@z... — committed to longjiquan/milvus by jiaoew1991 7 months ago
- fix: disable reset kafka connection timeout (#28681) (#29286) pr: https://github.com/milvus-io/milvus/pull/28642 issue https://github.com/milvus-io/milvus/issues/28588 Signed-off-by: Enwei Jiao <... — committed to milvus-io/milvus by longjiquan 6 months ago
@congguosn Please wait for the upcoming releases of version 2.2.17 or 2.3.4, which will be available in the near future.
The issue is that we set the value of a config key (the timeout here) as an int64, but the Kafka client requires the value type to be bool, int, or string. We can disable the config for now and change the value type to string or int later.
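To illustrate the type mismatch described above, here is a minimal sketch in Go. It does not use the real Kafka client: `checkConfigValue` is a hypothetical stand-in that only mimics the kind of type check reported in the error, and the timeout value of 30000 is an assumption chosen for illustration.

```go
package main

import "fmt"

// checkConfigValue is a hypothetical stand-in for the type check a Kafka
// client might apply to config values. It is NOT the library's real code.
func checkConfigValue(key string, value interface{}) error {
	switch value.(type) {
	case string, bool, int:
		return nil
	default:
		return fmt.Errorf("invalid value type %T for key %s (expected string, bool, or int)", value, key)
	}
}

func main() {
	const key = "socket.connection.setup.timeout.ms"
	timeout := int64(30000) // assumed value, for illustration only

	// Passing the int64 directly is rejected by the check.
	fmt.Println(checkConfigValue(key, timeout))

	// Converting to int or to string passes the check.
	fmt.Println(checkConfigValue(key, int(timeout)))
	fmt.Println(checkConfigValue(key, fmt.Sprintf("%d", timeout)))
}
```

In Go, `int64` and `int` are distinct types even on 64-bit platforms, so a type switch that lists only `int` does not match an `int64` value. That would be consistent with the proposed fix of switching the value to string or int.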
@jiaoew1991 any ideas?
[2023/11/21 10:38:11.359 +00:00] [ERROR] [kafka/kafka_consumer.go:103] ["create kafka consumer failed"] [topic=by-dev-rootcoord-dml_10] [error="Invalid value type int64 for key socket.connection.setup.timeout.ms (expected string,bool,int,ConfigMap)"]
I think the insert throughput is too heavy, which causes: 1) the datanode is not able to flush the data in time; 2) the indexnode is not able to build indexes for all the segments, so the querynodes have to load growing segments (without index), which causes OOM. My suggestion is
@locustbaby could you quickly investigate on it?