milvus: [Bug]: milvus-querynode pod hangs, and hangs again after ~20 minutes of running once the hung pod is restarted
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: v2.1.0
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2): 2.0.2
- OS (Ubuntu or CentOS): Ubuntu 18.04
- CPU/Memory: 6C / 94GB
- GPU:
- Others:
Current Behavior
This is a freshly installed environment that had been running normally for 2 days. CPU usage was below 20%, memory usage below 50%, load average 7.08. The Milvus instance holds 20 collections with 3000+ dimensions and roughly 100,000 entities in total. The milvus-querynode container suddenly hung; after restarting it, it ran normally for about 20 minutes and then hung again. Restarting it once more brought it back temporarily.
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
Error messages from the business system (developed in Python; business-specific information has been removed):
2022-08-03 09:45:58.677 | INFO | data_pack:131 - **********
2022-08-03 09:46:39.385 | ERROR | kafka_consumer:220 - <BaseException: (code=1, message=err: failed to connect 172.27.0.10:21123, reason: context deadline exceeded
/go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call
/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search
/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:445 github.com/milvus-io/milvus/internal/proxy.(*searchTask).searchShard.func1
/go/src/github.com/milvus-io/milvus/internal/proxy/task_policies.go:55 github.com/milvus-io/milvus/internal/proxy.roundRobinPolicy
/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:465 github.com/milvus-io/milvus/internal/proxy.(*searchTask).searchShard
/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:324 github.com/milvus-io/milvus/internal/proxy.(*searchTask).Execute.func1.1
/go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57 golang.org/x/sync/errgroup.(*Group).Go.func1
/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit
)>
2022-08-03 09:47:15.265 | ERROR | kafka_consumer:220 - <BaseException: (code=1, message=All attempts results:
attempt #1: fail to get shard leaders from QueryCoord: no replica available
attempt #2: fail to get shard leaders from QueryCoord: no replica available
attempt #3: fail to get shard leaders from QueryCoord: no replica available
attempt #4: fail to get shard leaders from QueryCoord: no replica available
attempt #5: fail to get shard leaders from QueryCoord: no replica available
attempt #6: fail to get shard leaders from QueryCoord: no replica available
attempt #7: fail to get shard leaders from QueryCoord: no replica available
attempt #8: context deadline exceeded
)>
…ERROR…
2022-08-03 09:49:32.845 | ERROR | kafka_consumer:220 - <BaseException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=22, reason=ShardCluster for by-dev-rootcoord-dml_5_434981356418629633v1 replicaID 434981333510389770 is no available)>
Error messages from the milvus-querynode container:
[2022/08/03 01:46:01.305 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851788] [segmentID=435017738016522241] [segmentType=Growing] [loadIndex=false]
terminate called after throwing an instance of 'faiss::FaissException'
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851796, fromSharedLeader = true, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [434987259691794433] (170ms)"]
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["start reduce search result, msgID = 435024794290851796, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [] (171ms)"]
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [querynode/result.go:40] ["shard leader get valid search results"] [numbers=2]
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=0] [nq=1] [topk=50]
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=1] [nq=1] [topk=50]
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [querynode/result.go:135] ["skip duplicated search result"] [count=3]
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851796, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [] (171ms)"]
[2022/08/03 01:46:01.311 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851796] [segmentID=435017738016522241] [segmentType=Growing] [loadIndex=false]
[2022/08/03 01:46:01.315 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851788] [segmentID=435017458584125441] [segmentType=Growing] [loadIndex=false]
[2022/08/03 01:46:01.321 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851788] [segmentID=434987259691794433] [segmentType=Sealed] [loadIndex=true]
[2022/08/03 01:46:01.322 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851796] [segmentID=434987272261861377] [segmentType=Sealed] [loadIndex=true]
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851788, fromSharedLeader = true, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [434987259691794433] (374ms)"]
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["start reduce search result, msgID = 435024794290851788, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [] (374ms)"]
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [querynode/result.go:40] ["shard leader get valid search results"] [numbers=2]
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=0] [nq=1] [topk=20]
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=1] [nq=1] [topk=20]
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [querynode/result.go:135] ["skip duplicated search result"] [count=1]
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851788, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [] (375ms)"]
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851796, fromSharedLeader = true, vChannel = by-dev-rootcoord-dml_45_434987254370533377v1, segmentIDs = [434987272261861377] (187ms)"]
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["start reduce search result, msgID = 435024794290851796, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_45_434987254370533377v1, segmentIDs = [] (188ms)"]
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [querynode/result.go:40] ["shard leader get valid search results"] [numbers=2]
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=0] [nq=1] [topk=50]
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=1] [nq=1] [topk=50]
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [querynode/result.go:135] ["skip duplicated search result"] [count=3]
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851796, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_45_434987254370533377v1, segmentIDs = [] (188ms)"]
terminate called recursively
Anything else?
No response
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 19 (18 by maintainers)
This issue occurs because, except for IDMAP and ANNOY, all other knowhere index types are not thread-safe during search.
In segcore, however, all search requests on a segment share the same knowhere::VecIndexPtr. When concurrent search requests use different search parameters, for example different 'nprobe' values, the memory they use overlaps, which matches the faiss::FaissException and the crash seen in the querynode log.
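For illustration only, here is a minimal C++ sketch of the failure mode and one possible mitigation; it is not Milvus code, and VecIndex, SegmentSearcher, and all member names are hypothetical. The idea: if a shared index keeps per-search scratch state sized by nprobe, concurrent searches with different nprobe values resize the same buffer and overlap in memory, so one option is to serialize access to the shared index with a per-segment mutex.

```cpp
// Hypothetical sketch (not the Milvus/knowhere implementation).
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Stand-in for an index whose Search() is NOT thread-safe because it mutates
// an internal scratch buffer whose size depends on the per-request `nprobe`.
struct VecIndex {
    std::vector<float> scratch;  // shared internal state, resized per request
    std::vector<int64_t> Search(const std::vector<float>& query,
                                int64_t topk, int64_t nprobe) {
        (void)query;
        // Two concurrent callers with different nprobe values would race here.
        scratch.assign(static_cast<size_t>(nprobe) * 128, 0.0f);
        // ... distance computation elided ...
        return std::vector<int64_t>(static_cast<size_t>(topk), -1);
    }
};

// One segment shares one index across all requests; a per-segment mutex
// serializes searches so the non-thread-safe index is never entered twice.
class SegmentSearcher {
public:
    explicit SegmentSearcher(std::shared_ptr<VecIndex> index)
        : index_(std::move(index)) {}

    std::vector<int64_t> Search(const std::vector<float>& query,
                                int64_t topk, int64_t nprobe) {
        std::lock_guard<std::mutex> guard(mutex_);  // serialize access
        return index_->Search(query, topk, nprobe);
    }

private:
    std::shared_ptr<VecIndex> index_;  // shared by every request on this segment
    std::mutex mutex_;
};

int main() {
    auto index = std::make_shared<VecIndex>();
    SegmentSearcher searcher(index);
    std::vector<float> query(128, 0.0f);
    // Two searches with different topk/nprobe, as in the reported logs
    // (topk=50 and topk=20); without the mutex they could run concurrently
    // and resize `scratch` from different threads.
    searcher.Search(query, /*topk=*/50, /*nprobe=*/16);
    searcher.Search(query, /*topk=*/20, /*nprobe=*/64);
    return 0;
}
```

Serializing searches trades concurrency for safety; the other direction is to make the search path re-entrant by keeping per-request state local to each call, which is what the non-thread-safe index types lack.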