milvus: [Bug]: [chaos]Search failed with two replicas when one group is available, but another one is not
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: master-20220424-5c663004
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.1.0.dev33
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
Search works well in the first assert but raises errors in the second assert.
[2022-04-25 16:55:03 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=no shard leaders available for channel: by-dev-rootcoord-dml_1_432767386375097665v1, leaders: [8 9], err: fail to Search, QueryNode ID=9, reason=SharcCluster for by-dev-rootcoord-dml_1_432767386375097665v1 replicaID 432767088036937731 is no available)>, <Time:{'RPC start': '2022-04-25 16:55:03.240156', 'RPC error': '2022-04-25 16:55:03.429824'}> (decorators.py:73)
After this chaos test, the load operation times out for every other collection, even ones created after the chaos test.
[2022-04-25 17:05:43 - ERROR - pymilvus.decorators]: grpc RpcError: [load_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-04-25 17:05:23.598932', 'gRPC error': '2022-04-25 17:05:43.603717'}> (decorators.py:81)
[2022-04-25 17:06:04 - ERROR - pymilvus.decorators]: grpc RpcError: [load_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-04-25 17:05:44.608748', 'gRPC error': '2022-04-25 17:06:04.612788'}> (decorators.py:81)
[2022-04-25 17:06:25 - ERROR - pymilvus.decorators]: grpc RpcError: [load_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-04-25 17:06:05.618389', 'gRPC error': '2022-04-25 17:06:25.623273'}> (decorators.py:81)
[2022-04-25 17:06:25 - ERROR - ci_test]: Traceback (most recent call last):
Expected Behavior
With multiple replicas, search should still work when one of the replicas is unavailable.
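To make the expected behavior concrete, here is a minimal sketch of replica failover. It is not Milvus's actual routing code: `search_with_failover`, `NoReplicaAvailable`, and the per-replica callables are hypothetical names used only for illustration. The point is that a request should fail only when every replica group fails, not when the first one tried is down.

```python
from typing import Callable, Dict, List


class NoReplicaAvailable(Exception):
    """Raised only when every replica group has failed (hypothetical name)."""


def search_with_failover(replicas: Dict[int, Callable[[], List[int]]]) -> List[int]:
    """Try each replica group in turn and return the first successful result.

    `replicas` maps a replica ID to a zero-argument callable that performs
    the search on that group. Both are stand-ins for illustration.
    """
    errors: Dict[int, str] = {}
    for replica_id, search in replicas.items():
        try:
            return search()
        except Exception as exc:  # this group is unavailable, try the next one
            errors[replica_id] = str(exc)
    raise NoReplicaAvailable(f"all replica groups failed: {errors}")
```

With two groups where one raises and the other answers, the call returns the healthy group's result, which is the behavior the second assert in the reproduction steps expects.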
Steps To Reproduce
1. Deploy Milvus with 5 querynodes.
2. Create a collection, insert data, load it with 2 replicas, and search.
3. First assert: expect all search requests to succeed.
4. Inject a pod failure into one querynode, so that one replica group becomes unavailable.
5. Second assert: expect all search requests to succeed, because one replica group can still serve requests.
6. Delete the chaos and wait for the querynode to become ready.
7. Third assert: expect all search requests to succeed.
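The client side of the steps above can be sketched roughly as follows. This is not the actual chaos test script. The host, port, collection name, field names, dimension, and index parameters are all assumptions, and the chaos injection itself (steps 4 and 6) happens outside this code. `run_searches` and `assert_all_succeed` are hypothetical helpers that would be called once per assert.

```python
import random
from typing import Callable, List, Tuple


def run_searches(search_once: Callable[[], object], attempts: int = 10) -> Tuple[int, List[str]]:
    """Call `search_once` repeatedly; return (success_count, error_messages)."""
    successes = 0
    errors: List[str] = []
    for _ in range(attempts):
        try:
            search_once()
            successes += 1
        except Exception as exc:
            errors.append(str(exc))
    return successes, errors


def assert_all_succeed(search_once: Callable[[], object], attempts: int = 10, label: str = "") -> None:
    """Fail if any of the `attempts` search calls raised."""
    successes, errors = run_searches(search_once, attempts)
    assert successes == attempts, (
        f"{label}: {len(errors)} of {attempts} searches failed; first error: {errors[0]}"
    )


def make_milvus_search(name: str = "chaos_replica_demo", dim: int = 8) -> Callable[[], object]:
    """Step 2: create a collection, insert data, load with 2 replicas.

    Requires a running Milvus cluster; address and schema are assumptions.
    """
    # Imported lazily so the helpers above work without pymilvus installed.
    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

    connections.connect(host="127.0.0.1", port="19530")  # assumed address
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("vec", DataType.FLOAT_VECTOR, dim=dim),
    ])
    collection = Collection(name, schema)
    ids = list(range(1000))
    vectors = [[random.random() for _ in range(dim)] for _ in ids]
    collection.insert([ids, vectors])
    collection.create_index(
        "vec", {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 64}}
    )
    collection.load(replica_number=2)  # two replica groups across the querynodes
    params = {"metric_type": "L2", "params": {"nprobe": 10}}
    return lambda: collection.search([vectors[0]], "vec", params, limit=5)
```

Against a live cluster, one would build `search_once = make_milvus_search()` and call `assert_all_succeed(search_once, label="first assert")` before the chaos, again while one querynode pod is failed, and once more after recovery. In this report it is the second call that fails.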
Anything else?
About this issue
- State: closed
- Created 2 years ago
- Comments: 20 (20 by maintainers)
This was fixed with #16653 (60f7fef), which is newer than the tested commit.