milvus: [Bug]: [chaos]Search failed with two replicas when one group is available, but another one is not

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20220424-5c663004
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.1.0.dev33
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Search works well in the first assert but raises errors in the second assert.

[2022-04-25 16:55:03 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=no shard leaders available for channel: by-dev-rootcoord-dml_1_432767386375097665v1, leaders: [8 9], err: fail to Search, QueryNode ID=9, reason=ShardCluster for by-dev-rootcoord-dml_1_432767386375097665v1 replicaID 432767088036937731 is no available)>, <Time:{'RPC start': '2022-04-25 16:55:03.240156', 'RPC error': '2022-04-25 16:55:03.429824'}> (decorators.py:73)

After this chaos test, load operations time out for any other collection, even for collections created after the chaos.

[2022-04-25 17:05:43 - ERROR - pymilvus.decorators]: grpc RpcError: [load_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-04-25 17:05:23.598932', 'gRPC error': '2022-04-25 17:05:43.603717'}> (decorators.py:81)
[2022-04-25 17:06:04 - ERROR - pymilvus.decorators]: grpc RpcError: [load_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-04-25 17:05:44.608748', 'gRPC error': '2022-04-25 17:06:04.612788'}> (decorators.py:81)
[2022-04-25 17:06:25 - ERROR - pymilvus.decorators]: grpc RpcError: [load_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-04-25 17:06:05.618389', 'gRPC error': '2022-04-25 17:06:25.623273'}> (decorators.py:81)
[2022-04-25 17:06:25 - ERROR - ci_test]: Traceback (most recent call last):

Expected Behavior

With multiple replicas, search should still work when one of the replicas is unavailable.

Steps To Reproduce

1. deploy Milvus with 5 querynodes
2. init a collection, insert data, load with 2 replicas, and search (a pymilvus sketch follows this list)
3. first assert: expect all search requests to succeed
4. inject a pod failure into one querynode, so that one replica group becomes unavailable
5. second assert: expect all search requests to succeed, because the remaining group can still serve them
6. delete the chaos and wait for the querynode to become ready
7. third assert: expect all search requests to succeed
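For reference, here is a minimal pymilvus sketch of steps 2, 3, and 5 (the collection name, dimension, entity count, and index parameters are illustrative placeholders, not the values used in the chaos test):

```python
import random

from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    connections,
)

# Connect to the Milvus cluster (host/port are placeholders).
connections.connect("default", host="127.0.0.1", port="19530")

# A small collection with an explicit INT64 primary key and a vector field.
dim = 128
schema = CollectionSchema([
    FieldSchema("pk", DataType.INT64, is_primary=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=dim),
])
collection = Collection("multi_replica_demo", schema)

# Insert data and build an index so the collection can be loaded.
num_entities = 3000
vectors = [[random.random() for _ in range(dim)] for _ in range(num_entities)]
collection.insert([list(range(num_entities)), vectors])
collection.flush()
collection.create_index(
    "embedding",
    {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)

# Load with two replicas, then search; the same kind of search is repeated
# for the first, second, and third asserts around the chaos injection.
collection.load(replica_number=2)
results = collection.search(
    data=[vectors[0]],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=10,
)
assert len(results[0]) == 10
```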

Anything else?

log: test-multi-replicas-04-25-17-20.zip

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

@XuanYang-cn please take a look. /assign @XuanYang-cn

A new try with master-20220426-80ae6de3. Server log: test-multi-replicas-04-26-14-05.zip

client log: client_log.log

before chaos

[2022-04-26 11:58:44,630 - INFO - ci_test]: replicas_info for collection Checker__z4Sv1na3: Replica groups:
- Group: <group_id:432785403215872003>, <group_nodes:(2, 7)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_432785455185395713v0>, <shard_leader:2>, <shard_nodes:[2, 7]>, Shard: <channel_name:by-dev-rootcoord-dml_1_432785455185395713v1>, <shard_leader:7>, <shard_nodes:[7, 2]>]>
- Group: <group_id:432785403215872002>, <group_nodes:(1, 6, 8)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_432785455185395713v0>, <shard_leader:8>, <shard_nodes:[8, 1]>, Shard: <channel_name:by-dev-rootcoord-dml_1_432785455185395713v1>, <shard_leader:1>, <shard_nodes:[1]>]> (test_chaos_multi_replicas.py:188)

during chaos

[2022-04-26 12:01:05,020 - INFO - ci_test]: replicas_info for collection Checker__z4Sv1na3: Replica groups:
- Group: <group_id:432785403215872002>, <group_nodes:(1, 6, 8)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_432785455185395713v0>, <shard_leader:8>, <shard_nodes:[8, 1]>, Shard: <channel_name:by-dev-rootcoord-dml_1_432785455185395713v1>, <shard_leader:1>, <shard_nodes:[1]>]>
- Group: <group_id:432785403215872003>, <group_nodes:(2, 7, 13)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_432785455185395713v0>, <shard_leader:2>, <shard_nodes:[2, 7]>, Shard: <channel_name:by-dev-rootcoord-dml_1_432785455185395713v1>, <shard_leader:7>, <shard_nodes:[7, 2]>]> (test_chaos_multi_replicas.py:212)

after chaos

[2022-04-26 12:03:48,465 - INFO - ci_test]: replicas_info for collection Checker__z4Sv1na3: Replica groups:
- Group: <group_id:432785403215872002>, <group_nodes:(1, 6, 8)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_432785455185395713v0>, <shard_leader:8>, <shard_nodes:[8, 1]>, Shard: <channel_name:by-dev-rootcoord-dml_1_432785455185395713v1>, <shard_leader:1>, <shard_nodes:[1]>]>
- Group: <group_id:432785403215872003>, <group_nodes:(2, 7, 13)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_432785455185395713v0>, <shard_leader:2>, <shard_nodes:[2, 7]>, Shard: <channel_name:by-dev-rootcoord-dml_1_432785455185395713v1>, <shard_leader:7>, <shard_nodes:[7, 2]>]> (test_chaos_multi_replicas.py:242)
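For context, the replica group layout printed in these log snippets can be fetched client-side roughly as follows (a sketch using pymilvus's Collection.get_replicas(); the connection details are placeholders):

```python
from pymilvus import Collection, connections

connections.connect("default", host="127.0.0.1", port="19530")

# Inspect the replica groups of the loaded collection; each group reports its
# query nodes plus the shard leaders, matching the "Replica groups" output above.
collection = Collection("Checker__z4Sv1na3")
for group in collection.get_replicas().groups:
    print(group)
```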

This was fixed by #16653 (60f7fef), which is newer than the tested commit.