milvus: [Bug]: With standby mode enabled, a primary node crash causes the error: role xxxxcoord[nodeID: xxx] is not serving, reason: StandBy.

Is there an existing issue for this?

I have searched the existing issues

Environment

- Milvus version: 2.2.10
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.2.12
- OS(Ubuntu or CentOS): redhat
- CPU/Memory: 16c 64G
- GPU: 
- Others:

Current Behavior

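The coord record in etcd has already switched to the standby node, but the clients that connect to the coord (proxy, indexnode, datanode, etc.) still connect to the node that crashed and restarted. After the restart, that node is no longer the primary.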

Expected Behavior

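After the primary node crashes, the system switches to a standby node automatically and the application keeps working normally.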

Steps To Reproduce

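1. Deploy with k8s. Pin each coord deployment to a fixed node and set the network mode to hostNetwork so that the IP does not change after a restart.
2. Run every coord with one primary and two standby replicas.
3. Manually delete the primary coord node (it restarts automatically).
4. Requests from the application fail with an error.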

Milvus Log

io.milvus.client.AbstractMilvusGrpcClient.logError(AbstractMilvusGrpcClient.java:3064):QueryRequest failed:
checkIfLoaded failed when query, collection:test_0803_1, partitions:[], err = GetCollectionInfo failed, collection = test_0803_1, err = role querycoord[nodeID: 117] is not serving, reason: StandBy

Anything else?

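Guess: Milvus service discovery appears to be triggered only at service startup and when a gRPC connection error occurs. If the primary node restarts and a standby becomes the primary, the clients (proxy, indexnode, etc.) still connect to the original node. The gRPC connection itself reports no error, but the original node's role has changed from primary to standby, so the application fails when it uses the service.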
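The guess above can be illustrated with a toy model. This is only a sketch of the suspected failure mode, not Milvus code: `Coord`, `Client`, `simulate_failover`, the `registry` dict standing in for etcd, and the addresses are all hypothetical names made up for illustration.

```python
# Toy model of the suspected failure mode; none of these names are Milvus APIs.

class Coord:
    def __init__(self, address, active):
        self.address = address
        self.active = active
        self.up = True

    def handle(self, request):
        if not self.up:
            raise ConnectionError(f"{self.address} unreachable")
        if not self.active:
            # The connection works, but the node refuses to serve.
            return "error: not serving, reason: StandBy"
        return f"ok: {request}"


class Client:
    """Re-discovers the coord only at startup and on a connection error."""

    def __init__(self, registry, nodes):
        self.registry = registry          # stands in for etcd: {"active": address}
        self.nodes = nodes                # address -> Coord
        self.target = registry["active"]  # discovery at startup

    def call(self, request):
        try:
            return self.nodes[self.target].handle(request)
        except ConnectionError:
            self.target = self.registry["active"]  # discovery on error
            return self.nodes[self.target].handle(request)


def simulate_failover():
    a = Coord("10.0.0.1:19531", active=True)
    b = Coord("10.0.0.2:19531", active=False)
    registry = {"active": a.address}
    client = Client(registry, {a.address: a, b.address: b})
    before = client.call("query")

    # The primary is deleted, the standby takes over, and the old primary
    # comes back on the same IP as a standby before the client calls again.
    a.up, a.active = False, False
    b.active = True
    registry["active"] = b.address
    a.up = True

    after = client.call("query")
    return before, after, client.target
```

In this model the second call never raises a connection error, so the client never re-reads the registry and keeps talking to the restarted node, which now answers with the StandBy error.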

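Suggestion: the service discovery mechanism needs some optimization. Besides running discovery at restart and on errors, it should query etcd every few seconds. Otherwise it reacts very slowly to a primary/standby switchover, and clients can end up connected to a standby node for all sorts of reasons.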
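A minimal sketch of the suggested periodic check, assuming the caller supplies a `fetch_active_address` function that reads the active coord address from etcd and an `on_change` callback that reconnects. Both callables, the `ActiveCoordWatcher` class, and the 3-second default are hypothetical; this is not how Milvus implements discovery.

```python
import threading


class ActiveCoordWatcher:
    """Polls a lookup function and reports when the active address changes."""

    def __init__(self, fetch_active_address, on_change, interval_s=3.0):
        self._fetch = fetch_active_address  # hypothetical: reads the record from etcd
        self._on_change = on_change         # hypothetical: reconnects the client
        self._interval = interval_s
        self._current = None
        self._stop = threading.Event()
        self._thread = None

    def poll_once(self):
        """Do one lookup; return True if the active address changed."""
        latest = self._fetch()
        if latest != self._current:
            previous, self._current = self._current, latest
            self._on_change(previous, latest)
            return True
        return False

    def _run(self):
        while not self._stop.is_set():
            try:
                self.poll_once()
            except Exception:
                pass  # keep polling even if a single lookup fails
            self._stop.wait(self._interval)

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Polling keeps the sketch simple and matches the suggestion as written. etcd also provides a watch API that pushes key changes to clients, which would react faster than a fixed polling interval.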

About this issue

State: closed
Created a year ago
Comments: 15 (9 by maintainers)

Most upvoted comments

@obailiumingo Nice analysis. I think it is because you bind the service to a fixed IP, so the client won't get an error when it connects to the old IP and won't switch. We will think about it.

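Great minds think alike. Restarting the service and binding it to a static IP both have their purposes. I haven't read the rest of the milvus code closely yet, but besides the situation I ran into, other people may hit the StandBy error for different reasons. Looking at the mechanism, making service discovery more responsive may be the key to thoroughly solving this class of problems.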

obailiumingo on Aug 9, 2023

@yanliang567 please help to make a verify for this issue.

bigsheeper on Aug 23, 2023