milvus: [Bug]: Datacoord restart with error `Failed to create consumer by-dev-rootcoord-dml_111` when running `test_e2e.py` after datanode pod failure chaos

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20220601-f5bd519e
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.1.0.dev66
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

When running test_e2e.py, CreateCollection failed with the error `CreateCollection failed: data coord watch channels failed, reason = DataCoord 16 is not ready`. Checking the Milvus status then shows that datacoord was not ready:

test-datanode-pod-failure-milvus-datacoord-7cb764f6f9-85dcj    0/1     Running     2          16m

Before running test_e2e.py, however, the pod was healthy.

The error message in the previous datacoord log is:

[2022/06/01 19:51:19.075 +00:00] [FATAL] [logutil.go:134] [panic] [recover="\"Failed to create consumer by-dev-rootcoord-dml_111, error = All attempts results:\\nattempt #1:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #2:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #3:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #4:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #5:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #6:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #7:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #8:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #9:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #10:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #11:server error: ConsumerBusy: Exclusive consumer is already connected\\n\""] 
[stack="github.com/milvus-io/milvus/internal/util/logutil.LogPanic\n\t/go/src/github.com/milvus-io/milvus/internal/util/logutil/logutil.go:134\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:965\ngithub.com/milvus-io/milvus/internal/mq/msgstream.(*mqMsgStream).AsConsumerWithPosition\n\t/go/src/github.com/milvus-io/milvus/internal/mq/msgstream/mq_msgstream.go:174\ngithub.com/milvus-io/milvus/internal/mq/msgstream.(*mqMsgStream).AsConsumer\n\t/go/src/github.com/milvus-io/milvus/internal/mq/msgstream/mq_msgstream.go:131\ngithub.com/milvus-io/milvus/internal/mq/msgstream.UnsubscribeChannels\n\t/go/src/github.com/milvus-io/milvus/internal/mq/msgstream/msgstream_util.go:35\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).cleanUpAndDelete\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:612\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).processAck\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:582\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).watchChannelStatesLoop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:663"]

Expected Behavior

All test cases pass.

Steps To Reproduce

see https://github.com/milvus-io/milvus/runs/6696486769?check_suite_focus=true

Milvus Log

logs-datanode-pod_failure.zip

Anything else?

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 29 (28 by maintainers)

Most upvoted comments

The problem is: when Seek takes this long, how do we define the system’s behavior?

After discussing with @jiaoew1991 and @congqixia, we agreed that this situation cannot recover automatically and needs human intervention. Currently:

  1. DataCoord keeps restarting, which should be fixed
  2. The channel being Seeked is unavailable, leaving the collection in an abnormal state

Here are some solutions from the meeting:

  1. We need metrics on Seek, with high priority
  2. Instead of panicking, DataCoord tries to mark this channel unavailable (perhaps in 2.1.1), so that other collections and channels keep working

And we still have these problems:

  1. What’s next after we mark a channel unavailable?
  2. What can we do to make things better if we find a long Seek in metrics?