milvus: [Bug]: Datacoord restart with error `Failed to create consumer by-dev-rootcoord-dml_111` when running `test_e2e.py` after datanode pod failure chaos
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: master-20220601-f5bd519e
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.1.0.dev66
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
When running `test_e2e.py`, there was an error: `CreateCollection failed: data coord watch channels failed, reason = DataCoord 16 is not ready`. Checking the status of Milvus afterwards showed that datacoord was not ready:
test-datanode-pod-failure-milvus-datacoord-7cb764f6f9-85dcj 0/1 Running 2 16m
Before running `test_e2e.py`, however, it was OK.
The error message in the previous log of datacoord is:
[2022/06/01 19:51:19.075 +00:00] [FATAL] [logutil.go:134] [panic] [recover="\"Failed to create consumer by-dev-rootcoord-dml_111, error = All attempts results:\\nattempt #1:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #2:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #3:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #4:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #5:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #6:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #7:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #8:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #9:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #10:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #11:server error: ConsumerBusy: Exclusive consumer is already connected\\n\""] 
[stack="github.com/milvus-io/milvus/internal/util/logutil.LogPanic\n\t/go/src/github.com/milvus-io/milvus/internal/util/logutil/logutil.go:134\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:965\ngithub.com/milvus-io/milvus/internal/mq/msgstream.(*mqMsgStream).AsConsumerWithPosition\n\t/go/src/github.com/milvus-io/milvus/internal/mq/msgstream/mq_msgstream.go:174\ngithub.com/milvus-io/milvus/internal/mq/msgstream.(*mqMsgStream).AsConsumer\n\t/go/src/github.com/milvus-io/milvus/internal/mq/msgstream/mq_msgstream.go:131\ngithub.com/milvus-io/milvus/internal/mq/msgstream.UnsubscribeChannels\n\t/go/src/github.com/milvus-io/milvus/internal/mq/msgstream/msgstream_util.go:35\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).cleanUpAndDelete\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:612\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).processAck\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:582\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).watchChannelStatesLoop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:663"]
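The panic above is what an exhausted retry loop against an exclusive subscription looks like. The following is a minimal sketch of that failure mode using a toy in-memory topic, not the Pulsar client or Milvus code. `ToyTopic`, `subscribe_with_retry`, and the consumer names are hypothetical, and the issue does not say which client actually held the subscription.

```python
import time


class ConsumerBusy(Exception):
    """Raised when an exclusive subscription already has a consumer."""


class ToyTopic:
    """Toy model of a topic with a single exclusive subscription (assumption)."""

    def __init__(self):
        self.exclusive_holder = None

    def subscribe(self, consumer_name):
        # An exclusive subscription admits one consumer at a time.
        if self.exclusive_holder not in (None, consumer_name):
            raise ConsumerBusy("Exclusive consumer is already connected")
        self.exclusive_holder = consumer_name

    def unsubscribe(self, consumer_name):
        if self.exclusive_holder == consumer_name:
            self.exclusive_holder = None


def subscribe_with_retry(topic, consumer_name, attempts=11, delay_s=0.0):
    """Try to subscribe; collect one error per failed attempt."""
    errors = []
    for i in range(1, attempts + 1):
        try:
            topic.subscribe(consumer_name)
            return errors
        except ConsumerBusy as exc:
            errors.append(f"attempt #{i}:server error: ConsumerBusy: {exc}")
            time.sleep(delay_s)
    # Every attempt failed: surface all of them, as the log line does.
    raise RuntimeError("All attempts results:\n" + "\n".join(errors))
```

As long as the other consumer never releases the subscription, retrying cannot succeed, so every attempt fails with the same error. Once the holder unsubscribes, the first attempt goes through.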
Expected Behavior
All test cases pass.
Steps To Reproduce
see https://github.com/milvus-io/milvus/runs/6696486769?check_suite_focus=true
Milvus Log
Anything else?
No response
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 29 (28 by maintainers)
Commits related to this issue
- Fix datacoord mistaken release a new registered node See also: #17335 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Fix datacoord set wrong state for node registering See also: #17335 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Fix datacoord set wrong state for node registering Fix datacoord datarace See also: #17335 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Fix datacoord set wrong state for node registering (#17376) Fix datacoord datarace See also: #17335 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to milvus-io/milvus by XuanYang-cn 2 years ago
- Cannel timers while adding new timer See also: #17335 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Cancel timers while adding new timer See also: #17335 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Cancel timers while adding new timer (#17511) See also: #17335 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to milvus-io/milvus by XuanYang-cn 2 years ago
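The last commit title names a general pattern: cancel any pending timer for a key before registering a new one, so that a stale timeout cannot fire later. The sketch below illustrates only that pattern and is not the change made in #17511. `ChannelTimers` and its methods are hypothetical names.

```python
import threading


class ChannelTimers:
    """Keeps at most one pending timer per channel (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.timers = {}

    def add(self, channel, delay_s, on_timeout):
        """Register a timer for channel, cancelling any previous one."""
        with self._lock:
            old = self.timers.pop(channel, None)
            if old is not None:
                old.cancel()  # the stale callback will never run
            timer = threading.Timer(delay_s, on_timeout)
            timer.daemon = True
            self.timers[channel] = timer
            timer.start()
        return timer
```

Without the `cancel()` call, both callbacks would eventually run and the older one could act on state that is no longer current.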
The problem is how we define the system’s behaviour when `Seek` takes so long. After discussing with @jiaoew1991 and @congqixia, we agreed that this cannot be recovered automatically and needs people involved. Currently:
`Seek` is unavailable, leading to an abnormal collection.
Here are some solutions from the meeting:
And we still have these problems:
`Seek` in metrics?
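One way to make a slow `Seek` visible, both to operators and to metrics, is to run it under a deadline and record how long it took. The sketch below is an assumption-laden illustration, not something proposed in the discussion: `seek_with_deadline` is a hypothetical helper, and `seek_fn` stands in for whatever call performs the real seek.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def seek_with_deadline(seek_fn, timeout_s):
    """Run seek_fn in a worker thread; return (finished_in_time, elapsed_s)."""
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(seek_fn)
    try:
        future.result(timeout=timeout_s)
        finished = True
    except FutureTimeout:
        # The seek is still running; report it instead of blocking forever.
        finished = False
    elapsed = time.monotonic() - start
    # Do not wait for a stuck seek; the worker finishes on its own.
    pool.shutdown(wait=False)
    return finished, elapsed
```

The elapsed time could be exported as a metric, and a `False` result could raise an alert for manual intervention. Note that the timeout does not stop the underlying seek, it only stops the caller from waiting on it.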