milvus: [Bug]: Datacoord restart with error `Failed to create consumer by-dev-rootcoord-dml_111` when running `test_e2e.py` after datanode pod failure chaos

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20220601-f5bd519e
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.1.0.dev66
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

When running test_e2e.py, CreateCollection failed with the error `CreateCollection failed: data coord watch channels failed, reason = DataCoord 16 is not ready`. Checking the Milvus status then shows that datacoord was not ready:

test-datanode-pod-failure-milvus-datacoord-7cb764f6f9-85dcj    0/1     Running     2          16m

Before running test_e2e.py, however, the pod was healthy.

The error message in the previous datacoord log is:

[2022/06/01 19:51:19.075 +00:00] [FATAL] [logutil.go:134] [panic] [recover="\"Failed to create consumer by-dev-rootcoord-dml_111, error = All attempts results:\\nattempt #1:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #2:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #3:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #4:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #5:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #6:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #7:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #8:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #9:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #10:server error: ConsumerBusy: Exclusive consumer is already connected\\nattempt #11:server error: ConsumerBusy: Exclusive consumer is already connected\\n\""] 
[stack="github.com/milvus-io/milvus/internal/util/logutil.LogPanic\n\t/go/src/github.com/milvus-io/milvus/internal/util/logutil/logutil.go:134\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:965\ngithub.com/milvus-io/milvus/internal/mq/msgstream.(*mqMsgStream).AsConsumerWithPosition\n\t/go/src/github.com/milvus-io/milvus/internal/mq/msgstream/mq_msgstream.go:174\ngithub.com/milvus-io/milvus/internal/mq/msgstream.(*mqMsgStream).AsConsumer\n\t/go/src/github.com/milvus-io/milvus/internal/mq/msgstream/mq_msgstream.go:131\ngithub.com/milvus-io/milvus/internal/mq/msgstream.UnsubscribeChannels\n\t/go/src/github.com/milvus-io/milvus/internal/mq/msgstream/msgstream_util.go:35\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).cleanUpAndDelete\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:612\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).processAck\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:582\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).watchChannelStatesLoop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:663"]

Expected Behavior

All test cases pass.

Steps To Reproduce

see https://github.com/milvus-io/milvus/runs/6696486769?check_suite_focus=true

Milvus Log

logs-datanode-pod_failure.zip

Anything else?

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 29 (28 by maintainers)

Most upvoted comments

The problem is: when Seek takes this long, how do we define the system’s behavior?

After discussing with @jiaoew1991 and @congqixia, we agreed that this situation cannot recover automatically and needs human intervention. Currently:

  1. DataCoord keeps restarting, which should be fixed
  2. The channel being Seeked is unavailable, leaving the collection in an abnormal state

Here are some solutions from the meeting:

  1. We need metrics on Seek, with high priority
  2. Instead of panicking, DataCoord tries to mark this channel unavailable (perhaps in 2.1.1), so that other collections and channels keep working

And we still have these problems:

  1. What’s next after we mark a channel unavailable?
  2. What can we do to make things better if we find a long Seek in metrics?