milvus: [Bug]: datacoord restarts abnormally, followed by continuous restarts of all datanodes

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.2.10
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

As the number of collections keeps growing (currently around 3k, each with very few entities), datacoord restarts abnormally for no apparent reason. Afterwards, the datanodes keep restarting as well, and the same problem recurs periodically.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

datacoord

[2023/07/01 07:23:56.510 +00:00] [ERROR] [datacoord/handler.go:369] ["datacoord ServerHandler HasCollection finally failed"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).HasCollection\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:369\ngithub.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).CheckShouldDropChannel\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:396\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).isMarkedDrop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:874\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).CleanupAndReassign\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:815\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).processAck\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:638\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).watchChannelStatesLoop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:680"]
[2023/07/01 07:23:56.510 +00:00] [FATAL] [logutil/logutil.go:134] [panic] [recover="\"datacoord ServerHandler HasCollection finally failed\""] [stack="github.com/milvus-io/milvus/internal/util/logutil.LogPanic\n\t/go/src/github.com/milvus-io/milvus/internal/util/logutil/logutil.go:134\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:838\ngithub.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).HasCollection\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:370\ngithub.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).CheckShouldDropChannel\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:396\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).isMarkedDrop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:874\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).CleanupAndReassign\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:815\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).processAck\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:638\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).watchChannelStatesLoop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:680"]

dataNode

:00] [INFO] [datanode/channel_meta.go:435] ["Successfully load pk stats"] [segmentID=442508822576174980] [time=7.25618ms] [size=12]
[2023/07/01 07:34:38.571 +00:00] [INFO] [datanode/channel_meta.go:217] ["adding segment"] [type=Flushed] [segmentID=442508822576253842] [collectionID=442508822576253786] [partitionID=442508822576253787] [channel=by-dev-rootcoord-dml_4_442508822576253786v1] [startPosition=<nil>] [endPosition=<nil>] [recoverTs=442550673594384447] [importing=false]
[2023/07/01 07:34:38.571 +00:00] [INFO] [datanode/channel_meta.go:374] ["begin to init pk bloom filter"] [segmentID=442508822576253842] [statsBinLogsLen=1]
time="2023-07-01T07:34:38Z" level=error msg="[Failed to create consumer]" consumerID=1607 error="server error: ConsumerBusy: Exclusive consumer is already connected" name=jmpmn subscription=by-dev-dataNode-706-by-dev-rootcoord-dml_4_442531371927515286v1 topic="persistent://public/default/by-dev-rootcoord-dml_4"
time="2023-07-01T07:34:38Z" level=error msg="[Failed to create consumer]" consumerID=1607 error="server error: ConsumerBusy: Exclusive consumer is already connected" name=jmpmn subscription=by-dev-dataNode-706-by-dev-rootcoord-dml_4_442531371927515286v1 topic="persistent://public/default/by-dev-rootcoord-dml_4"
[2023/07/01 07:34:38.571 +00:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="server error: ConsumerBusy: Exclusive consumer is already connected"]
panic: Failed to create consumer by-dev-rootcoord-dml_4, error = All attempts results:
attempt #1:server error: ConsumerBusy: Exclusive consumer is already connected


goroutine 6989315 [running]:
github.com/milvus-io/milvus/internal/mq/msgstream.(*MqTtMsgStream).AsConsumer(0xc0254e1400, {0xc03a0e8490?, 0x1, 0x3?}, {0xc012cb9a00, 0x3f}, 0x2)
	/go/src/github.com/milvus-io/milvus/internal/mq/msgstream/mq_msgstream.go:651 +0x285
github.com/milvus-io/milvus/internal/datanode.newDmInputNode({0x4651790, 0xc004be3e80}, 0xc0177c9020, 0xc0279405a0)
	/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_dmstream_input_node.go:52 +0x2af
github.com/milvus-io/milvus/internal/datanode.(*dataSyncService).initNodes(0xc0261d9420, 0xc06cb3dad0, 0xc01670f320)
	/go/src/github.com/milvus-io/milvus/internal/datanode/data_sync_service.go:291 +0x1b05
github.com/milvus-io/milvus/internal/datanode.newDataSyncService({0x4651790?, 0xc0004fc140?}, 0xc01caa3000, 0xc01670f4a0, {0x467b7f0, 0xc00711c780}, {0x4647700?, 0xc0682eba20}, {0x7f0794aed3d8, 0xc0004e6810}, ...)
	/go/src/github.com/milvus-io/milvus/internal/datanode/data_sync_service.go:112 +0x3d3
github.com/milvus-io/milvus/internal/datanode.(*flowgraphManager).addAndStart(0xc0004fc380, 0xc000fefc20, 0xc06cb3dad0, 0x1?, 0x1?)
	/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_manager.go:113 +0x3ba
github.com/milvus-io/milvus/internal/datanode.(*DataNode).handlePutEvent(0xc000fefc20, 0xc0279aeeb0, 0x1)
	/go/src/github.com/milvus-io/milvus/internal/datanode/data_node.go:406 +0x365
github.com/milvus-io/milvus/internal/datanode.(*channelEventManager).Run.func1()
	/go/src/github.com/milvus-io/milvus/internal/datanode/event_manager.go:72 +0x128
created by github.com/milvus-io/milvus/internal/datanode.(*channelEventManager).Run
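
The "ConsumerBusy: Exclusive consumer is already connected" error above suggests the broker still holds the exclusive subscription from the previous datanode instance when the restarted datanode tries to subscribe again. A minimal diagnostic sketch from the Pulsar side, assuming access to pulsar-admin on the broker and using the topic and subscription names exactly as they appear in the log excerpt (not confirmed elsewhere):

    # list subscriptions and connected consumers on the DML topic
    pulsar-admin topics stats persistent://public/default/by-dev-rootcoord-dml_4

    # if a stale exclusive subscription is still attached, it can be dropped manually
    pulsar-admin topics unsubscribe \
      --subscription by-dev-dataNode-706-by-dev-rootcoord-dml_4_442531371927515286v1 \
      persistent://public/default/by-dev-rootcoord-dml_4

This is only a way to inspect the stuck subscription; whether it is safe to drop it manually depends on the datanode's recovery state.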

Anything else?

This issue is very frustrating for us. It did not happen when the number of collections was small, which is strange. Is there a problem with the way we are using Milvus? Currently, most of the 3k+ collections are in a loaded state, and our code never actively releases them. Could this have an impact?

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 38 (21 by maintainers)

Most upvoted comments

The previous default value was 256, but we found that for many users this was too large and wasted resources, so we lowered the default. You can adjust this value according to your needs.
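
As a rough sketch of how such an override could look in milvus.yaml: the comment above does not name the exact setting, so the key below is only a guess based on the old default of 256 (rootCoord.dmlChannelNum); substitute whichever setting the maintainers actually mean.

    # milvus.yaml override (key is assumed, not confirmed in this thread)
    rootCoord:
      dmlChannelNum: 16   # previous default was 256; a smaller value reduces idle DML channels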

I will write a bash script to do this tomorrow