milvus: [Bug]: datacoord restarts abnormally, followed by continuous restarts of all datanodes
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: v2.2.10
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory:
- GPU:
- Others:
Current Behavior
The number of collections keeps growing and is currently around 3k, each with very few entities. datacoord restarts abnormally for reasons we do not understand; afterwards the datanodes keep restarting, and the same problem recurs every so often.
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
datacoord
[2023/07/01 07:23:56.510 +00:00] [ERROR] [datacoord/handler.go:369] ["datacoord ServerHandler HasCollection finally failed"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).HasCollection\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:369\ngithub.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).CheckShouldDropChannel\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:396\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).isMarkedDrop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:874\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).CleanupAndReassign\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:815\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).processAck\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:638\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).watchChannelStatesLoop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:680"]
[2023/07/01 07:23:56.510 +00:00] [FATAL] [logutil/logutil.go:134] [panic] [recover="\"datacoord ServerHandler HasCollection finally failed\""] [stack="github.com/milvus-io/milvus/internal/util/logutil.LogPanic\n\t/go/src/github.com/milvus-io/milvus/internal/util/logutil/logutil.go:134\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:838\ngithub.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).HasCollection\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:370\ngithub.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).CheckShouldDropChannel\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:396\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).isMarkedDrop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:874\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).CleanupAndReassign\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:815\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).processAck\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:638\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).watchChannelStatesLoop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:680"]
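The stack shows CheckShouldDropChannel calling HasCollection, which gives up after its retries and then panics, taking the whole datacoord process down (which is what triggers the restart). Below is a minimal Go sketch of that retry-then-panic pattern; the function names, timeout, and signatures are illustrative assumptions, not the actual handler.go code.

```go
// Hypothetical sketch of the retry-then-panic pattern the datacoord stack
// trace suggests; names and timeouts are assumptions, not Milvus internals.
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// hasCollectionWithRetry stands in for ServerHandler.HasCollection: it keeps
// asking whether the collection exists and gives up after a deadline.
func hasCollectionWithRetry(ctx context.Context, check func(context.Context) (bool, error)) bool {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	for {
		ok, err := check(ctx)
		if err == nil {
			return ok
		}
		select {
		case <-ctx.Done():
			// This is the point the log shows: after the final retry fails,
			// the code panics instead of returning an error, so the whole
			// process exits and is restarted.
			log.Panic("datacoord ServerHandler HasCollection finally failed")
		case <-time.After(200 * time.Millisecond):
			// retry
		}
	}
}

func main() {
	// A check that always fails, e.g. the metadata service is temporarily unreachable.
	failing := func(context.Context) (bool, error) { return false, errors.New("rpc error") }
	_ = hasCollectionWithRetry(context.Background(), failing)
}
```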
dataNode
:00] [INFO] [datanode/channel_meta.go:435] ["Successfully load pk stats"] [segmentID=442508822576174980] [time=7.25618ms] [size=12]
[2023/07/01 07:34:38.571 +00:00] [INFO] [datanode/channel_meta.go:217] ["adding segment"] [type=Flushed] [segmentID=442508822576253842] [collectionID=442508822576253786] [partitionID=442508822576253787] [channel=by-dev-rootcoord-dml_4_442508822576253786v1] [startPosition=<nil>] [endPosition=<nil>] [recoverTs=442550673594384447] [importing=false]
[2023/07/01 07:34:38.571 +00:00] [INFO] [datanode/channel_meta.go:374] ["begin to init pk bloom filter"] [segmentID=442508822576253842] [statsBinLogsLen=1]
time="2023-07-01T07:34:38Z" level=error msg="[Failed to create consumer]" consumerID=1607 error="server error: ConsumerBusy: Exclusive consumer is already connected" name=jmpmn subscription=by-dev-dataNode-706-by-dev-rootcoord-dml_4_442531371927515286v1 topic="persistent://public/default/by-dev-rootcoord-dml_4"
time="2023-07-01T07:34:38Z" level=error msg="[Failed to create consumer]" consumerID=1607 error="server error: ConsumerBusy: Exclusive consumer is already connected" name=jmpmn subscription=by-dev-dataNode-706-by-dev-rootcoord-dml_4_442531371927515286v1 topic="persistent://public/default/by-dev-rootcoord-dml_4"
[2023/07/01 07:34:38.571 +00:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="server error: ConsumerBusy: Exclusive consumer is already connected"]
panic: Failed to create consumer by-dev-rootcoord-dml_4, error = All attempts results:
attempt #1:server error: ConsumerBusy: Exclusive consumer is already connected
goroutine 6989315 [running]:
github.com/milvus-io/milvus/internal/mq/msgstream.(*MqTtMsgStream).AsConsumer(0xc0254e1400, {0xc03a0e8490?, 0x1, 0x3?}, {0xc012cb9a00, 0x3f}, 0x2)
/go/src/github.com/milvus-io/milvus/internal/mq/msgstream/mq_msgstream.go:651 +0x285
github.com/milvus-io/milvus/internal/datanode.newDmInputNode({0x4651790, 0xc004be3e80}, 0xc0177c9020, 0xc0279405a0)
/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_dmstream_input_node.go:52 +0x2af
github.com/milvus-io/milvus/internal/datanode.(*dataSyncService).initNodes(0xc0261d9420, 0xc06cb3dad0, 0xc01670f320)
/go/src/github.com/milvus-io/milvus/internal/datanode/data_sync_service.go:291 +0x1b05
github.com/milvus-io/milvus/internal/datanode.newDataSyncService({0x4651790?, 0xc0004fc140?}, 0xc01caa3000, 0xc01670f4a0, {0x467b7f0, 0xc00711c780}, {0x4647700?, 0xc0682eba20}, {0x7f0794aed3d8, 0xc0004e6810}, ...)
/go/src/github.com/milvus-io/milvus/internal/datanode/data_sync_service.go:112 +0x3d3
github.com/milvus-io/milvus/internal/datanode.(*flowgraphManager).addAndStart(0xc0004fc380, 0xc000fefc20, 0xc06cb3dad0, 0x1?, 0x1?)
/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_manager.go:113 +0x3ba
github.com/milvus-io/milvus/internal/datanode.(*DataNode).handlePutEvent(0xc000fefc20, 0xc0279aeeb0, 0x1)
/go/src/github.com/milvus-io/milvus/internal/datanode/data_node.go:406 +0x365
github.com/milvus-io/milvus/internal/datanode.(*channelEventManager).Run.func1()
/go/src/github.com/milvus-io/milvus/internal/datanode/event_manager.go:72 +0x128
created by github.com/milvus-io/milvus/internal/datanode.(*channelEventManager).Run
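For context on the datanode side, "ConsumerBusy: Exclusive consumer is already connected" comes from Pulsar's exclusive-subscription semantics: only one consumer may be connected to a given subscription at a time, so a restarted datanode (or a node the channel was reassigned to) fails if the previous consumer still holds the subscription on the broker. A minimal sketch using the apache pulsar-client-go client is below; the broker URL is a placeholder, and the topic/subscription names are copied from the log purely for illustration.

```go
// Minimal sketch of why a second exclusive consumer is rejected with
// ConsumerBusy; requires a reachable Pulsar broker at the placeholder URL.
package main

import (
	"fmt"
	"log"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{URL: "pulsar://localhost:6650"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	opts := pulsar.ConsumerOptions{
		Topic:            "persistent://public/default/by-dev-rootcoord-dml_4",
		SubscriptionName: "by-dev-dataNode-706-by-dev-rootcoord-dml_4_442531371927515286v1",
		Type:             pulsar.Exclusive,
	}

	// First consumer acquires the exclusive subscription.
	c1, err := client.Subscribe(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer c1.Close()

	// A second consumer on the same subscription is rejected with ConsumerBusy,
	// roughly what happens when a datanode re-subscribes before the old
	// consumer has been released on the broker side.
	if _, err := client.Subscribe(opts); err != nil {
		fmt.Println("second subscribe failed:", err)
	}
}
```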
Anything else?
This issue is very frustrating for us. It did not happen when the number of collections was small, which is strange. Is there a problem with how we are using Milvus? Currently, most of the 3k+ collections are kept in a loaded state, and our code never actively releases them. Could this have an impact? (A release sketch is shown below.)
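If keeping every collection loaded does contribute to the problem, collections that are not being searched can be released and reloaded on demand. Below is a minimal sketch with the Go SDK (milvus-sdk-go/v2); the address and collection name are placeholders, and this only illustrates the API, it is not a confirmed fix for this issue.

```go
// Illustrative sketch: release a collection that is not currently queried.
package main

import (
	"context"
	"log"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
)

func main() {
	ctx := context.Background()

	// Placeholder address; point this at the actual Milvus proxy.
	c, err := client.NewGrpcClient(ctx, "localhost:19530")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Release a collection that is not currently being searched so it stops
	// occupying querynode memory. It can be loaded again before the next
	// query with c.LoadCollection(ctx, "my_collection", false).
	if err := c.ReleaseCollection(ctx, "my_collection"); err != nil {
		log.Fatal(err)
	}
}
```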
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 38 (21 by maintainers)
The previous default value was 256, but we found that for many users this was too large and wasted resources, so we lowered it. You can adjust this value according to your needs.
I will write a bash script to do this tomorrow