milvus: [Bug]: Milvus stops responding after a couple of hours of light load
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: v2.0.2
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.0.2
- OS(Ubuntu or CentOS): Container-Optimized OS with containerd (cos_containerd) ( GCP k8s cluster)
- CPU/Memory: 4 cpu / 27GB memory
- GPU: No
- Others: Running in GCP Kubernetes cluster version 1.20.15-gke.3400
Current Behavior
I’ve loaded approximately 1.4M embeddings of dimension 512 into a new collection and created an RHNSW_PQ index. The collection is then loaded onto 4 query nodes, and I start to loop: for each embedding, search for the 100 most similar candidates.
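For reference, the loop looks roughly like the sketch below in pymilvus 2.0.x. The collection name, field names, metric type, and ef value are assumptions; only the 512-dim vectors, the RHNSW_PQ index, and the top-k of 100 come from the description above. It also assumes the stored vectors are fetched back by primary key, as described below.

```python
# Minimal sketch of the workload (pymilvus 2.0.x); names are assumptions.
from pymilvus import connections, Collection

connections.connect(host="milvus-proxy", port="19530")  # proxy address is an assumption

collection = Collection("embeddings")  # hypothetical collection name
collection.load()                      # served by the 4 query nodes

search_params = {"metric_type": "L2", "params": {"ef": 128}}  # assumed metric and ef

for pk in range(1_400_000):  # loop over the ~1.4M stored embeddings
    # Fetch the stored vector by primary key ...
    rows = collection.query(expr=f"id == {pk}", output_fields=["embedding"])
    if not rows:
        continue
    vector = rows[0]["embedding"]
    # ... then search for the 100 most similar candidates.
    results = collection.search(
        data=[vector],
        anns_field="embedding",
        param=search_params,
        limit=100,
    )
```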
This works fine for the first few hours, after which it simply hangs while fetching new embeddings. This is the log from one of the four query nodes:
l/querynode.(*QueryNode).ReleaseCollection.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:388\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).ReleaseCollection\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:392\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).ReleaseCollection\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/service.go:287\ngithub.com/milvus-io/milvus/internal/proto/querypb._QueryNode_ReleaseCollection_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:3558\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tracing/opentracing/server_interceptors.go:38\ngithub.com/milvus-io/milvus/internal/proto/querypb._QueryNode_ReleaseCollection_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:3560\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934"}
{"level":"INFO","time":"2022/05/06 07:38:42.775 +00:00","caller":"grpclog.go:37","message":"[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\""}
{"level":"WARN","time":"2022/05/06 07:39:31.838 +00:00","caller":"query_service.go:174","message":"stopQueryCollection failed, collection doesn't exist","collectionID":432927636982333441}
{"level":"WARN","time":"2022/05/06 07:39:31.839 +00:00","caller":"task_scheduler.go:63","message":"release collection failed, collectionID = 432927636982333441, err = collection hasn't been loaded or has been released, collection id = 432927636982333441"}
{"level":"ERROR","time":"2022/05/06 07:39:31.839 +00:00","caller":"impl.go:388","message":"release collection failed, collectionID = 432927636982333441, err = collection hasn't been loaded or has been released, collection id = 432927636982333441","stack":"github.com/milvus-io/milvus/internal/querynode.(*QueryNode).ReleaseCollection.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:388\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).ReleaseCollection\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:392\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).ReleaseCollection\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/service.go:287\ngithub.com/milvus-io/milvus/internal/proto/querypb._QueryNode_ReleaseCollection_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:3558\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tracing/opentracing/server_interceptors.go:38\ngithub.com/milvus-io/milvus/internal/proto/querypb._QueryNode_ReleaseCollection_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:3560\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934"}
{"level":"INFO","time":"2022/05/06 07:40:22.943 +00:00","caller":"grpclog.go:37","message":"[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\""}
{"level":"WARN","time":"2022/05/06 07:41:55.851 +00:00","caller":"query_service.go:174","message":"stopQueryCollection failed, collection doesn't exist","collectionID":432927636982333441}
{"level":"WARN","time":"2022/05/06 07:41:55.851 +00:00","caller":"task_scheduler.go:63","message":"release collection failed, collectionID = 432927636982333441, err = collection hasn't been loaded or has been released, collection id = 432927636982333441"}
{"level":"ERROR","time":"2022/05/06 07:41:55.851 +00:00","caller":"impl.go:388","message":"release collection failed, collectionID = 432927636982333441, err = collection hasn't been loaded or has been released, collection id = 432927636982333441","stack":"github.com/milvus-io/milvus/internal/querynode.(*QueryNode).ReleaseCollection.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:388\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).ReleaseCollection\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:392\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).ReleaseCollection\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/service.go:287\ngithub.com/milvus-io/milvus/internal/proto/querypb._QueryNode_ReleaseCollection_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:3558\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tracing/opentracing/server_interceptors.go:38\ngithub.com/milvus-io/milvus/internal/proto/querypb._QueryNode_ReleaseCollection_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:3560\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934"}
{"level":"INFO","time":"2022/05/06 07:42:46.957 +00:00","caller":"grpclog.go:37","message":"[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\""}
{"level":"WARN","time":"2022/05/06 07:45:33.879 +00:00","caller":"query_service.go:174","message":"stopQueryCollection failed, collection doesn't exist","collectionID":432927636982333441}
{"level":"WARN","time":"2022/05/06 07:45:33.879 +00:00","caller":"task_scheduler.go:63","message":"release collection failed, collectionID = 432927636982333441, err = collection hasn't been loaded or has been released, collection id = 432927636982333441"}
{"level":"ERROR","time":"2022/05/06 07:45:33.879 +00:00","caller":"impl.go:388","message":"release collection failed, collectionID = 432927636982333441, err = collection hasn't been loaded or has been released, collection id = 432927636982333441","stack":"github.com/milvus-io/milvus/internal/querynode.(*QueryNode).ReleaseCollection.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:388\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).ReleaseCollection\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:392\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).ReleaseCollection\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/service.go:287\ngithub.com/milvus-io/milvus/internal/proto/querypb._QueryNode_ReleaseCollection_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:3558\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tracing/opentracing/server_interceptors.go:38\ngithub.com/milvus-io/milvus/internal/proto/querypb._QueryNode_ReleaseCollection_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:3560\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934"}
I don’t know what to do when this happens, but I could also see from the index-coordinator that it errored with
{"level":"INFO","time":"2022/05/06 07:28:38.623 +00:00","caller":"grpclog.go:37","message":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004e0e80, {TRANSIENT_FAILURE connection error: desc = \"transport: Error while dialing dial tcp 10.48.10.6:21121: i/o timeout\"}"}
and there is no pod running with IP 10.48.10.6, so I thought I would try restarting all the coordinators. When I did this, the querycoord got stuck in a crash loop:
time="2022-05-06T08:28:05Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" topic="persistent://public/default/milvus-search-0"
panic: Failed to create producer milvus-search-0, error = All attempts results:
attempt #1:server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded
attempt #2:server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded
attempt #3:server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded
attempt #4:server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded
attempt #5:server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded
attempt #6:server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded
attempt #7:server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded
attempt #8:server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded
attempt #9:server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded
attempt #10:server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded
attempt #11:server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded
goroutine 190 [running]:
github.com/milvus-io/milvus/internal/msgstream.(*mqMsgStream).AsProducer(0xc0007f0680, 0xc00063a090, 0x1, 0x1)
/go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:124 +0x278
github.com/milvus-io/milvus/internal/querycoord.(*MetaReplica).getQueryStreamByID(0xc0007140f0, 0x60211645fc80001, 0xc0001563f0, 0xf, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/milvus-io/milvus/internal/querycoord/meta.go:869 +0x3e2
github.com/milvus-io/milvus/internal/querycoord.(*MetaReplica).sendSealedSegmentChangeInfos(0xc0007140f0, 0x60211645fc80001, 0xc0001563f0, 0xf, 0xc000848f80, 0xc0005b2700, 0x10, 0x20)
/go/src/github.com/milvus-io/milvus/internal/querycoord/meta.go:641 +0x6a
github.com/milvus-io/milvus/internal/querycoord.(*MetaReplica).removeGlobalSealedSegInfos(0xc0007140f0, 0x60211645fc80001, 0x0, 0x0, 0x0, 0x40, 0xc000565400, 0x203000)
/go/src/github.com/milvus-io/milvus/internal/querycoord/meta.go:583 +0x332
github.com/milvus-io/milvus/internal/querycoord.updateSegmentInfoFromTask(0x36693d0, 0xc0007cc080, 0x3692720, 0xc0007cc800, 0x36913f8, 0xc0007140f0, 0x0, 0x0)
/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:891 +0xacc
github.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).scheduleLoop(0xc00067cf80)
/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:657 +0xd73
created by github.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).Start
/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:867 +0x65
and the querycoord won’t start back up.
Expected Behavior
I expect Milvus not to hang when I fetch embeddings, whether by primary key or via search. I also expect the Milvus cluster to recover when any of its pods gets rescheduled in the k8s cluster, since pods are ephemeral and can be restarted at any time.
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 20 (11 by maintainers)
The main reason is that we didn’t unregister the search channel, so the search channel backlog filled up.
In 2.1 we removed the search channel, so similar issues should no longer occur. I also want to fix this in 2.0.3, which should be released this week.
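For anyone who wants to confirm that the search-channel backlog is the problem, Pulsar’s admin CLI can report the backlog per subscription. A minimal sketch, assuming you know the Pulsar broker pod name and that pulsar-admin is on its PATH; the topic name comes from the querycoord panic above:

```shell
# Run inside the Pulsar broker pod (pod name is a placeholder):
kubectl exec -it <pulsar-broker-pod> -- \
  pulsar-admin topics stats persistent://public/default/milvus-search-0
# Each subscription in the output reports a "msgBacklog" count; a large value
# here matches the ProducerBlockedQuotaExceededException in the crash loop above.
```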
Hi friends! When are you planning to release a fixed version for this bug? I am having the same problem 😕
It seems that 2.0.2 is not production stable, while I can see movement on the 2.1 RC.
@xiaofan-luan
You can try to use this command:
You can find the subscription name in the topics stats output, such as milvus-queryNode-432927636982333441-11.
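The command itself wasn’t captured in this extract. Based on the follow-up comment below, it is presumably Pulsar’s clear-backlog, which would look roughly like this (the pod name is a placeholder and the subscription is the example above):

```shell
# Clear the backlog for one subscription of the stuck search topic;
# repeat for every subscription listed by "topics stats".
kubectl exec -it <pulsar-broker-pod> -- \
  pulsar-admin topics clear-backlog \
    --subscription milvus-queryNode-432927636982333441-11 \
    persistent://public/default/milvus-search-0
```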
I’ve now been able to run clear-backlog on all 4 subscriptions found under the topic persistent://public/default/milvus-search-0. Then I deleted the querycoord pod so that it restarts, and it now starts up with the following logs. But it seems to start up “ok” anyway; k8s reports it as ready!