milvus: [Bug]: [flush state] unflushed segment ids loop on restart and non-responsive to gRPC client

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.0.2
- Deployment mode(standalone or cluster): standalone
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.0.1
- OS(Ubuntu or CentOS): Ubuntu and CentOS
- CPU/Memory: 64 GB
- GPU: N/A
- Others:

Current Behavior

I created six collections in my app, and everything works well the first time the data are inserted into the index. However, when I restart the app, the application hangs. Checking Milvus, I found that it keeps repeating the log lines below:

[2022/04/09 09:02:41.683 +00:00] [INFO] [services.go:954] ["[flush state] unflushed segment ids"] [segmentIDs="[432402286159069187,432400859715338241,432402286159069185,432400859715338243,432400863608438786,432402286159069188,432400859715338244,432402286159069186,432400863621545985,432400863608438785,432400863621545986,432400859715338242,432400863621545987,432400859715338245,432402286159069189]"] [len=15]
[2022/04/09 09:02:41.683 +00:00] [INFO] [impl.go:3924] ["received get flush state response"] [response="status:<> "]
[2022/04/09 09:02:41.835 +00:00] [INFO] [impl.go:3910] ["received get flush state request"] 

Eventually, errors appear in the logs, as shown below:

[2022/04/09 09:02:41.683 +00:00] [INFO] [services.go:931] ["received get flush state request"] [segmentIDs="[432402286159069187,432400859715338241,432402286159069185,432400859715338243,432400863608438786,432402286159069188,432400859715338244,432402286159069186,432400863621545985,432400863608438785,432400863621545986,432400859715338242,432400863621545987,432400859715338245,432402286159069189]"] [len=15]
[2022/04/09 09:02:41.683 +00:00] [INFO] [services.go:954] ["[flush state] unflushed segment ids"] [segmentIDs="[432402286159069187,432400859715338241,432402286159069185,432400859715338243,432400863608438786,432402286159069188,432400859715338244,432402286159069186,432400863621545985,432400863608438785,432400863621545986,432400859715338242,432400863621545987,432400859715338245,432402286159069189]"] [len=15]
[2022/04/09 09:02:41.683 +00:00] [INFO] [impl.go:3924] ["received get flush state response"] [response="status:<> "]
[2022/04/09 09:02:41.835 +00:00] [INFO] [impl.go:3910] ["received get flush state request"] [request="segmentIDs:432402286159069187 segmentIDs:432400859715338241 segmentIDs:432402286159069185 segmentIDs:432400859715338243 segmentIDs:432400863608438786 segmentIDs:432402286159069188 segmentIDs:432400859715338244 segmentIDs:432402286159069186 segmentIDs:432400863621545985 segmentIDs:432400863608438785 segmentIDs:432400863621545986 segmentIDs:432400859715338242 segmentIDs:432400863621545987 segmentIDs:432400859715338245 segmentIDs:432402286159069189 "]
[2022/04/09 09:02:41.835 +00:00] [INFO] [services.go:931] ["received get flush state request"] [segmentIDs="[432402286159069187,432400859715338241,432402286159069185,432400859715338243,432400863608438786,432402286159069188,432400859715338244,432402286159069186,432400863621545985,432400863608438785,432400863621545986,432400859715338242,432400863621545987,432400859715338245,432402286159069189]"] [len=15]
[2022/04/09 09:02:41.835 +00:00] [INFO] [services.go:954] ["[flush state] unflushed segment ids"] [segmentIDs="[432402286159069187,432400859715338241,432402286159069185,432400859715338243,432400863608438786,432402286159069188,432400859715338244,432402286159069186,432400863621545985,432400863608438785,432400863621545986,432400859715338242,432400863621545987,432400859715338245,432402286159069189]"] [len=15]
[2022/04/09 09:02:41.836 +00:00] [INFO] [impl.go:3924] ["received get flush state response"] [response="status:<> "]
[2022/04/09 09:02:43.132 +00:00] [INFO] [grpclog.go:37] ["[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\""]
[2022/04/09 09:02:43.132 +00:00] [WARN] [impl.go:2710] ["Query failed to WaitToFinish"] [error="proxy TaskCondition context Done"] [traceID=2e560045c8ccdd14] [role=proxy] [MsgID=432403965796155561] [BeginTs=432404904322793478] [EndTs=432404904322793478] [db=] [collection=documents$word2vec_2617e5cf327e60cc8955189110e7f21d] [partitions="[]"]
[2022/04/09 09:02:43.132 +00:00] [ERROR] [task_scheduler.go:501] ["Failed to post-execute task: queryTask:wait to finish failed, timeout : 432403965796155561"] [traceID=2e560045c8ccdd14] [stack="github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:501"]
[2022/04/09 09:02:43.134 +00:00] [INFO] [grpclog.go:37] ["[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\""]
[2022/04/09 09:02:43.134 +00:00] [INFO] [grpclog.go:37] ["[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\""]
[2022/04/09 09:02:43.170 +00:00] [WARN] [impl.go:2710] ["Query failed to WaitToFinish"] [error="proxy TaskCondition context Done"] [traceID=878e022f4080ba6] [role=proxy] [MsgID=432403965796155551] [BeginTs=432404904283471883] [EndTs=432404904283471883] [db=] [collection=indicators$indicator_854ae5f9cdda093265212c435d1ddfd4] [partitions="[]"]
[2022/04/09 09:02:43.170 +00:00] [INFO] [grpclog.go:37] ["[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\""]
[2022/04/09 09:02:43.170 +00:00] [ERROR] [task_scheduler.go:501] ["Failed to post-execute task: queryTask:wait to finish failed, timeout : 432403965796155551"] [traceID=878e022f4080ba6] [stack="github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:501"]

To recover, I had to delete the volumes directory, remove the milvus, minio, and etcd containers, and then regenerate the vectors and insert them back into the index.

Expected Behavior

I expect the Milvus service to be available on startup. It should not hang for hours with no certainty about whether it will ever respond to clients.
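
In the meantime, I'm guarding our application startup with explicit client-side timeouts so that it at least fails fast instead of hanging indefinitely. This is only a minimal sketch of that guard, assuming pymilvus v2.0.x; the host, port, retry values, and the wait_for_milvus helper are placeholders of mine, not anything provided by Milvus:

import time

from pymilvus import connections, utility

MILVUS_HOST = "localhost"  # placeholder; adjust to your deployment
MILVUS_PORT = "19530"


def wait_for_milvus(retries=5, delay=5.0, timeout=10.0):
    """Fail fast if Milvus does not answer basic requests after a restart."""
    connections.connect(alias="default", host=MILVUS_HOST, port=MILVUS_PORT)
    for attempt in range(1, retries + 1):
        try:
            # A cheap metadata call; if the proxy is stuck, this should
            # time out instead of blocking the whole application startup.
            utility.list_collections(timeout=timeout, using="default")
            return
        except Exception as exc:
            print(f"Milvus not ready (attempt {attempt}/{retries}): {exc}")
            time.sleep(delay)
    raise RuntimeError("Milvus did not become responsive; aborting startup")

This only limits the damage on our side; it does not make the stuck flush state go away.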

Steps To Reproduce

I'm still working on a minimal reproducible example. The hang occurs regularly in my application, but I haven't been able to reproduce it with a lighter example yet. In any case, this is a blocker for our application launch.
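
For reference, the skeleton I'm using to try to reproduce it looks roughly like the following; it mirrors what the app does (several collections, insert vectors, restart the standalone container, query again). The collection names, dimension, and schema are placeholders, not the real ones from my app:

import random

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(alias="default", host="localhost", port="19530")

DIM = 128  # placeholder dimension
fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIM),
]
schema = CollectionSchema(fields)

# Create several collections and insert a batch into each, as the app does.
for i in range(6):
    coll = Collection(name=f"repro_collection_{i}", schema=schema)
    pks = list(range(1000))
    vectors = [[random.random() for _ in range(DIM)] for _ in range(1000)]
    coll.insert([pks, vectors])
    # As far as I understand, reading num_entities triggers a flush in
    # pymilvus 2.0.x, which is the code path the looping log lines come from.
    print(coll.num_entities)

# Next step: restart the milvus-standalone container and load/query one of
# these collections; in my real app this is the point where it hangs.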

Anything else?

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 24 (11 by maintainers)

Most upvoted comments

Same case, any update on this?

Hey, we just fixed this and are currently verifying it.

Expect this fix during the upcoming 2.1 release.