milvus: [Bug]: [chaos]Flush hangs with a high probability after datacoord pod recovered from pod kill
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: master-20220330-b6b3c986
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus-2.0.2.dev10
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
From 2022/03/30 03:53:06.325 +00:00 to 2022/03/30 03:59:14.467 +00:00, the proxy pod log keeps repeating the same GetFlushState request and response for the same segments:
[2022/03/30 03:59:09.870 +00:00] [INFO] [impl.go:3909] ["received get flush state response"] [response="status:<> "]
[2022/03/30 03:59:10.372 +00:00] [INFO] [impl.go:3895] ["received get flush state request"] [request="segmentIDs:432173687748427778 segmentIDs:432173687748427777 segmentIDs:432173845375614977 segmentIDs:432173845375614978 "]
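For context, here is a minimal sketch of the wait-for-flush pattern that produces the log above, written against a hypothetical client interface rather than the real proxy/SDK code: the caller re-issues GetFlushState for the listed segments on a fixed interval and only returns once the response reports them flushed, so if DataCoord never marks the segments flushed after the pod kill, the loop spins until it times out.

```go
package flushwait

import (
	"context"
	"fmt"
	"time"
)

// flushStateResp is a simplified stand-in for the GetFlushState reply seen
// in the log above (not the real milvus proto type).
type flushStateResp struct {
	Flushed bool
}

// flushStateClient is a stand-in for the proxy/SDK call that the log shows
// being issued roughly every 500ms.
type flushStateClient interface {
	GetFlushState(ctx context.Context, segmentIDs []int64) (*flushStateResp, error)
}

// waitForFlush polls until all segments are reported flushed or ctx expires.
// If DataCoord loses or never updates the segments' state after a pod kill,
// Flushed never becomes true and this loop keeps polling until the timeout,
// which matches the "flush hangs" behavior in the report.
func waitForFlush(ctx context.Context, c flushStateClient, segmentIDs []int64) error {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		resp, err := c.GetFlushState(ctx, segmentIDs)
		if err != nil {
			return err
		}
		if resp.Flushed {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("flush wait timed out for segments %v: %w", segmentIDs, ctx.Err())
		case <-ticker.C:
		}
	}
}
```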
Expected Behavior
All test cases in verify_all_collections.py pass.
Steps To Reproduce
see pipeline http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/2406/pipeline
Anything else?
It reproduces with a high probability (3 out of 4 runs).
About this issue
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 25 (23 by maintainers)
New facts: each time it reports, DataNode updates the current total number of rows of the segments, not just the number of newly added rows.
After some careful thought, I think gRPC is not a great idea; it would break the simplicity of using the streaming log (Pulsar). This is why:
There are 2 ways:
Just a rough thought (see the sketch after this comment): the rpc ReportSegmentStates should be idempotent, and num_rows in segmentStates should be the total number of rows. DataCoord should only update a segment's number of rows if the report's ts is greater than the recorded timestamp.
This needs more careful thought, but I have some urgent issues now; I will come back later.
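A minimal sketch of that rough proposal, with hypothetical types and names rather than the actual DataCoord code: the report carries the segment's total row count (matching the "new facts" above), so re-applying the same report leaves the state unchanged, and a timestamp guard keeps a stale or replayed report (for example one delivered late after a pod kill) from overwriting newer state.

```go
package segmeta

import "sync"

// segmentState is a hypothetical, simplified view of what DataCoord would
// track per segment for this discussion: the total row count and the
// timestamp of the report that last set it.
type segmentState struct {
	NumRows      int64
	LastReportTs uint64
}

// segmentStateReport mirrors the idea in the proposal: NumRows is the TOTAL
// number of rows in the segment, not the number of rows added since the
// previous report.
type segmentStateReport struct {
	SegmentID int64
	NumRows   int64
	Ts        uint64
}

type meta struct {
	mu       sync.Mutex
	segments map[int64]*segmentState
}

func newMeta() *meta {
	return &meta{segments: make(map[int64]*segmentState)}
}

// applyReport is idempotent: applying the same report twice leaves the same
// state, and a report with a ts no newer than what is already recorded is
// ignored, so out-of-order or replayed reports cannot roll the count back.
func (m *meta) applyReport(r segmentStateReport) {
	m.mu.Lock()
	defer m.mu.Unlock()
	s, ok := m.segments[r.SegmentID]
	if !ok {
		s = &segmentState{}
		m.segments[r.SegmentID] = s
	}
	if r.Ts <= s.LastReportTs {
		return // stale or duplicate report: keep the newer state
	}
	s.NumRows = r.NumRows // overwrite with the total, do not add a delta
	s.LastReportTs = r.Ts
}
```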
This issue also happened when some other pod was killed, such as datacoord, datanode, or rootcoord.
It failed again in the chaos nightly test http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/2441/pipeline, so this issue is set as urgent.