milvus: [Bug]: [chaos] Flush hangs with a high probability after datacoord pod recovered from pod kill

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20220330-b6b3c986
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus-2.0.2.dev10
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

From 2022/03/30 03:53:06.325 +00:00 to 2022/03/30 03:59:14.467 +00:00, the proxy pod's log output consists entirely of the same GetFlushState request/response pair repeating for the same segment IDs:

[2022/03/30 03:59:09.870 +00:00] [INFO] [impl.go:3909] ["received get flush state response"] [response="status:<> "]
[2022/03/30 03:59:10.372 +00:00] [INFO] [impl.go:3895] ["received get flush state request"] [request="segmentIDs:432173687748427778 segmentIDs:432173687748427777 segmentIDs:432173845375614977 segmentIDs:432173845375614978 "]
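
For context: a flush resolves only when DataCoord reports every segment flushed, so the proxy keeps polling GetFlushState. Below is a minimal Go sketch of such a polling loop, illustrative only; getFlushState, waitForFlush, and the 500ms interval (inferred from the log timestamps) are assumptions, not the actual proxy code.

package proxy

import (
    "context"
    "log"
    "time"
)

// getFlushState stands in for the GetFlushState RPC to DataCoord; it
// reports whether all of the given segments have been flushed.
func getFlushState(ctx context.Context, segmentIDs []int64) (bool, error) {
    // ... issue the RPC and inspect the response ...
    return false, nil
}

// waitForFlush polls every 500ms until all segments report flushed.
// If DataCoord lost the segments' row-count updates during the pod
// kill, it never marks them flushed, and this loop spins forever,
// producing exactly the repeated log lines above.
func waitForFlush(ctx context.Context, segmentIDs []int64) error {
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            flushed, err := getFlushState(ctx, segmentIDs)
            if err != nil {
                log.Printf("get flush state failed: %v", err)
                continue
            }
            if flushed {
                return nil
            }
        }
    }
}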

Expected Behavior

All test cases in verify_all_collections.py should pass.

Steps To Reproduce

see pipeline http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/2406/pipeline

Anything else?

logs: http://10.100.32.144:8080/job/chaos-test/2406/artifact/artifacts-datacoord-pod-kill-2406-server-logs.tar.gz

It reproduces with a high probability (3 out of 4 runs).

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 25 (23 by maintainers)

Most upvoted comments

New fact: each time, DataNode reports the current total number of rows in a segment, not just the number of newly added rows. For example, if a segment had 100 rows and 20 more are inserted, DataNode reports 120; a duplicated or replayed report therefore does no harm.

After some careful thought, I think gRPC is not a great idea: it would break the simplicity of using the streaming log (Pulsar). Here's why:

  • Using gRPC breaks the message ordering in Pulsar. Since num_rows updates are combined with time ticks, extra logic is needed to handle out-of-order delivery.
  • Using gRPC makes DataNode stateful, which becomes very complex once failover cases are considered.

There’re 2 ways:

  1. make DataCoord subscribe to the correct position of TimeTickChannel, since the updates must be received.
  2. make DataNode be aware of the DataCoord recovery and resend every un-flushed segments updates.
  • 1 shard will re-send updates once.
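
A rough sketch of option 1, assuming DataCoord persists a checkpoint of its TimeTickChannel position (e.g. in etcd); it uses the pulsar-client-go Seek API, and the topic name, subscription name, and loadCheckpoint helper are illustrative assumptions, not Milvus's actual identifiers.

package main

import (
    "log"

    "github.com/apache/pulsar-client-go/pulsar"
)

// loadCheckpoint stands in for reading the last-processed
// TimeTickChannel position that DataCoord persisted before dying.
func loadCheckpoint() []byte {
    // ... read the serialized message ID from the metastore ...
    return nil
}

func main() {
    client, err := pulsar.NewClient(pulsar.ClientOptions{
        URL: "pulsar://localhost:6650",
    })
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    consumer, err := client.Subscribe(pulsar.ConsumerOptions{
        Topic:            "by-dev-timetick", // illustrative topic name
        SubscriptionName: "datacoord",
    })
    if err != nil {
        log.Fatal(err)
    }
    defer consumer.Close()

    // On recovery, seek back to the checkpointed position so the
    // segment updates published while DataCoord was down are replayed
    // instead of silently skipped.
    if raw := loadCheckpoint(); raw != nil {
        msgID, err := pulsar.DeserializeMessageID(raw)
        if err != nil {
            log.Fatal(err)
        }
        if err := consumer.Seek(msgID); err != nil {
            log.Fatal(err)
        }
    }
}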

Just a rough thought:

rpc ReportSegmentsStates(ReportSegmentsStatesRequest) returns (ReportSegmentsStatesResponse);

message ReportSegmentsStatesRequest {
  uint64 timestamp = 1;
  repeated SegmentStates segment_states = 2;
}

message SegmentStates {
  int64 segmentID = 1;
  int64 num_rows = 2; // total rows in the segment, not a delta
}

rpc ReportSegmentsStates should be idempotent, and num_rows in each SegmentStates entry should be the total number of rows, not a delta.

DataCoord only updates a segment's row count if its current ts is greater than the request's timestamp.
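
A sketch of what that rule could look like on the DataCoord side; the Go below is hypothetical, mirroring the proto sketch above rather than the actual Milvus code.

package datacoord

import (
    "errors"
    "sync"
)

// SegmentStates mirrors the proto message above.
type SegmentStates struct {
    SegmentID int64
    NumRows   int64 // total rows, not a delta
}

type segmentMeta struct {
    numRows  int64
    lastSeen uint64 // timestamp of the last applied report
}

type dataCoord struct {
    mu        sync.Mutex
    currentTS func() uint64 // stand-in for the TSO / time-tick clock
    segments  map[int64]*segmentMeta
}

// errTimestampAhead signals DataNode to retry later.
var errTimestampAhead = errors.New("report timestamp ahead of current ts")

func (dc *dataCoord) ReportSegmentsStates(ts uint64, states []SegmentStates) error {
    dc.mu.Lock()
    defer dc.mu.Unlock()
    // Reject reports stamped ahead of DataCoord's clock; the caller
    // retries once its timestamp is no longer in the future.
    if ts >= dc.currentTS() {
        return errTimestampAhead
    }
    for _, s := range states {
        meta, ok := dc.segments[s.SegmentID]
        if !ok {
            meta = &segmentMeta{}
            dc.segments[s.SegmentID] = meta
        }
        if ts <= meta.lastSeen {
            continue // stale or duplicated report: safe to drop
        }
        meta.numRows = s.NumRows // total, so last write wins
        meta.lastSeen = ts
    }
    return nil
}

Because num_rows is a total rather than a delta, a last-write-wins update keyed by timestamp is enough to make retries and duplicated reports harmless.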

  • DataNode will retry the RPC if it fails because its timestamp is ahead of DataCoord's current ts.
  • DataNode should combine several insert updates for a single segment into one RPC, sending the batch when either of the following is reached (see the sketch after this list):
    • 10 insert updates have accumulated
    • 5s have passed since the first update
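
A hypothetical sketch of that batching policy on the DataNode side; the structure and names are assumptions, not the actual implementation.

package datanode

import (
    "sync"
    "time"
)

// reportFunc sends one ReportSegmentsStates RPC for a segment,
// carrying its current total row count.
type reportFunc func(segmentID, totalRows int64)

type segmentBatcher struct {
    mu        sync.Mutex
    segmentID int64
    totalRows int64     // running total, matching the RPC semantics
    pending   int       // insert updates buffered since the last report
    firstAt   time.Time // when the current batch started
    report    reportFunc
}

// OnInsert records one insert update and sends the batch once 10
// updates have accumulated; a timer covers the 5s condition.
func (b *segmentBatcher) OnInsert(rows int64) {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.pending == 0 {
        b.firstAt = time.Now()
        time.AfterFunc(5*time.Second, b.flushIfStale)
    }
    b.totalRows += rows
    b.pending++
    if b.pending >= 10 {
        b.flushLocked()
    }
}

// flushIfStale fires 5s after a batch starts and sends whatever is
// still buffered for that batch.
func (b *segmentBatcher) flushIfStale() {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.pending > 0 && time.Since(b.firstAt) >= 5*time.Second {
        b.flushLocked()
    }
}

func (b *segmentBatcher) flushLocked() {
    b.report(b.segmentID, b.totalRows)
    b.pending = 0
}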

This needs more careful thought, but I have some urgent issues to handle now; I'll come back to it later.

It failed again in the chaos nightly test: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/2441/pipeline

So I'm setting this issue as urgent.