milvus: [Bug]: [chaos] Flush hangs with a high probability after datacoord pod recovered from pod kill

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20220330-b6b3c986
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus-2.0.2.dev10
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

From 2022/03/30 03:53:06.325 +00:00 to 2022/03/30 03:59:14.467 +00:00, the proxy pod's log output consists entirely of the same GetFlushState request/response pair repeating for the same segment IDs:

[2022/03/30 03:59:09.870 +00:00] [INFO] [impl.go:3909] ["received get flush state response"] [response="status:<> "]
[2022/03/30 03:59:10.372 +00:00] [INFO] [impl.go:3895] ["received get flush state request"] [request="segmentIDs:432173687748427778 segmentIDs:432173687748427777 segmentIDs:432173845375614977 segmentIDs:432173845375614978 "]
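
For context: a flush resolves only when DataCoord reports every segment flushed, so the proxy keeps polling GetFlushState. Below is a minimal Go sketch of such a polling loop, illustrative only; getFlushState, waitForFlush, and the 500ms interval (inferred from the log timestamps) are assumptions, not the actual proxy code.

package proxy

import (
    "context"
    "log"
    "time"
)

// getFlushState stands in for the GetFlushState RPC to DataCoord; it
// reports whether all of the given segments have been flushed.
func getFlushState(ctx context.Context, segmentIDs []int64) (bool, error) {
    // ... issue the RPC and inspect the response ...
    return false, nil
}

// waitForFlush polls every 500ms until all segments report flushed.
// If DataCoord lost the segments' row-count updates during the pod
// kill, it never marks them flushed, and this loop spins forever,
// producing exactly the repeated log lines above.
func waitForFlush(ctx context.Context, segmentIDs []int64) error {
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            flushed, err := getFlushState(ctx, segmentIDs)
            if err != nil {
                log.Printf("get flush state failed: %v", err)
                continue
            }
            if flushed {
                return nil
            }
        }
    }
}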

Expected Behavior

All test cases in verify_all_collections.py should pass.

Steps To Reproduce

see pipeline http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/2406/pipeline

Anything else?

logs: http://10.100.32.144:8080/job/chaos-test/2406/artifact/artifacts-datacoord-pod-kill-2406-server-logs.tar.gz

It reproduces with a high probability (3 out of 4 runs).

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 25 (23 by maintainers)

Most upvoted comments

New fact: each time, DataNode reports the current total number of rows in a segment, not just the number of newly added rows. For example, if a segment had 100 rows and 20 more are inserted, DataNode reports 120; a duplicated or replayed report therefore does no harm.

After some careful thought, I think gRPC is not a great idea: it would break the simplicity of using the streaming log (Pulsar). Here's why:

  • Using gRPC breaks the message ordering in Pulsar. Since num_rows updates are combined with time ticks, extra logic is needed to handle out-of-order delivery.
  • Using gRPC makes DataNode stateful, which becomes very complex once failover cases are considered.

There’re 2 ways:

  1. make DataCoord subscribe to the correct position of TimeTickChannel, since the updates must be received.
  2. make DataNode be aware of the DataCoord recovery and resend every un-flushed segments updates.
  • 1 shard will re-send updates once.
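
A rough sketch of option 1, assuming DataCoord persists a checkpoint of its TimeTickChannel position (e.g. in etcd); it uses the pulsar-client-go Seek API, and the topic name, subscription name, and loadCheckpoint helper are illustrative assumptions, not Milvus's actual identifiers.

package main

import (
    "log"

    "github.com/apache/pulsar-client-go/pulsar"
)

// loadCheckpoint stands in for reading the last-processed
// TimeTickChannel position that DataCoord persisted before dying.
func loadCheckpoint() []byte {
    // ... read the serialized message ID from the metastore ...
    return nil
}

func main() {
    client, err := pulsar.NewClient(pulsar.ClientOptions{
        URL: "pulsar://localhost:6650",
    })
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    consumer, err := client.Subscribe(pulsar.ConsumerOptions{
        Topic:            "by-dev-timetick", // illustrative topic name
        SubscriptionName: "datacoord",
    })
    if err != nil {
        log.Fatal(err)
    }
    defer consumer.Close()

    // On recovery, seek back to the checkpointed position so the
    // segment updates published while DataCoord was down are replayed
    // instead of silently skipped.
    if raw := loadCheckpoint(); raw != nil {
        msgID, err := pulsar.DeserializeMessageID(raw)
        if err != nil {
            log.Fatal(err)
        }
        if err := consumer.Seek(msgID); err != nil {
            log.Fatal(err)
        }
    }
}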

Just a rough thought:

rpc ReportSegmentsStates(ReportSegmentsStatesRequest) returns (ReportSegmentsStatesResponse);

message ReportSegmentsStatesRequest {
  uint64 timestamp = 1;
  repeated SegmentStates segment_states = 2;
}

message SegmentStates {
  int64 segmentID = 1;
  int64 num_rows = 2; // total rows in the segment, not a delta
}

rpc ReportSegmentsStates should be idempotent, and num_rows in each SegmentStates entry should be the total number of rows, not a delta.

DataCoord only updates a segment's row count if its current ts is greater than the request's timestamp.
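
A sketch of what that rule could look like on the DataCoord side; the Go below is hypothetical, mirroring the proto sketch above rather than the actual Milvus code.

package datacoord

import (
    "errors"
    "sync"
)

// SegmentStates mirrors the proto message above.
type SegmentStates struct {
    SegmentID int64
    NumRows   int64 // total rows, not a delta
}

type segmentMeta struct {
    numRows  int64
    lastSeen uint64 // timestamp of the last applied report
}

type dataCoord struct {
    mu        sync.Mutex
    currentTS func() uint64 // stand-in for the TSO / time-tick clock
    segments  map[int64]*segmentMeta
}

// errTimestampAhead signals DataNode to retry later.
var errTimestampAhead = errors.New("report timestamp ahead of current ts")

func (dc *dataCoord) ReportSegmentsStates(ts uint64, states []SegmentStates) error {
    dc.mu.Lock()
    defer dc.mu.Unlock()
    // Reject reports stamped ahead of DataCoord's clock; the caller
    // retries once its timestamp is no longer in the future.
    if ts >= dc.currentTS() {
        return errTimestampAhead
    }
    for _, s := range states {
        meta, ok := dc.segments[s.SegmentID]
        if !ok {
            meta = &segmentMeta{}
            dc.segments[s.SegmentID] = meta
        }
        if ts <= meta.lastSeen {
            continue // stale or duplicated report: safe to drop
        }
        meta.numRows = s.NumRows // total, so last write wins
        meta.lastSeen = ts
    }
    return nil
}

Because num_rows is a total rather than a delta, a last-write-wins update keyed by timestamp is enough to make retries and duplicated reports harmless.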

  • DataNode will retry the RPC if it fails because its timestamp is ahead of DataCoord's current ts.
  • DataNode should combine several insert updates for a single segment into one RPC, sending the batch when either of the following is reached (see the sketch after this list):
    • 10 insert updates have accumulated
    • 5s have passed since the first update
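
A hypothetical sketch of that batching policy on the DataNode side; the structure and names are assumptions, not the actual implementation.

package datanode

import (
    "sync"
    "time"
)

// reportFunc sends one ReportSegmentsStates RPC for a segment,
// carrying its current total row count.
type reportFunc func(segmentID, totalRows int64)

type segmentBatcher struct {
    mu        sync.Mutex
    segmentID int64
    totalRows int64     // running total, matching the RPC semantics
    pending   int       // insert updates buffered since the last report
    firstAt   time.Time // when the current batch started
    report    reportFunc
}

// OnInsert records one insert update and sends the batch once 10
// updates have accumulated; a timer covers the 5s condition.
func (b *segmentBatcher) OnInsert(rows int64) {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.pending == 0 {
        b.firstAt = time.Now()
        time.AfterFunc(5*time.Second, b.flushIfStale)
    }
    b.totalRows += rows
    b.pending++
    if b.pending >= 10 {
        b.flushLocked()
    }
}

// flushIfStale fires 5s after a batch starts and sends whatever is
// still buffered for that batch.
func (b *segmentBatcher) flushIfStale() {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.pending > 0 && time.Since(b.firstAt) >= 5*time.Second {
        b.flushLocked()
    }
}

func (b *segmentBatcher) flushLocked() {
    b.report(b.segmentID, b.totalRows)
    b.pending = 0
}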

This needs more careful thought, but I have some urgent issues to handle now; I'll come back to it later.

It failed again in the chaos nightly test: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/2441/pipeline

So I'm setting this issue as urgent.