milvus: [Bug]: [chaos][cluster] Insert data fails after etcd pod is killed and recovers to Running status

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Inserting data still fails after the etcd pod has been killed and recovered to Running status. Error message:

Error: <BaseException: (code=1, message=GetSegmentID failed: SegmentIDAllocator failRemainRequest err:syncSegmentID Failed:server is not serving)>
Traceback (most recent call last):
  File "hello_milvus.py", line 89, in <module>
    hello_milvus()
  File "hello_milvus.py", line 47, in hello_milvus
    collection.insert(
  File "/Users/zilliz/opt/anaconda3/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 525, in insert
    res = conn.insert(collection_name=self._name, entities=entities, ids=None,
  File "/Users/zilliz/opt/anaconda3/lib/python3.8/site-packages/pymilvus/client/stub.py", line 61, in handler
    raise e
  File "/Users/zilliz/opt/anaconda3/lib/python3.8/site-packages/pymilvus/client/stub.py", line 45, in handler
    return func(self, *args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/lib/python3.8/site-packages/pymilvus/client/stub.py", line 931, in insert
    return handler.bulk_insert(collection_name, entities, partition_name, timeout, **kwargs)
  File "/Users/zilliz/opt/anaconda3/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 65, in handler
    raise e
  File "/Users/zilliz/opt/anaconda3/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 57, in handler
    return func(self, *args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 505, in bulk_insert
    raise err
  File "/Users/zilliz/opt/anaconda3/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 501, in bulk_insert
    raise BaseException(response.status.error_code, response.status.reason)
pymilvus.client.exceptions.BaseException: <BaseException: (code=1, message=GetSegmentID failed: SegmentIDAllocator failRemainRequest err:syncSegmentID Failed:server is not serving)>
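
For reference, the failing call is a plain `collection.insert()`. A minimal sketch of what `hello_milvus.py` does at that point might look like the following (the schema, field names, and dimension here are assumptions, not the exact contents of the script):

```python
# Minimal sketch, assuming a simple int64 primary key + float vector schema.
import random
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=8),
]
collection = Collection("hello_milvus", CollectionSchema(fields))

# Column-oriented entities: primary keys first, then vectors.
entities = [
    list(range(10)),
    [[random.random() for _ in range(8)] for _ in range(10)],
]

# This is the call that raises "GetSegmentID failed: ... server is not serving"
# after the etcd pod has been killed and recovered.
collection.insert(entities)
```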

Expected Behavior

All operations should work well after etcd recovers.

Steps To Reproduce

1. Deploy Milvus with Helm: `cd tests/python_client/chaos && helm install --wait --timeout 360s milvus-chaos milvus/milvus -f cluster-values.yaml -n=chaos-testing`
2. Run the script before chaos: `python hello_milvus.py`
3. Delete the etcd pod: `kubectl delete pod ${pod_name}`
4. Run the script again after chaos: `python hello_milvus.py` (a helper sketch for waiting until the etcd pod is Running again is shown after this list)
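
To make step 4 more deterministic, one can wait until the etcd pod reports Running again before re-running the script. A minimal helper sketch, assuming the pod name and the `chaos-testing` namespace from step 1 (the timeout and poll interval are arbitrary):

```python
# Polls the pod phase via kubectl until it is Running or the timeout expires.
import subprocess
import time

def wait_for_pod_running(pod_name, namespace="chaos-testing", timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = subprocess.run(
            ["kubectl", "get", "pod", pod_name, "-n", namespace,
             "-o", "jsonpath={.status.phase}"],
            capture_output=True, text=True,
        )
        if result.stdout.strip() == "Running":
            return True
        time.sleep(5)
    return False
```

Since the etcd pod is managed by a StatefulSet, the recreated pod keeps the same name, so polling the original pod name should work.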

Environment

- Milvus version: d54f342
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2): 2.0.0rc8.dev9
- OS (Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Anything else?

k8s_log.zip

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (19 by maintainers)

Most upvoted comments

If we reinstall Milvus (keeping the PVC), Milvus no longer reports the error message. However, it hangs at search or at getting collection entities, just like issues #7313 and #10069 about reinstalling Milvus. #7313 and #10069 require Milvus to keep running for a long time or to hold a large amount of data, whereas in this test the issue can be triggered easily in a short time and with a small amount of data.

When etcd crashes, the system is expected to go down but then recover within minutes. Is that expectation met here? @zhuwenxing

No! The Milvus system still doesn't work even though etcd has been recovered for a long time.