milvus: [Bug]: [benchmark] Concurrency stability test, search raises error: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded; flush time gets longer and longer
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: master-20220513-168935f4
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus-2.1.0.dev39
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
argo task: benchmark-backup-hpw72
test yaml: client-configmap:client-random-locust-compaction server-configmap:server-cluster-8c64m-compaction
server:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
benchmark-backup-hpw72-1-etcd-0 1/1 Running 0 39m 10.97.17.238 qa-node014.zilliz.local <none> <none>
benchmark-backup-hpw72-1-etcd-1 1/1 Running 0 39m 10.97.16.65 qa-node013.zilliz.local <none> <none>
benchmark-backup-hpw72-1-etcd-2 1/1 Running 0 39m 10.97.17.239 qa-node014.zilliz.local <none> <none>
benchmark-backup-hpw72-1-milvus-datacoord-64f99f8646-crb4q 1/1 Running 0 39m 10.97.18.218 qa-node017.zilliz.local <none> <none>
benchmark-backup-hpw72-1-milvus-datanode-6f4b555d4f-h8dzk 1/1 Running 0 39m 10.97.20.74 qa-node018.zilliz.local <none> <none>
benchmark-backup-hpw72-1-milvus-indexcoord-77bbf5ccf9-j459b 1/1 Running 0 39m 10.97.18.217 qa-node017.zilliz.local <none> <none>
benchmark-backup-hpw72-1-milvus-indexnode-55f84ff6b4-4qmkn 1/1 Running 0 39m 10.97.20.75 qa-node018.zilliz.local <none> <none>
benchmark-backup-hpw72-1-milvus-proxy-559878747c-c25tp 1/1 Running 0 39m 10.97.5.20 qa-node003.zilliz.local <none> <none>
benchmark-backup-hpw72-1-milvus-querycoord-59d5bf964-dgtdx 1/1 Running 0 39m 10.97.18.219 qa-node017.zilliz.local <none> <none>
benchmark-backup-hpw72-1-milvus-querynode-77b577f786-rprjw 1/1 Running 0 39m 10.97.19.8 qa-node016.zilliz.local <none> <none>
benchmark-backup-hpw72-1-milvus-rootcoord-784c89554b-96fqd 1/1 Running 0 39m 10.97.5.19 qa-node003.zilliz.local <none> <none>
benchmark-backup-hpw72-1-minio-0 1/1 Running 0 39m 10.97.19.9 qa-node016.zilliz.local <none> <none>
benchmark-backup-hpw72-1-minio-1 1/1 Running 0 39m 10.97.12.116 qa-node015.zilliz.local <none> <none>
benchmark-backup-hpw72-1-minio-2 1/1 Running 0 39m 10.97.12.118 qa-node015.zilliz.local <none> <none>
benchmark-backup-hpw72-1-minio-3 1/1 Running 0 39m 10.97.16.66 qa-node013.zilliz.local <none> <none>
benchmark-backup-hpw72-1-pulsar-bookie-0 1/1 Running 0 39m 10.97.12.121 qa-node015.zilliz.local <none> <none>
benchmark-backup-hpw72-1-pulsar-bookie-1 1/1 Running 0 39m 10.97.17.242 qa-node014.zilliz.local <none> <none>
benchmark-backup-hpw72-1-pulsar-bookie-2 1/1 Running 0 39m 10.97.11.168 qa-node009.zilliz.local <none> <none>
benchmark-backup-hpw72-1-pulsar-bookie-init-9r2zz 0/1 Completed 0 39m 10.97.17.227 qa-node014.zilliz.local <none> <none>
benchmark-backup-hpw72-1-pulsar-broker-0 1/1 Running 0 39m 10.97.17.235 qa-node014.zilliz.local <none> <none>
benchmark-backup-hpw72-1-pulsar-proxy-0 1/1 Running 0 39m 10.97.17.236 qa-node014.zilliz.local <none> <none>
benchmark-backup-hpw72-1-pulsar-pulsar-init-9v49h 0/1 Completed 0 39m 10.97.17.226 qa-node014.zilliz.local <none> <none>
benchmark-backup-hpw72-1-pulsar-recovery-0 1/1 Running 0 39m 10.97.17.233 qa-node014.zilliz.local <none> <none>
benchmark-backup-hpw72-1-pulsar-zookeeper-0 1/1 Running 0 39m 10.97.5.21 qa-node003.zilliz.local <none> <none>
benchmark-backup-hpw72-1-pulsar-zookeeper-1 1/1 Running 0 38m 10.97.18.221 qa-node017.zilliz.local <none> <none>
benchmark-backup-hpw72-1-pulsar-zookeeper-2 1/1 Running 0 37m 10.97.5.23 qa-node003.zilliz.local <none> <none>
client pod: benchmark-backup-hpw72-2114025457
locust_report_2022-05-13_979.log
client log:
[2022-05-13 09:50:03,125] [ ERROR] - grpc RpcError: [_execute_search_requests], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:49:03.124009', 'gRPC error': '2022-05-13 09:50:03.125469'}> (pymilvus.decorators:81)
[2022-05-13 09:50:03,126] [ ERROR] - grpc RpcError: [search], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:49:03.117607', 'gRPC error': '2022-05-13 09:50:03.125993'}> (pymilvus.decorators:81)
[2022-05-13 09:51:03,133] [ ERROR] - grpc RpcError: [_execute_search_requests], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:50:03.131939', 'gRPC error': '2022-05-13 09:51:03.133210'}> (pymilvus.decorators:81)
[2022-05-13 09:51:03,133] [ ERROR] - grpc RpcError: [search], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:50:03.126620', 'gRPC error': '2022-05-13 09:51:03.133532'}> (pymilvus.decorators:81)
[2022-05-13 09:52:03,142] [ ERROR] - grpc RpcError: [_execute_search_requests], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:51:03.141215', 'gRPC error': '2022-05-13 09:52:03.142180'}> (pymilvus.decorators:81)
[2022-05-13 09:52:03,142] [ ERROR] - grpc RpcError: [search], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:51:03.134110', 'gRPC error': '2022-05-13 09:52:03.142481'}> (pymilvus.decorators:81)
[2022-05-13 09:53:03,149] [ ERROR] - grpc RpcError: [_execute_search_requests], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:52:03.148507', 'gRPC error': '2022-05-13 09:53:03.149408'}> (pymilvus.decorators:81)
[2022-05-13 09:53:03,149] [ ERROR] - grpc RpcError: [search], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:52:03.143086', 'gRPC error': '2022-05-13 09:53:03.149743'}> (pymilvus.decorators:81)
[2022-05-13 09:54:03,157] [ ERROR] - grpc RpcError: [_execute_search_requests], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:53:03.155895', 'gRPC error': '2022-05-13 09:54:03.157084'}> (pymilvus.decorators:81)
[2022-05-13 09:54:03,157] [ ERROR] - grpc RpcError: [search], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:53:03.150364', 'gRPC error': '2022-05-13 09:54:03.157360'}> (pymilvus.decorators:81)
[2022-05-13 09:59:03,183] [ ERROR] - grpc RpcError: [_execute_search_requests], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:54:03.180251', 'gRPC error': '2022-05-13 09:59:03.183795'}> (pymilvus.decorators:81)
[2022-05-13 09:59:03,184] [ ERROR] - grpc RpcError: [search], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:54:03.161304', 'gRPC error': '2022-05-13 09:59:03.184079'}> (pymilvus.decorators:81)
[2022-05-13 09:59:03,184] [ ERROR] - grpc RpcError: [query], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:54:03.176755', 'gRPC error': '2022-05-13 09:59:03.184459'}> (pymilvus.decorators:81)
[2022-05-13 10:04:03,185] [ ERROR] - grpc RpcError: [_execute_search_requests], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:59:03.186355', 'gRPC error': '2022-05-13 10:04:03.185349'}> (pymilvus.decorators:81)
[2022-05-13 10:04:03,185] [ ERROR] - grpc RpcError: [search], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:54:03.177134', 'gRPC error': '2022-05-13 10:04:03.185616'}> (pymilvus.decorators:81)
[2022-05-13 10:04:03,186] [ ERROR] - grpc RpcError: [query], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:54:03.176934', 'gRPC error': '2022-05-13 10:04:03.186000'}> (pymilvus.decorators:81)
[2022-05-13 10:08:57,787] [ ERROR] - grpc RpcError: [_execute_search_requests], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 10:04:03.194410', 'gRPC error': '2022-05-13 10:08:57.787903'}> (pymilvus.decorators:81)
[2022-05-13 10:08:57,788] [ ERROR] - grpc RpcError: [search], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-13 09:59:03.185953', 'gRPC error': '2022-05-13 10:08:57.788153'}> (pymilvus.decorators:81)
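Note: the gap between 'RPC start' and 'gRPC error' in the lines above is a fixed 60 s for the early search failures and 300 s for the later search/query ones, which points to a client-side RPC deadline expiring rather than a dropped connection. For reference, a sketch of how such a per-call deadline is passed in pymilvus (endpoint and vector field name are assumptions; search parameters are taken from the test config below):

```python
# Sketch only: a per-call deadline makes a slow server surface as
# StatusCode.DEADLINE_EXCEEDED after `timeout` seconds, as in the log above.
import random
from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint
collection = Collection("sift_10w_128_l2")           # collection used by the test

res = collection.search(
    data=[[random.random() for _ in range(128)] for _ in range(10)],  # nq=10
    anns_field="float_vector",                 # assumed vector field name
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,                                  # top_k from the config
    timeout=60,                                # seconds; matches the 60 s gap
)
```

Flush durations from the same client log show flush time growing steadily over the run: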
[2022-05-13 09:36:35,735] [ DEBUG] - Milvus flush run in 5.754s (milvus_benchmark.client:53)
[2022-05-13 09:36:42,265] [ DEBUG] - Milvus flush run in 6.5257s (milvus_benchmark.client:53)
[2022-05-13 09:36:49,482] [ DEBUG] - Milvus flush run in 7.1028s (milvus_benchmark.client:53)
[2022-05-13 09:36:56,261] [ DEBUG] - Milvus flush run in 6.8103s (milvus_benchmark.client:53)
[2022-05-13 09:37:02,750] [ DEBUG] - Milvus flush run in 6.512s (milvus_benchmark.client:53)
[2022-05-13 09:37:09,626] [ DEBUG] - Milvus flush run in 6.8717s (milvus_benchmark.client:53)
[2022-05-13 09:37:17,254] [ DEBUG] - Milvus flush run in 7.5172s (milvus_benchmark.client:53)
[2022-05-13 09:37:30,373] [ DEBUG] - Milvus flush run in 13.114s (milvus_benchmark.client:53)
[2022-05-13 09:37:48,964] [ DEBUG] - Milvus flush run in 17.0714s (milvus_benchmark.client:53)
[2022-05-13 09:38:01,543] [ DEBUG] - Milvus flush run in 24.8865s (milvus_benchmark.client:53)
[2022-05-13 09:38:01,546] [ DEBUG] - Milvus flush run in 24.7389s (milvus_benchmark.client:53)
[2022-05-13 09:38:14,033] [ DEBUG] - Milvus flush run in 30.994s (milvus_benchmark.client:53)
[2022-05-13 09:38:27,493] [ DEBUG] - Milvus flush run in 38.4083s (milvus_benchmark.client:53)
[2022-05-13 09:38:34,883] [ DEBUG] - Milvus flush run in 39.6026s (milvus_benchmark.client:53)
[2022-05-13 09:38:40,944] [ DEBUG] - Milvus flush run in 39.3927s (milvus_benchmark.client:53)
[2022-05-13 09:39:00,373] [ DEBUG] - Milvus flush run in 52.5873s (milvus_benchmark.client:53)
[2022-05-13 09:39:06,871] [ DEBUG] - Milvus flush run in 52.6703s (milvus_benchmark.client:53)
[2022-05-13 09:39:24,568] [ DEBUG] - Milvus flush run in 49.681s (milvus_benchmark.client:53)
[2022-05-13 09:39:24,572] [ DEBUG] - Milvus flush run in 49.6859s (milvus_benchmark.client:53)
[2022-05-13 09:40:01,422] [ DEBUG] - Milvus flush run in 80.5371s (milvus_benchmark.client:53)
[2022-05-13 09:40:01,425] [ DEBUG] - Milvus flush run in 80.3503s (milvus_benchmark.client:53)
[2022-05-13 09:40:21,634] [ DEBUG] - Milvus flush run in 93.8664s (milvus_benchmark.client:53)
[2022-05-13 09:41:09,958] [ DEBUG] - Milvus flush run in 129.6856s (milvus_benchmark.client:53)
[2022-05-13 09:41:10,444] [ DEBUG] - Milvus flush run in 128.8304s (milvus_benchmark.client:53)
[2022-05-13 09:41:40,440] [ DEBUG] - Milvus flush run in 154.0377s (milvus_benchmark.client:53)
[2022-05-13 09:42:10,939] [ DEBUG] - Milvus flush run in 175.2927s (milvus_benchmark.client:53)
[2022-05-13 09:42:45,034] [ DEBUG] - Milvus flush run in 200.3226s (milvus_benchmark.client:53)
[2022-05-13 09:43:22,931] [ DEBUG] - Milvus flush run in 231.9771s (milvus_benchmark.client:53)
[2022-05-13 09:44:03,428] [ DEBUG] - Milvus flush run in 263.1046s (milvus_benchmark.client:53)
[2022-05-13 09:44:45,771] [ DEBUG] - Milvus flush run in 298.2958s (milvus_benchmark.client:53)
[2022-05-13 09:46:22,541] [ DEBUG] - Milvus flush run in 381.8169s (milvus_benchmark.client:53)
[2022-05-13 09:46:24,006] [ DEBUG] - Milvus flush run in 382.93s (milvus_benchmark.client:53)
[2022-05-13 09:47:14,939] [ DEBUG] - Milvus flush run in 426.8938s (milvus_benchmark.client:53)
[2022-05-13 09:48:09,220] [ DEBUG] - Milvus flush run in 474.5191s (milvus_benchmark.client:53)
[2022-05-13 09:49:03,110] [ DEBUG] - Milvus flush run in 521.4714s (milvus_benchmark.client:53)
[2022-05-13 09:54:03,161] [ DEBUG] - Milvus flush run in 814.5621s (milvus_benchmark.client:53)
[2022-05-13 09:54:03,168] [ DEBUG] - Milvus flush run in 808.3961s (milvus_benchmark.client:53)
[2022-05-13 09:54:03,172] [ DEBUG] - Milvus flush run in 801.177s (milvus_benchmark.client:53)
[2022-05-13 10:04:03,194] [ DEBUG] - Milvus flush run in 1270.8543s (milvus_benchmark.client:53)
[2022-05-13 10:04:03,197] [ DEBUG] - Milvus flush run in 1365.4799s (milvus_benchmark.client:53)
[2022-05-13 10:04:03,201] [ DEBUG] - Milvus flush run in 1334.3907s (milvus_benchmark.client:53)
[2022-05-13 10:04:03,204] [ DEBUG] - Milvus flush run in 1305.8575s (milvus_benchmark.client:53)
[2022-05-13 10:04:03,207] [ DEBUG] - Milvus flush run in 1359.403s (milvus_benchmark.client:53)
[2022-05-13 10:04:03,211] [ DEBUG] - Milvus flush run in 1374.7734s (milvus_benchmark.client:53)
[2022-05-13 10:04:03,215] [ DEBUG] - Milvus flush run in 1388.1768s (milvus_benchmark.client:53)
[2022-05-13 10:04:03,219] [ DEBUG] - Milvus flush run in 1381.4467s (milvus_benchmark.client:53)
[2022-05-13 10:04:03,223] [ DEBUG] - Milvus flush run in 1387.9407s (milvus_benchmark.client:53)
[2022-05-13 10:09:13,611] [ DEBUG] - Milvus flush run in 1296.0893s (milvus_benchmark.client:53)
[2022-05-13 10:09:13,617] [ DEBUG] - Milvus flush run in 1544.3275s (milvus_benchmark.client:53)
[2022-05-13 10:09:13,623] [ DEBUG] - Milvus flush run in 1252.7102s (milvus_benchmark.client:53)
[2022-05-13 10:09:13,629] [ DEBUG] - Milvus flush run in 1356.4952s (milvus_benchmark.client:53)
[2022-05-13 10:09:13,634] [ DEBUG] - Milvus flush run in 1458.0787s (milvus_benchmark.client:53)
[2022-05-13 10:09:13,639] [ DEBUG] - Milvus flush run in 1503.0927s (milvus_benchmark.client:53)
[2022-05-13 10:09:13,643] [ DEBUG] - Milvus flush run in 1356.4098s (milvus_benchmark.client:53)
Expected Behavior
No response
Steps To Reproduce
1. create collection
2. build index of ivf_sq8
3. insert one hundred thousand vectors
4. flush collection
5. build index with the same params
6. load collection (steps 1-6 are sketched below)
7. locust concurrency: query <- search, load, get <- query, scene_insert_delete_flush <- raises error
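For reference, a minimal pymilvus sketch of steps 1-6 (schema, field names, and endpoint are assumptions; the actual test drives Milvus through the milvus_benchmark framework and the Locust workload configured below):

```python
# Minimal pymilvus sketch of steps 1-6 (assumed schema and endpoint; the real
# test uses milvus_benchmark, not this standalone script).
import random
from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections)

connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint

dim = 128
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("float1", DataType.FLOAT),                     # other_fields: float1
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=dim),
])
collection = Collection("sift_10w_128_l2", schema)             # 1. create collection

index_param = {"index_type": "IVF_SQ8", "metric_type": "L2",
               "params": {"nlist": 2048}}
collection.create_index("float_vector", index_param)           # 2. build ivf_sq8 index

ni_per = 50000                                                 # batch size from the config
for batch in range(2):                                         # 3. insert 100k vectors
    ids = list(range(batch * ni_per, (batch + 1) * ni_per))
    collection.insert([
        ids,
        [random.random() for _ in ids],
        [[random.random() for _ in range(dim)] for _ in ids],
    ])

collection.flush()                                             # 4. flush collection
collection.create_index("float_vector", index_param)           # 5. same index params
collection.load()                                              # 6. load collection
```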
Milvus Log
No response
Anything else?
server-cluster-8c64m-compaction:
server:
  server_tag: "8c64m"
milvus:
  deploy_mode: "cluster"
server_resource:
  dataCoordinator:
    compaction:
      retentionDuration: 1800
client-random-locust-compaction:
locust_random_performance:
  collections:
    -
      collection_name: sift_10w_128_l2
      ni_per: 50000
      other_fields: float1
      build_index: true
      index_type: ivf_sq8
      index_param:
        nlist: 2048
      task:
        types:
          -
            type: query
            weight: 20
            params:
              top_k: 10
              nq: 10
              search_param:
                nprobe: 16
              filters:
                -
                  range: "{'range': {'float1': {'GT': -1.0, 'LT': collection_size * 0.5}}}"
          -
            type: load
            weight: 1
          -
            type: get
            weight: 10
            params:
              ids_length: 10
          -
            type: scene_insert_delete_flush
            weight: 1
        connection_num: 1
        clients_num: 20
        spawn_rate: 2
        during_time: 30m
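For illustration, a hedged sketch of how the weighted task types above could map onto a Locust user class (MilvusClientStub and the wiring are hypothetical; the real milvus_benchmark runner differs):

```python
# Hedged sketch (not the milvus_benchmark source): mapping the config's task
# weights onto Locust. MilvusClientStub stands in for the pymilvus wrapper.
from locust import User, constant, task

class MilvusClientStub:
    def search(self): pass                     # type: query -> search RPC
    def get(self): pass                        # type: get   -> query by ids
    def load(self): pass                       # type: load
    def scene_insert_delete_flush(self): pass  # insert + delete + flush

class MilvusUser(User):
    wait_time = constant(0)

    def on_start(self):
        self.client = MilvusClientStub()

    @task(20)              # weight 20 in the config
    def query(self):
        self.client.search()

    @task(10)              # weight 10
    def get(self):
        self.client.get()

    @task(1)               # weight 1
    def load(self):
        self.client.load()

    @task(1)               # weight 1
    def scene_insert_delete_flush(self):
        self.client.scene_insert_delete_flush()

# clients_num: 20, spawn_rate: 2, during_time: 30m roughly correspond to:
#   locust -f this_file.py --headless --users 20 --spawn-rate 2 --run-time 30m
```

The scene_insert_delete_flush task that triggers the error is implemented in the benchmark client as follows: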
@time_wrapper
def scene_insert_delete_flush(self, collection_name=None, vectors=None, ids=None):
    if collection_name is None:
        collection_name = self._collection_name
    logger.debug("[scene_insert_delete_flush] Start scene_insert_delete_flush : %s" % collection_name)
    collection_info = self.get_info(collection_name)
    entities = utils.generate_entities(collection_info, vectors, ids)
    logger.debug("[scene_insert_delete_flush] Start insert : %s" % collection_name)
    self.insert(entities, collection_name=collection_name)
    logger.debug("[scene_insert_delete_flush] Start delete : %s" % collection_name)
    self.delete(expr="id in {}".format(ids[:30000]), collection_name=collection_name, timeout=300)
    logger.debug("[scene_insert_delete_flush] Start flush : %s" % collection_name)
    self.flush(collection_name=collection_name)
    logger.debug("[scene_insert_delete_flush] Finished scene_insert_delete_flush : %s" % collection_name)
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 24 (24 by maintainers)
Commits related to this issue
- Remove delete log in datanode because it may influence datanode performance Resolves #16984 Signed-off-by: wayblink <anyang.wang@zilliz.com> — committed to wayblink/milvus by wayblink 2 years ago
- Remove delete log in datanode because it may influence datanode performance (#17415) Resolves #16984 Signed-off-by: wayblink <anyang.wang@zilliz.com> — committed to milvus-io/milvus by wayblink 2 years ago
This sounds reasonable and matches what we observed: the number of segments keeps growing. I also found that the flush time for each segment grows as well (about 1 min 30 s at most). I will run some tests to verify this. It would explain everything if flush speed and the number of flushing segments have an inverse relationship. But what the bottleneck is remains unclear; CPU and memory usage are not very high.
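As a purely illustrative toy model of that hypothesis (all numbers assumed, not measured): if per-flush latency scales with the number of segments still pending flush, and each insert/delete/flush cycle produces segments faster than the datanode drains them, consecutive flush times grow without bound, qualitatively matching the log above:

```python
# Toy model (illustrative only, assumed numbers): flush latency proportional
# to the flush backlog, with segments produced faster than they drain.
base_flush_s = 6.0       # roughly the first flush durations in the log
produced_per_cycle = 3   # segments created per insert/delete/flush cycle (assumed)
drained_per_cycle = 1    # segments the datanode finishes per cycle (assumed)

backlog = 0
for cycle in range(8):
    backlog += produced_per_cycle - drained_per_cycle
    flush_s = base_flush_s * (1 + backlog)  # latency ~ backlog size
    print("cycle %d: backlog=%d, flush ~%.0fs" % (cycle, backlog, flush_s))
```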
Yeah, I ran some tests last week. The results show that this issue no longer exists on our latest master branch. Flush times are now in a reasonable range (around 10 s) and no errors occur. I will double-check it today.