milvus: [Bug]: [laion-1b] Searching a 100m-768d collection with 100 concurrent clients only reaches ~43 vps

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20230818-74fb244b
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):   pulsar 
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.4.0.dev109
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Resources and config (a rough memory estimate follows the config block below)
  components:
    proxy:
      paused: false
      replicas: 3
      resources:
        limits:
          cpu: "4" 
          memory: 16Gi
        requests:
          cpu: "2" 
          memory: 8Gi 
      serviceType: ClusterIP
    queryNode:
      paused: false
      replicas: 5
      resources:
        limits:
          cpu: "16"
          memory: 128Gi
        requests:
          cpu: 15500m
          memory: 127Gi
  config:
    dataCoord:
      segment:
        expansionRate: 1.15
        maxSize: 4096
        sealProportion: 0.08
    log:
      level: debug
    rootCoord:
      dmlChannelNum: 16
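
For a sense of scale, a rough back-of-the-envelope memory estimate for serving this dataset on the query nodes, assuming float32 vectors and roughly 2*M 4-byte neighbor links per vector at the bottom HNSW layer (an approximation only, not an official sizing formula):

# Rough memory estimate for 100M x 768-d float32 vectors with HNSW (M=30).
N, DIM, M = 100_000_000, 768, 30

raw_vectors_gib = N * DIM * 4 / 2**30        # ~286 GiB of raw vectors
graph_links_gib = N * 2 * M * 4 / 2**30      # ~22 GiB of bottom-layer links
print(f"vectors ~{raw_vectors_gib:.0f} GiB, graph ~{graph_links_gib:.0f} GiB")
# ~310 GiB total vs. 5 query nodes x 128 Gi limit = 640 Gi, so the data
# should fit in memory with the deployment above.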
  2. Test steps (a minimal pymilvus sketch of these steps follows this list):
  • create a collection with an int64 primary key field, a 768-d float vector field, and several scalar fields
Collection schema: {'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'varchar_caption', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 256}}, {'name': 'varchar_NSFW', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 256}}, {'name': 'float64_similarity', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'int64_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'varchar_md5', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 256}}]}
  • create an HNSW index with params: {'index_type': 'HNSW', 'metric_type': 'COSINE', 'params': {'M': 30, 'efConstruction': 360}}
  • insert 100M rows of 768-d data
  • create the same index again
  • load the collection with 1 replica
  • run concurrent searches with the following params:
'concurrent_params': {'concurrent_number': 100,
                      'during_time': '2h',
                      'interval': 60,
                      'spawn_rate': None},
'concurrent_tasks': [{'type': 'search',
                      'weight': 10,
                      'params': {'nq': 10,
                                 'top_k': 100,
                                 'search_param': {'ef': 100},
                                 'timeout': 6000}}]
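
A minimal pymilvus sketch of the steps above; the endpoint, collection name, and query vectors are placeholders, and the actual run is driven by the fouram benchmark client rather than these direct calls:

from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="127.0.0.1", port="19530")  # placeholder endpoint

# Schema matching the report: int64 pk + 768-d float vector + scalar fields
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=768),
    FieldSchema("varchar_caption", DataType.VARCHAR, max_length=256),
    FieldSchema("float64_similarity", DataType.FLOAT),
    FieldSchema("int64_width", DataType.INT64),
    # ... remaining scalar fields omitted for brevity
]
collection = Collection("laion_100m", CollectionSchema(fields))  # placeholder name

# HNSW index with the reported params
index_params = {
    "index_type": "HNSW",
    "metric_type": "COSINE",
    "params": {"M": 30, "efConstruction": 360},
}
collection.create_index("float_vector", index_params)

# ... insert 100M rows in batches, create the same index again, then:
collection.load(replica_number=1)

# One search request as issued by each of the 100 concurrent workers
results = collection.search(
    data=[[0.0] * 768] * 10,            # nq=10 placeholder query vectors
    anns_field="float_vector",
    param={"metric_type": "COSINE", "params": {"ef": 100}},
    limit=100,                          # top_k
    timeout=6000,
)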
  3. Client test result (RPS 4.32 with nq=10 per request, i.e. ~43 vectors per second, as in the title):
'search': {'Requests': 31053,
           'Fails': 0,
           'RPS': 4.32,
           'fail_s': 0.0,
           'RT_max': 44285.58,
           'RT_avg': 22961.36,
           'TP50': 22000.0,
           'TP99': 39000.0}


  4. Server proxy vps (see attached screenshot)

  5. Segment config is segment.maxSize=4096 and expansionRate=1.15, but the compaction result does not look good (segments stay far smaller than the configured max size):

[2023-08-22 02:54:51,807 -  INFO - fouram]: [Base] Parser segment info: 
{'segment_counts': 1053,
 'segment_total_vectors': 100000000,
 'max_segment_raw_count': 990077,
 'min_segment_raw_count': 45173,
 'avg_segment_raw_count': 94966.8,
 'std_segment_raw_count': 59884.9,
 'shards_num': 2,
 'truncated_avg_segment_raw_count': 95057.1,
 'truncated_std_segment_raw_count': 59905.9,
 'top_percentile': [{'TP_10': 89727.0},
                    {'TP_20': 89817.0},
                    {'TP_30': 89891.0},
                    {'TP_40': 89950.0},
                    {'TP_50': 90001.0},
                    {'TP_60': 90054.2},
                    {'TP_70': 90115.0},
                    {'TP_80': 90190.0},
                    {'TP_90': 90281.0}]} (base.py:670)
  • Server metrics (see attached screenshot)
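
For reference, a rough estimate of how many rows a fully compacted segment could hold at segment.maxSize=4096 (MB), assuming the 768-d float32 vector dominates per-row size and ignoring the scalar fields:

# Rough expected row count for a fully compacted segment, assuming the
# 768-d float32 vector dominates per-row size (scalar fields ignored).
DIM = 768
row_bytes = DIM * 4                        # ~3 KiB per row
max_size_mb = 4096                         # dataCoord.segment.maxSize

max_rows = max_size_mb * 1024 * 1024 // row_bytes
print(f"expected rows per full segment: ~{max_rows:,}")   # ~1.4M rows
# Observed: 1053 segments for 100M rows, i.e. ~95K rows per segment on
# average, far below the configured max -- hence the compaction concern.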

Expected Behavior

No response

Steps To Reproduce

4am argo: https://argo-workflows.zilliz.cc/archived-workflows/qa/3f250c89-4cd4-4d8d-bc10-1d60fd980327?nodeId=laion-test-100m-4

grafana link: https://grafana-4am.zilliz.cc/d/uLf5cJ3Ga/milvus2-0?orgId=1&var-datasource=prometheus&var-cluster=&var-namespace=qa-milvus&var-instance=laion-test-3&var-collection=All&var-app_name=milvus&from=1692599896954&to=1692741479885

Milvus Log

pods:

laion-test-3-etcd-0                                               1/1     Running       0               42h     10.104.14.95    4am-node18   <none>           <none>
laion-test-3-etcd-1                                               1/1     Running       0               42h     10.104.12.76    4am-node17   <none>           <none>
laion-test-3-etcd-2                                               1/1     Running       0               42h     10.104.24.195   4am-node29   <none>           <none>
laion-test-3-milvus-datanode-79bdc9c6c-4bs55                      1/1     Running       0               42h     10.104.20.62    4am-node22   <none>           <none>
laion-test-3-milvus-datanode-79bdc9c6c-b8scm                      1/1     Running       0               42h     10.104.16.169   4am-node21   <none>           <none>
laion-test-3-milvus-indexnode-f4c7dd98c-6lxc9                     1/1     Running       0               42h     10.104.16.170   4am-node21   <none>           <none>
laion-test-3-milvus-indexnode-f4c7dd98c-bd7pl                     1/1     Running       0               42h     10.104.20.63    4am-node22   <none>           <none>
laion-test-3-milvus-indexnode-f4c7dd98c-ftwxk                     1/1     Running       0               42h     10.104.20.64    4am-node22   <none>           <none>
laion-test-3-milvus-indexnode-f4c7dd98c-gctdb                     1/1     Running       0               42h     10.104.15.122   4am-node20   <none>           <none>
laion-test-3-milvus-mixcoord-d795878f7-m4k9h                      1/1     Running       0               42h     10.104.16.171   4am-node21   <none>           <none>
laion-test-3-milvus-proxy-646b4484fd-flr2s                        1/1     Running       0               42h     10.104.16.168   4am-node21   <none>           <none>
laion-test-3-milvus-proxy-646b4484fd-hrfc7                        1/1     Running       0               42h     10.104.15.125   4am-node20   <none>           <none>
laion-test-3-milvus-proxy-646b4484fd-qwc85                        1/1     Running       0               41h     10.104.20.66    4am-node22   <none>           <none>
laion-test-3-milvus-querynode-778d649d78-6dntl                    1/1     Running       0               42h     10.104.15.123   4am-node20   <none>           <none>
laion-test-3-milvus-querynode-778d649d78-886rt                    1/1     Running       0               42h     10.104.15.124   4am-node20   <none>           <none>
laion-test-3-milvus-querynode-778d649d78-shdx9                    1/1     Running       0               42h     10.104.16.172   4am-node21   <none>           <none>
laion-test-3-milvus-querynode-778d649d78-wln95                    1/1     Running       0               42h     10.104.20.65    4am-node22   <none>           <none>
laion-test-3-milvus-querynode-778d649d78-zxvth                    1/1     Running       0               42h     10.104.16.174   4am-node21   <none>           <none>
laion-test-3-minio-0                                              1/1     Running       0               42h     10.104.13.182   4am-node16   <none>           <none>
laion-test-3-minio-1                                              1/1     Running       0               42h     10.104.14.97    4am-node18   <none>           <none>
laion-test-3-minio-2                                              1/1     Running       0               42h     10.104.12.78    4am-node17   <none>           <none>
laion-test-3-minio-3                                              1/1     Running       0               42h     10.104.1.2      4am-node10   <none>           <none>
laion-test-3-pulsar-bookie-0                                      1/1     Running       0               42h     10.104.24.199   4am-node29   <none>           <none>
laion-test-3-pulsar-bookie-1                                      1/1     Running       0               42h     10.104.1.3      4am-node10   <none>           <none>
laion-test-3-pulsar-bookie-2                                      1/1     Running       0               42h     10.104.12.82    4am-node17   <none>           <none>
laion-test-3-pulsar-bookie-init-jkh2k                             0/1     Completed     0               42h     10.104.24.193   4am-node29   <none>           <none>
laion-test-3-pulsar-broker-0                                      1/1     Running       0               42h     10.104.23.166   4am-node27   <none>           <none>
laion-test-3-pulsar-proxy-0                                       1/1     Running       0               42h     10.104.13.180   4am-node16   <none>           <none>
laion-test-3-pulsar-pulsar-init-wr8kc                             0/1     Completed     0               42h     10.104.14.91    4am-node18   <none>           <none>
laion-test-3-pulsar-recovery-0                                    1/1     Running       0               42h     10.104.14.93    4am-node18   <none>           <none>
laion-test-3-pulsar-zookeeper-0                                   1/1     Running       0               42h     10.104.14.98    4am-node18   <none>           <none>
laion-test-3-pulsar-zookeeper-1                                   1/1     Running       0               42h     10.104.12.89    4am-node17   <none>           <none>
laion-test-3-pulsar-zookeeper-2                                   1/1     Running       1 (15h ago)     42h     10.104.13.185   4am-node16   <none>           <none>

Anything else?

No response

About this issue

  • Original URL
  • State: closed
  • Created 10 months ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments

What about setting the default timeout to 30 min?
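
If this is about the client-side search timeout, a per-call timeout (in seconds) can already be passed through pymilvus, reusing the collection object from the sketch earlier in this issue; whether a 30-minute default belongs in the server, the SDK, or the benchmark config is a separate decision:

# Per-request timeout of 1800 s (30 min) passed explicitly to search().
results = collection.search(
    data=[[0.0] * 768] * 10,     # nq=10 placeholder query vectors
    anns_field="float_vector",
    param={"params": {"ef": 100}},
    limit=100,
    timeout=1800,
)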