milvus: [Bug]: [benchmark] DISKANN index load failed after inserting 10 million vectors, reporting: "failed to load segment: follower 3 failed to load segment"

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.2.0-20230704-3e055a52
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

argo task: fouramf-stable-test-1688594400, id: 1, case: test_concurrent_locust_diskann_dql_cluster

server:

fouramf-stable-test-1688594400-1-etcd-0                           1/1     Running     0               93m     10.104.20.86    4am-node22   <none>           <none>
fouramf-stable-test-1688594400-1-etcd-1                           1/1     Running     0               93m     10.104.1.26     4am-node10   <none>           <none>
fouramf-stable-test-1688594400-1-etcd-2                           1/1     Running     0               93m     10.104.22.5     4am-node26   <none>           <none>
fouramf-stable-test-1688594400-1-milvus-datacoord-ddb6859btzbpv   1/1     Running     1 (89m ago)     93m     10.104.6.56     4am-node13   <none>           <none>
fouramf-stable-test-1688594400-1-milvus-datanode-598f877bf98wf9   1/1     Running     1 (89m ago)     93m     10.104.21.238   4am-node24   <none>           <none>
fouramf-stable-test-1688594400-1-milvus-indexcoord-57bcb98mc67h   1/1     Running     1 (89m ago)     93m     10.104.22.243   4am-node26   <none>           <none>
fouramf-stable-test-1688594400-1-milvus-indexnode-5f56b4fcrhxz8   1/1     Running     0               93m     10.104.17.237   4am-node23   <none>           <none>
fouramf-stable-test-1688594400-1-milvus-proxy-5d949d565d-vnncr    1/1     Running     1 (89m ago)     93m     10.104.20.75    4am-node22   <none>           <none>
fouramf-stable-test-1688594400-1-milvus-querycoord-5b5b4c9jhd6n   1/1     Running     1 (89m ago)     93m     10.104.1.7      4am-node10   <none>           <none>
fouramf-stable-test-1688594400-1-milvus-querynode-5f9b5769jr4gr   1/1     Running     0               93m     10.104.22.244   4am-node26   <none>           <none>
fouramf-stable-test-1688594400-1-milvus-rootcoord-7f9d7bdfrzb64   1/1     Running     1 (89m ago)     93m     10.104.22.242   4am-node26   <none>           <none>
fouramf-stable-test-1688594400-1-minio-0                          1/1     Running     0               93m     10.104.1.27     4am-node10   <none>           <none>
fouramf-stable-test-1688594400-1-minio-1                          1/1     Running     0               93m     10.104.22.254   4am-node26   <none>           <none>
fouramf-stable-test-1688594400-1-minio-2                          1/1     Running     0               93m     10.104.6.84     4am-node13   <none>           <none>
fouramf-stable-test-1688594400-1-minio-3                          1/1     Running     0               93m     10.104.20.88    4am-node22   <none>           <none>
fouramf-stable-test-1688594400-1-pulsar-bookie-0                  1/1     Running     0               93m     10.104.1.24     4am-node10   <none>           <none>
fouramf-stable-test-1688594400-1-pulsar-bookie-1                  1/1     Running     0               93m     10.104.6.81     4am-node13   <none>           <none>
fouramf-stable-test-1688594400-1-pulsar-bookie-2                  1/1     Running     0               93m     10.104.22.9     4am-node26   <none>           <none>
fouramf-stable-test-1688594400-1-pulsar-bookie-init-gb8rq         0/1     Completed   0               93m     10.104.20.74    4am-node22   <none>           <none>
fouramf-stable-test-1688594400-1-pulsar-broker-0                  1/1     Running     0               93m     10.104.6.55     4am-node13   <none>           <none>
fouramf-stable-test-1688594400-1-pulsar-proxy-0                   1/1     Running     0               93m     10.104.1.6      4am-node10   <none>           <none>
fouramf-stable-test-1688594400-1-pulsar-pulsar-init-whcgq         0/1     Completed   0               93m     10.104.20.73    4am-node22   <none>           <none>
fouramf-stable-test-1688594400-1-pulsar-recovery-0                1/1     Running     0               93m     10.104.20.76    4am-node22   <none>           <none>
fouramf-stable-test-1688594400-1-pulsar-zookeeper-0               1/1     Running     0               93m     10.104.6.69     4am-node13   <none>           <none>
fouramf-stable-test-1688594400-1-pulsar-zookeeper-1               1/1     Running     0               92m     10.104.9.108    4am-node14   <none>           <none>
fouramf-stable-test-1688594400-1-pulsar-zookeeper-2               1/1     Running     0               90m     10.104.22.20    4am-node26   <none>           <none>

client error log:

[2023-07-05 22:16:15,318 -  INFO - fouram]: [Time] Collection.insert run in 1.5832s (api_request.py:45)
[2023-07-05 22:16:15,321 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_cObhOhM0): 9900000 (base.py:469)
[2023-07-05 22:16:15,389 -  INFO - fouram]: [Base] Total time of insert: 371.475s, average number of vector bars inserted per second: 26919.712, average time to insert 50000 vectors per time: 1.8574s (base.py:380)
[2023-07-05 22:16:15,390 -  INFO - fouram]: [Base] Start flush collection fouram_cObhOhM0 (base.py:278)
[2023-07-05 22:16:17,917 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:442)
[2023-07-05 22:16:17,918 -  INFO - fouram]: [Base] Start release collection fouram_cObhOhM0 (base.py:289)
[2023-07-05 22:16:17,920 -  INFO - fouram]: [Base] Start build index of DISKANN for collection fouram_cObhOhM0, params:{'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:428)
[2023-07-05 23:05:03,235 -  INFO - fouram]: [Time] Index run in 2925.3138s (api_request.py:45)
[2023-07-05 23:05:03,236 -  INFO - fouram]: [CommonCases] RT of build index DISKANN: 2925.3138s (common_cases.py:96)
[2023-07-05 23:05:03,243 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:442)
[2023-07-05 23:05:03,243 -  INFO - fouram]: [CommonCases] Prepare index DISKANN done. (common_cases.py:99)
[2023-07-05 23:05:03,244 -  INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:107)
[2023-07-05 23:05:03,246 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_cObhOhM0): 10000000 (base.py:469)
[2023-07-05 23:05:03,246 -  INFO - fouram]: [Base] Start load collection fouram_cObhOhM0,replica_number:1,kwargs:{} (base.py:284)
[2023-07-05 23:35:05,770 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=1, message=failed to load segment: follower 3 failed to load segment, reason [UnexpectedError] Assert "ok" at /go/src/github.com/milvus-io/milvus/internal/core/src/index/VectorDiskIndex.cpp:71
 => load disk index failed)>, <Time:{'RPC start': '2023-07-05 23:35:05.749071', 'RPC error': '2023-07-05 23:35:05.770020'}> (decorators.py:108)
[2023-07-05 23:35:05,771 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=1, message=failed to load segment: follower 3 failed to load segment, reason [UnexpectedError] Assert "ok" at /go/src/github.com/milvus-io/milvus/internal/core/src/index/VectorDiskIndex.cpp:71
 => load disk index failed)>, <Time:{'RPC start': '2023-07-05 23:05:03.269945', 'RPC error': '2023-07-05 23:35:05.771797'}> (decorators.py:108)
[2023-07-05 23:35:05,771 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=1, message=failed to load segment: follower 3 failed to load segment, reason [UnexpectedError] Assert "ok" at /go/src/github.com/milvus-io/milvus/internal/core/src/index/VectorDiskIndex.cpp:71
 => load disk index failed)>, <Time:{'RPC start': '2023-07-05 23:05:03.246733', 'RPC error': '2023-07-05 23:35:05.771915'}> (decorators.py:108)
[2023-07-05 23:35:05,773 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=failed to load segment: follower 3 failed to load segment, reason [UnexpectedError] Assert "ok" at /go/src/github.com/milvus-io/milvus/internal/core/src/index/VectorDiskIndex.cpp:71
 => load disk index failed)> (api_request.py:53)
[2023-07-05 23:35:05,773 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=1, message=failed to load segment: follower 3 failed to load segment, reason [UnexpectedError] Assert "ok" at /go/src/github.com/milvus-io/milvus/internal/core/src/index/VectorDiskIndex.cpp:71
 => load disk index failed)> (func_check.py:52)

memory usage: [screenshot omitted]

Expected Behavior

Collection load succeeds.

Steps To Reproduce

1. Create a collection or use an existing collection.
2. Build an index on the vector column.
3. Insert a certain number of vectors (10 million in this run).
4. Flush the collection.
5. Build the index on the vector column again with the same parameters.
6. Optionally build an index on the scalar columns (not done in this run).
7. Count the total number of rows.
8. Load the collection    ==> fail (see the pymilvus sketch below)
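
For reference, a minimal pymilvus sketch of these steps; the endpoint, collection name, and vector dimension are placeholders, while the index parameters and the 50k insert batch size are taken from the client log above:

```python
import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="127.0.0.1", port="19530")      # placeholder endpoint

dim = 128                                                # assumed dimension
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=dim),
])
collection = Collection("diskann_repro", schema)         # placeholder name

index_params = {"index_type": "DISKANN", "metric_type": "L2", "params": {}}
collection.create_index("float_vector", index_params)    # step 2

batch = 50_000                                           # batch size from the log
for _ in range(10_000_000 // batch):                     # step 3: 10M vectors
    collection.insert([np.random.random((batch, dim)).tolist()])

collection.flush()                                       # step 4
collection.create_index("float_vector", index_params)    # step 5: same params
# step 6 (scalar index) is skipped, matching this run
print(collection.num_entities)                           # step 7: row count
collection.load(replica_number=1)                        # step 8: fails here
```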

Milvus Log

partial querynode error log:

[screenshot omitted]

Anything else?

No response

About this issue

  • State: closed
  • Created a year ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

max_nr = max_ctx * max_events

If Milvus does not set the AIO pool size, the default max_nr is 65536, which is equal to the kernel's default aio-max-nr. When there are multiple query nodes on the same physical machine and each query node has its own AIO context pool, #QueryNode * 65536 > 65536; since Docker containers share the host kernel, io_setup fails during load.
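
Whether a host is actually running out of AIO budget can be checked from the kernel counters in procfs. A minimal diagnostic sketch (the helper below is illustrative, not part of Milvus or the benchmark):

```python
# Compare the host's current AIO reservation with the kernel-wide ceiling that
# io_setup() is checked against. Both files are standard Linux procfs entries.
def read_proc(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

aio_nr = read_proc("/proc/sys/fs/aio-nr")          # AIO events currently reserved
aio_max_nr = read_proc("/proc/sys/fs/aio-max-nr")  # kernel limit (default 65536)

print(f"aio-nr={aio_nr}  aio-max-nr={aio_max_nr}  headroom={aio_max_nr - aio_nr}")
# io_setup() fails with EAGAIN once a new context would push aio-nr past
# aio-max-nr, which is what each extra query node on the same host triggers.
```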

The previous fix set max_ctx to num_threads (4 * #cpu) and max_events to 32, so on a machine with 16 CPUs, 65536 / (16*4*32) = 32: the host can run 32 query nodes, which is more than enough.

When GetVectorByIds for DiskANN was introduced on the master branch, the default max_events was raised to 128 to maximize performance.

As shown in the previous comment, the kernel-side aio-max-nr has been raised from 65536 to 10485760, and 10485760 / (16*4*128) = 1280 query nodes, which is also enough, so closing this issue.
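
A quick back-of-the-envelope check of the numbers above (the 16-CPU host is the example size used in this thread; the other constants are the values quoted in the comments):

```python
cpus = 16                       # example host size from the comment above
max_ctx = 4 * cpus              # num_threads = 4 * #cpu, one AIO context per thread

old_per_node = max_ctx * 32     # previous fix: max_events = 32
new_per_node = max_ctx * 128    # after GetVectorByIds: max_events = 128

print(65536 // old_per_node)     # 32:   query nodes that fit under the default aio-max-nr
print(10485760 // new_per_node)  # 1280: query nodes that fit after raising aio-max-nr
```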