milvus: [Bug]: [benchmark] diskann index load failed after inserting 10 million data and reported: "failed to load segment: follower 3 failed to load segment"
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: 2.2.0-20230704-3e055a52
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
argo task: fouramf-stable-test-1688594400, id: 1, case: test_concurrent_locust_diskann_dql_cluster
server:
fouramf-stable-test-1688594400-1-etcd-0 1/1 Running 0 93m 10.104.20.86 4am-node22 <none> <none>
fouramf-stable-test-1688594400-1-etcd-1 1/1 Running 0 93m 10.104.1.26 4am-node10 <none> <none>
fouramf-stable-test-1688594400-1-etcd-2 1/1 Running 0 93m 10.104.22.5 4am-node26 <none> <none>
fouramf-stable-test-1688594400-1-milvus-datacoord-ddb6859btzbpv 1/1 Running 1 (89m ago) 93m 10.104.6.56 4am-node13 <none> <none>
fouramf-stable-test-1688594400-1-milvus-datanode-598f877bf98wf9 1/1 Running 1 (89m ago) 93m 10.104.21.238 4am-node24 <none> <none>
fouramf-stable-test-1688594400-1-milvus-indexcoord-57bcb98mc67h 1/1 Running 1 (89m ago) 93m 10.104.22.243 4am-node26 <none> <none>
fouramf-stable-test-1688594400-1-milvus-indexnode-5f56b4fcrhxz8 1/1 Running 0 93m 10.104.17.237 4am-node23 <none> <none>
fouramf-stable-test-1688594400-1-milvus-proxy-5d949d565d-vnncr 1/1 Running 1 (89m ago) 93m 10.104.20.75 4am-node22 <none> <none>
fouramf-stable-test-1688594400-1-milvus-querycoord-5b5b4c9jhd6n 1/1 Running 1 (89m ago) 93m 10.104.1.7 4am-node10 <none> <none>
fouramf-stable-test-1688594400-1-milvus-querynode-5f9b5769jr4gr 1/1 Running 0 93m 10.104.22.244 4am-node26 <none> <none>
fouramf-stable-test-1688594400-1-milvus-rootcoord-7f9d7bdfrzb64 1/1 Running 1 (89m ago) 93m 10.104.22.242 4am-node26 <none> <none>
fouramf-stable-test-1688594400-1-minio-0 1/1 Running 0 93m 10.104.1.27 4am-node10 <none> <none>
fouramf-stable-test-1688594400-1-minio-1 1/1 Running 0 93m 10.104.22.254 4am-node26 <none> <none>
fouramf-stable-test-1688594400-1-minio-2 1/1 Running 0 93m 10.104.6.84 4am-node13 <none> <none>
fouramf-stable-test-1688594400-1-minio-3 1/1 Running 0 93m 10.104.20.88 4am-node22 <none> <none>
fouramf-stable-test-1688594400-1-pulsar-bookie-0 1/1 Running 0 93m 10.104.1.24 4am-node10 <none> <none>
fouramf-stable-test-1688594400-1-pulsar-bookie-1 1/1 Running 0 93m 10.104.6.81 4am-node13 <none> <none>
fouramf-stable-test-1688594400-1-pulsar-bookie-2 1/1 Running 0 93m 10.104.22.9 4am-node26 <none> <none>
fouramf-stable-test-1688594400-1-pulsar-bookie-init-gb8rq 0/1 Completed 0 93m 10.104.20.74 4am-node22 <none> <none>
fouramf-stable-test-1688594400-1-pulsar-broker-0 1/1 Running 0 93m 10.104.6.55 4am-node13 <none> <none>
fouramf-stable-test-1688594400-1-pulsar-proxy-0 1/1 Running 0 93m 10.104.1.6 4am-node10 <none> <none>
fouramf-stable-test-1688594400-1-pulsar-pulsar-init-whcgq 0/1 Completed 0 93m 10.104.20.73 4am-node22 <none> <none>
fouramf-stable-test-1688594400-1-pulsar-recovery-0 1/1 Running 0 93m 10.104.20.76 4am-node22 <none> <none>
fouramf-stable-test-1688594400-1-pulsar-zookeeper-0 1/1 Running 0 93m 10.104.6.69 4am-node13 <none> <none>
fouramf-stable-test-1688594400-1-pulsar-zookeeper-1 1/1 Running 0 92m 10.104.9.108 4am-node14 <none> <none>
fouramf-stable-test-1688594400-1-pulsar-zookeeper-2 1/1 Running 0 90m 10.104.22.20 4am-node26 <none> <none>
client error log:
[2023-07-05 22:16:15,318 - INFO - fouram]: [Time] Collection.insert run in 1.5832s (api_request.py:45)
[2023-07-05 22:16:15,321 - INFO - fouram]: [Base] Number of vectors in the collection(fouram_cObhOhM0): 9900000 (base.py:469)
[2023-07-05 22:16:15,389 - INFO - fouram]: [Base] Total time of insert: 371.475s, average number of vector bars inserted per second: 26919.712, average time to insert 50000 vectors per time: 1.8574s (base.py:380)
[2023-07-05 22:16:15,390 - INFO - fouram]: [Base] Start flush collection fouram_cObhOhM0 (base.py:278)
[2023-07-05 22:16:17,917 - INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:442)
[2023-07-05 22:16:17,918 - INFO - fouram]: [Base] Start release collection fouram_cObhOhM0 (base.py:289)
[2023-07-05 22:16:17,920 - INFO - fouram]: [Base] Start build index of DISKANN for collection fouram_cObhOhM0, params:{'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:428)
[2023-07-05 23:05:03,235 - INFO - fouram]: [Time] Index run in 2925.3138s (api_request.py:45)
[2023-07-05 23:05:03,236 - INFO - fouram]: [CommonCases] RT of build index DISKANN: 2925.3138s (common_cases.py:96)
[2023-07-05 23:05:03,243 - INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:442)
[2023-07-05 23:05:03,243 - INFO - fouram]: [CommonCases] Prepare index DISKANN done. (common_cases.py:99)
[2023-07-05 23:05:03,244 - INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:107)
[2023-07-05 23:05:03,246 - INFO - fouram]: [Base] Number of vectors in the collection(fouram_cObhOhM0): 10000000 (base.py:469)
[2023-07-05 23:05:03,246 - INFO - fouram]: [Base] Start load collection fouram_cObhOhM0,replica_number:1,kwargs:{} (base.py:284)
[2023-07-05 23:35:05,770 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=1, message=failed to load segment: follower 3 failed to load segment, reason [UnexpectedError] Assert "ok" at /go/src/github.com/milvus-io/milvus/internal/core/src/index/VectorDiskIndex.cpp:71
=> load disk index failed)>, <Time:{'RPC start': '2023-07-05 23:35:05.749071', 'RPC error': '2023-07-05 23:35:05.770020'}> (decorators.py:108)
[2023-07-05 23:35:05,771 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=1, message=failed to load segment: follower 3 failed to load segment, reason [UnexpectedError] Assert "ok" at /go/src/github.com/milvus-io/milvus/internal/core/src/index/VectorDiskIndex.cpp:71
=> load disk index failed)>, <Time:{'RPC start': '2023-07-05 23:05:03.269945', 'RPC error': '2023-07-05 23:35:05.771797'}> (decorators.py:108)
[2023-07-05 23:35:05,771 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=1, message=failed to load segment: follower 3 failed to load segment, reason [UnexpectedError] Assert "ok" at /go/src/github.com/milvus-io/milvus/internal/core/src/index/VectorDiskIndex.cpp:71
=> load disk index failed)>, <Time:{'RPC start': '2023-07-05 23:05:03.246733', 'RPC error': '2023-07-05 23:35:05.771915'}> (decorators.py:108)
[2023-07-05 23:35:05,773 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=failed to load segment: follower 3 failed to load segment, reason [UnexpectedError] Assert "ok" at /go/src/github.com/milvus-io/milvus/internal/core/src/index/VectorDiskIndex.cpp:71
=> load disk index failed)> (api_request.py:53)
[2023-07-05 23:35:05,773 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=1, message=failed to load segment: follower 3 failed to load segment, reason [UnexpectedError] Assert "ok" at /go/src/github.com/milvus-io/milvus/internal/core/src/index/VectorDiskIndex.cpp:71
=> load disk index failed)> (func_check.py:52)
memory usage:
Expected Behavior
load success
Steps To Reproduce
1. create a collection or use an existing collection
2. build index on vector column
3. insert a certain number of vectors
4. flush collection
5. build index on vector column with the same parameters
6. build index on scalar columns, or not
7. count the total number of rows
8. load collection ==> fail (see the pymilvus sketch below)
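A minimal pymilvus sketch of these steps, for reference only: the actual run drives this through the fouram benchmark framework, and the host, collection name, and vector dimension below are assumptions (the issue does not state them). The 50k batch size and 10M total come from the client log above.

```python
# Hypothetical reproduction sketch using pymilvus; the real run uses the
# fouram benchmark framework. Host, collection name, and dim are assumptions.
import numpy as np
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")

dim = 128  # assumed; the issue does not state the vector dimension
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=dim),
]
collection = Collection("fouram_repro", CollectionSchema(fields))  # step 1

index_params = {"index_type": "DISKANN", "metric_type": "L2", "params": {}}
collection.create_index("float_vector", index_params)  # step 2

batch, total = 50_000, 10_000_000  # 10M rows, 50k per insert as in the log
for start in range(0, total, batch):  # step 3
    ids = list(range(start, start + batch))
    vectors = np.random.random((batch, dim)).tolist()
    collection.insert([ids, vectors])

collection.flush()  # step 4
collection.release()
collection.create_index("float_vector", index_params)  # step 5: same params
# step 6 skipped: the failing case indexes no scalar columns
print(collection.num_entities)  # step 7
collection.load(replica_number=1)  # step 8: raises MilvusException here
```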
Milvus Log
partial querynode error log:
Anything else?
No response
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 21 (21 by maintainers)
Root cause (from maintainer comments):

`max_nr` = `max_ctx` * `max_events`

If Milvus does not set the AIO pool size, the default `max_nr` is 65536, which equals the kernel default `aio-max-nr`. When multiple query nodes run on the same physical machine, each with its own AIO context pool, #QueryNode * 65536 > 65536; since Docker containers share the host kernel, `io_setup` fails during load.

The previous fix set `max_ctx` to `num_threads` (4 * #cpu) and `max_events` to 32, so on a machine with 16 CPUs, 65536 / (16 * 4 * 32) = 32: the host can run 32 query nodes, which is far more than enough.

When GetVectorByIds in DiskANN was introduced on the master branch, the default `max_events` was raised to 128 to maximize performance.

As shown in the previous comment, the kernel-side `aio-max-nr` has been raised from 65536 to 10485760, and 10485760 / (16 * 4 * 128) = 1280 query nodes, which is also enough, so closing this issue.
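As a quick sanity check of that arithmetic, a small script (a sketch, not part of Milvus) can read the kernel limit and estimate how many query nodes fit on one host under the stated assumptions:

```python
# Sanity-check sketch (not part of Milvus) for the capacity math above,
# assuming max_ctx = num_threads = 4 * #cpu and a fixed max_events per context.
import os

def max_query_nodes_per_host(max_events: int) -> int:
    # fs.aio-max-nr is the kernel-wide cap on in-flight AIO events; it is
    # shared by every container on the host because Docker shares the kernel.
    with open("/proc/sys/fs/aio-max-nr") as f:
        aio_max_nr = int(f.read())
    max_ctx = 4 * os.cpu_count()          # per the comment: num_threads = 4 * #cpu
    per_node_max_nr = max_ctx * max_events  # max_nr consumed by one query node
    return aio_max_nr // per_node_max_nr

if __name__ == "__main__":
    # On a 16-CPU host: aio-max-nr=65536 with max_events=32 yields 32 nodes;
    # with max_events=128 it drops to 8, and raising aio-max-nr to 10485760
    # brings it up to 1280.
    print(max_query_nodes_per_host(max_events=128))
```

The kernel limit itself is raised on the host, e.g. with `sysctl -w fs.aio-max-nr=10485760`, which is the change referenced in the closing comment.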