milvus: [Bug]: The number of results returned by the search iterator does not equal the number of inserted entities with FLAT index and (COSINE or L2) metric type

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20230921-206cc14d
- Deployment mode(standalone or cluster): both
- MQ type(rocksmq, pulsar or kafka): all
- SDK version(e.g. pymilvus v2.0.0rc2): 2.3.0.post1.dev13
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The number of results returned by the search iterator does not equal the number of inserted entities when using a FLAT index with the COSINE metric type.

Expected Behavior

The number of results returned by the search iterator equals the number of inserted entities when using a FLAT index with the COSINE metric type.

Steps To Reproduce

import random
import time

import numpy as np
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

connections.connect(host="***", port="19530")
dim = 768
int64_field = FieldSchema(name="int64", dtype=DataType.INT64, is_primary=True)
float_field = FieldSchema(name="float", dtype=DataType.FLOAT)
bool_field = FieldSchema(name="bool", dtype=DataType.BOOL)
float_vector = FieldSchema(name="float_vector", dtype=DataType.FLOAT_VECTOR, dim=dim)
schema = CollectionSchema(fields=[int64_field, float_field, bool_field, float_vector])
collection = Collection("test_search_iterator_6", schema=schema)
nb = 1000000
vectors = [[random.random() for _ in range(dim)] for _ in range(nb)]
insert_batch_num = 10000
for i in range(nb // insert_batch_num):
    start, end = i * insert_batch_num, (i + 1) * insert_batch_num
    batch_vectors = vectors[start:end]
    print("get %d vectors" % len(batch_vectors))
    # Column order must match the schema: int64, float, bool, float_vector.
    res = collection.insert([
        list(range(start, end)),
        [np.float32(j) for j in range(start, end)],
        [np.bool_(j) for j in range(start, end)],
        batch_vectors,
    ])
    print("inserted %d %d " % (i, insert_batch_num))
index_param = {"index_type": "FLAT", "metric_type": "COSINE"}
collection.create_index("float_vector", index_param, index_name="index_name")
collection.load()
default_search_params = {"metric_type": "COSINE"}
collection.flush()
time.sleep(30)
print(collection.num_entities)
batch_size = 10000
search_iterator = collection.search_iterator(vectors[:1], "float_vector", default_search_params, batch_size=batch_size)
page_idx = 0
distance_struct_array = []
while True:
    res = search_iterator.next()
    # An empty page means the iterator is exhausted.
    if len(res) == 0:
        print("search iteration finished, close")
        search_iterator.close()
        break
    print(len(res))
    page_idx += 1
    print(f"page{page_idx}-------------------------")
    # Collect every hit so the total can be compared with nb afterwards.
    for i in range(len(res)):
        distance_struct_array.append({'id': res[i].id, 'distance': res[i].distance})

print(len(distance_struct_array))
print(distance_struct_array[:100])
assert len(distance_struct_array)==nb
collection.drop()
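
When the final assertion fails, it helps to know whether ids are missing, duplicated, or both. The helper below is a minimal sketch for that, not part of the original report: `summarize_ids` is a hypothetical name, and it assumes the primary keys are exactly `0..nb-1`, as inserted by the script above.

```python
from collections import Counter


def summarize_ids(returned_ids, expected_count):
    """Compare returned primary keys against the expected range 0..expected_count-1.

    Assumption: primary keys were inserted as consecutive integers starting at 0.
    """
    counts = Counter(returned_ids)
    # Ids that the iterator returned more than once.
    duplicates = sorted(i for i, c in counts.items() if c > 1)
    # Ids that were inserted but never returned.
    missing = sorted(set(range(expected_count)) - set(counts))
    return {
        "returned": len(returned_ids),
        "unique": len(counts),
        "duplicates": duplicates,
        "missing": missing,
    }
```

With the script above, it could be called as `summarize_ids([d["id"] for d in distance_struct_array], nb)` right before the assertion.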

Milvus Log

No response

Anything else?

No response

About this issue

  • State: closed
  • Created 9 months ago
  • Comments: 29 (29 by maintainers)

Most upvoted comments

I found that the results returned by the SDK are not strictly sorted by distance, so there may be a flaw in the reduce logic.
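
One way to check the ordering observation on the collected results is to scan the distances for out-of-order neighbours. The sketch below is an illustration, not the maintainers' code: `find_order_violations` is a hypothetical helper, and it assumes COSINE scores are expected in descending order (higher means more similar), while L2 distances would be expected in ascending order.

```python
def find_order_violations(distances, descending=True):
    """Return every index i where distances[i] is out of order relative to distances[i - 1].

    Assumption: descending=True matches COSINE; use descending=False for L2.
    Equal neighbouring values are treated as correctly ordered.
    """
    violations = []
    for i in range(1, len(distances)):
        prev, cur = distances[i - 1], distances[i]
        out_of_order = cur > prev if descending else cur < prev
        if out_of_order:
            violations.append(i)
    return violations


# Made-up sample values, for illustration only.
sample = [0.99, 0.97, 0.98, 0.90, 0.90, 0.95]
print(find_order_violations(sample))  # [2, 5]
```

Against the reproduction script, it could be called as `find_order_violations([d["distance"] for d in distance_struct_array])`. Any reported index that falls on a multiple of `batch_size` would point at the boundary between two iterator pages.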