milvus: [Bug]: Search iterator results are sometimes missing

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20230620-247f1170
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0.dev75
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Search iterator results are sometimes missing. I don't set radius or range_filter. Is this related to the default radius?

  1. search with expression:
>       assert len(set(pk_list)) == 1000
E       assert 993 == 1000
E         +993
E         -1000
  2. search without expression:
>       assert len(set(pk_list)) == default_nb
E       assert 2999 == 3000
E         +2999
E         -3000

Expected Behavior

The search iterator should return all matching results.

Steps To Reproduce

1. create a collection
2. insert 3000 rows
3. create an index with the L2 metric, then load the collection
4. use an expression to choose 1000 rows (only for the search with expression)
5. run the search iterator
6. count the total results
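
When counting the results in step 6, it can help to record which primary keys are missing or duplicated rather than only comparing totals. A minimal sketch, not pymilvus code: `drain_iterator` and `diagnose` are hypothetical helpers, and the paged data below is a made-up stand-in for `search_iterator.next()`.

```python
from collections import Counter


def drain_iterator(next_page):
    """Call next_page() until it returns an empty page; collect every primary key."""
    pks = []
    while True:
        page = next_page()
        if not page:
            break
        pks.extend(page)
    return pks


def diagnose(pks, expected):
    """Compare the collected primary keys with the expected ones."""
    counts = Counter(pks)
    expected = set(expected)
    return {
        "missing": sorted(expected - set(counts)),
        "duplicated": sorted(pk for pk, n in counts.items() if n > 1),
        "unexpected": sorted(set(counts) - expected),
    }


# Hypothetical stand-in for the iterator: pages of 100 ids, with id 1500 dropped.
_ids = [pk for pk in range(1000, 2000) if pk != 1500]
_pages = iter([_ids[i:i + 100] for i in range(0, len(_ids), 100)] + [[]])

report = diagnose(drain_iterator(lambda: next(_pages)), range(1000, 2000))
print(report)
```

With the test wrapper used in this report, the callable would presumably be something like `lambda: [hit.id for hit in search_iterator.next()[0]]`, based on how the test code below reads `res[0][i].id`.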

Milvus Log

    @pytest.mark.tags(CaseLabel.L1)
    @pytest.mark.parametrize("metrics", ct.float_metrics[:1])
    def test_search_iterator_with_expression(self, metrics):
        """
        target: test search iterator normal
        method: 1. search iterator
                2. check the result, expect pk not repeat and meet the expr requirements
        expected: search successfully
        """
        # 1. initialize with data
        limit = 100
        dim = 128
        collection_w = self.init_collection_general(prefix, True, dim=dim, is_index=False)[0]
        collection_w.create_index(field_name, {"metric_type": metrics})
        collection_w.load()
        # 2. search iterator
        search_params = {"metric_type": metrics}
        expression = "1000.0 <= float < 2000.0"
        search_iterator = collection_w.search_iterator(vectors[:1], field_name, search_params,
                                                       limit, output_fields=['float'], expr=expression)[0]
        # 3. check the result
        page_idx = 0
        pk_list = []
        while True:
            res = search_iterator.next()
            if len(res[0]) == 0:
                log.info("search iteration finished, close")
                search_iterator.close()
                break
            for i in range(len(res[0])):
                # log.info(res[0][i])
                pk_list.append(res[0][i].id)
            page_idx += 1
        log.info(len(pk_list))
        assert len(set(pk_list)) == 1000

Anything else?

No response

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

I extract the ids in [1000, 2000) that are missing from the iterator results, and then try to query all the data.

class TestSearchIterator(TestcaseBase):
    """ Test case of search iterator """
    @pytest.mark.tags(CaseLabel.L2)
    @pytest.mark.parametrize("metrics", ct.float_metrics[:1])
    def test_search_iterator_with_expression(self, metrics):
        """
        target: test search iterator normal
        method: 1. search iterator
                2. check the result, expect pk not repeat and meet the expr requirements
        expected: search successfully
        """
        # 1. initialize with data
        limit = 100
        dim = 128
        collection_w = self.init_collection_general(prefix, True, dim=dim, is_index=False)[0]
        collection_w.create_index(field_name, {"metric_type": metrics})
        collection_w.load()
        # 2. search iterator
        search_params = {"metric_type": metrics}
        expression = "1000.0 <= float < 2000.0"
        search_iterator = collection_w.search_iterator(vectors[:1], field_name, search_params,
                                                       limit, output_fields=['float'], expr=expression)[0]
        # 3. check the result
        page_idx = 0
        pk_list = []
        while True:
            res = search_iterator.next()
            if len(res[0]) == 0:
                log.info("search iteration finished, close")
                search_iterator.close()
                break
            
            for i in range(len(res[0])):
                pk_list.append(res[0][i].id)
            page_idx += 1
        log.info(len(pk_list))
        notExistId=[]
        pk_set = set(pk_list)
        if len(pk_set) == 1000:
            return

        for i in range(1000,2000):
            if i not in pk_set:
                notExistId.append(i)

        print("find not exist id", notExistId)
        expression = f"float not in {notExistId} "
        # print("search with expression", expression, 'limit', limit, 'param', search_param , 'name',field_name)
        # res =collection_w.search(vectors[:1], field_name, search_params, limit, expr=expression)
        # print('search result:',res[0])

        print("query with expression", expression)
        res = collection_w.query(expression, output_fields=['float'])
        pk_list = []
        for i in range(len(res[0])):
            pk_list.append(res[0][i]['int64'])
        # print(res[0][-1])
        se = set(pk_list)
        for i in range(0,2000):
            if i not in se:
                print(f'{i} not in query result')
        assert False

Any reasons?

I suppose the difference between the two pieces of code is that the code from @NicoYuan1986 searches the growing segment, so the final result is produced by brute-force calculation without using HNSW, while the code inside the e2e test uses HNSW as expected. I got the following log in querynode.log:

[2023/06/25 16:42:13.982 +08:00] [INFO] [segments/search.go:94] ["search growing/sealed segments without indexes"] [traceID=9340d5430619b939f618f915e907779f] [segmentIDs="[442410230948028121]"]

Try a larger dataset; the result might be different.

It's true that HNSW can overlook a small part of the result set. But the code I mentioned gets 1000 items every time, whereas the pytest code gets only 993~997 items every time. Although we could enlarge the dataset and verify that range search may miss some items, that still would not explain this phenomenon.
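
An exact baseline gives a ground truth to compare iterator results against. A minimal sketch, assuming small random vectors and a hypothetical `brute_force_pages` helper (this is not Milvus or pymilvus code): ranking every row by L2 distance and slicing the ranking into pages returns each primary key exactly once.

```python
import math
import random


def brute_force_pages(query, rows, page_size):
    """Yield pages of (pk, distance), nearest first, visiting every row exactly once."""
    ranked = sorted(
        ((pk, math.dist(query, vec)) for pk, vec in rows.items()),
        key=lambda item: (item[1], item[0]),  # break distance ties by primary key
    )
    for start in range(0, len(ranked), page_size):
        yield ranked[start:start + page_size]


# Made-up data: 3000 rows of 8-dimensional vectors keyed by primary key.
random.seed(0)
rows = {pk: [random.random() for _ in range(8)] for pk in range(3000)}
query = rows[0]

pages = list(brute_force_pages(query, rows, page_size=100))
seen = [pk for page in pages for pk, _ in page]
print(len(pages), len(seen), len(set(seen)))
```

The set of ids returned by a real iterator over the same rows can then be diffed against `seen` to list exactly which rows were skipped.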

NicoYuan1986 has tested various variables, and the only remaining possibility is the DataFrame used by pytest.

Hello @NicoYuan1986, I tested with your PR #25039 and it passed. Can you test my PR again? 😃 milvus-io/pymilvus#1551

Wow! Good job.