milvus: [Bug]: Create collection hangs and cannot create producer on topic with backlog quota exceeded
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: master-20220316-d4ad785b
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.0.2.dev5
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory:
- GPU:
- Others:
Current Behavior
- deploy milvus cluster by operator, and all replicas are 1
- Run the following script several times: it creates a collection and inserts 10 million vectors in batches. During these runs the node's disk filled up repeatedly, causing many pods to be evicted and restarted. I am not sure whether the collection was ever dropped.
host = "10.100.32.xxx"
port = 31746
hdf5_source_file = "/Users/nausicca/Downloads/vectors/sift-128-euclidean.hdf5"
ni = 50000
nb = 1000000
# define field and schema
collection_w = ApiCollectionWrapper()
filed_w = ApiFieldSchemaWrapper()
schema_w = ApiCollectionSchemaWrapper()
fields = [filed_w.init_field_schema(name="id", dtype=DataType.INT64, is_primary=True)[0],
filed_w.init_field_schema(name="vec", dtype=DataType.FLOAT_VECTOR, dim=128)[0]]
schema = schema_w.init_collection_schema(fields, auto_id=True)[0]
# create collection
collection_w.init_collection(name=cf.gen_unique_str("disk"), schema=schema, shards_num=1, timeout=20)
log.info(collection_w.num_entities)
dataset = h5py.File(hdf5_source_file)
# insert
vectors = np.array(dataset['train'])
# .astype(np.float(32))
for i in range(10):
s = time.time()
for i in range(nb // ni):
start = i * ni
end = (i + 1) * ni
# int_values = np.arange(start, end, dtype='int64')
print(f'start insert {start}:{end}')
data = [vectors[start: end]]
collection_w.insert(data)
log.info(collection_w.num_entities)
log.info(f'{i} insert cost: {time.time() - s}')
- On one run, create collection hung even though timeout=20 was set, and no timeout exception was raised:
utility.list_collections()
['disk_3Y7MChs5', 'disk_r2NHiyMw', 'disk_pem1gMUE', 'disk_WjNQUIyW', 'disk_hZbh5k6Q']
Milvus logs: milvus_logs.tar.gz
Expected Behavior
Either the collection should be created successfully, or a timeout exception should be raised (see the sketch below).
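For reference, this is roughly the behavior the test expects, assuming the wrapper forwards timeout to pymilvus's create path. This is a sketch reusing the names from the script above, not a claim about the wrapper's internals:

from pymilvus import utility

try:
    # expected: returns within ~20 s or raises; it must not hang indefinitely
    collection_w.init_collection(name=cf.gen_unique_str("disk"),
                                 schema=schema, shards_num=1, timeout=20)
except Exception as e:  # pymilvus 2.x surfaces timeouts as MilvusException subclasses
    log.error(f"create collection failed or timed out: {e}")
else:
    # on success the new collection should be visible
    assert collection_w.name in utility.list_collections()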
Steps To Reproduce
No response
Anything else?
Pulsar topics stats: https://zilliverse.feishu.cn/docs/doccn34iE0kVvTinUFLqo7ZfNdc
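The "cannot create producer ... backlog quota exceeded" error in the title is Pulsar refusing new producers once the namespace backlog quota is exhausted (this happens under the producer_exception enforcement policy). A minimal sketch for inspecting that quota over Pulsar's admin REST API; the admin URL and the public/default tenant/namespace are assumptions about this deployment:

import requests

PULSAR_ADMIN = "http://pulsar-broker:8080"  # assumed Pulsar admin service URL

# backlog quota map for the namespace: storage limit plus the enforcement policy
# (producer_request_hold / producer_exception / consumer_backlog_eviction)
resp = requests.get(f"{PULSAR_ADMIN}/admin/v2/namespaces/public/default/backlogQuotaMap")
print(resp.json())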
About this issue
- State: closed
- Created 2 years ago
- Comments: 32 (32 by maintainers)
Commits related to this issue
- Make DataNode release rather than delete when reassign 1. Reassgin now will assign to the original Node if no other nodes avaliable 2. Make AddNode balance async: ToRealse + Reassign See also: #1... — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Make DataNode release rather than delete when reassign (#17293) 1. Reassgin now will assign to the original Node if no other nodes avaliable 2. Make AddNode balance async: ToRealse + Reassign See... — committed to milvus-io/milvus by XuanYang-cn 2 years ago
/assign @XuanYang-cn
/unassign
Deployed a fresh cluster with image master-20220520-d525e955. After the datanode and datacoord restarted 3 times because of a chaos disk-fill attack:
chaosd attack disk fill -c 87 -p /tmp
some subscriptions of the offline datanodes were not successfully unsubscribed; datanode 21 has obviously gone offline.
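To confirm which subscriptions are still attached to a channel topic after a datanode goes offline, one could query the Pulsar topic stats endpoint. A sketch; the admin URL and topic name are illustrative, not taken from this deployment:

import requests

PULSAR_ADMIN = "http://pulsar-broker:8080"                  # assumed admin URL
TOPIC = "persistent/public/default/by-dev-rootcoord-dml_0"  # illustrative Milvus channel topic

stats = requests.get(f"{PULSAR_ADMIN}/admin/v2/{TOPIC}/stats").json()
# every key is a subscription; any subscription belonging to the offline
# datanode should have been removed when its channels were reassigned
for name, sub in stats.get("subscriptions", {}).items():
    print(name, "backlog:", sub.get("msgBacklog"), "consumers:", len(sub.get("consumers", [])))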