milvus: [Bug]: Create collection hangs and cannot create producer on topic with backlog quota exceeded

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20220316-d4ad785b
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.0.2.dev5
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. deploy milvus cluster by operator, and all replicas are 1
  2. Run the following script several times: each run creates a collection and inserts 10 million vectors in batches. During these runs the node's disk filled up multiple times, causing many pods to be evicted and restarted, and I am not sure whether I ever dropped the collections.
        host = "10.100.32.xxx"
        port = 31746
        hdf5_source_file = "/Users/nausicca/Downloads/vectors/sift-128-euclidean.hdf5"
        ni = 50000
        nb = 1000000      
        # define fields and schema (ApiCollectionWrapper, cf and log come
        # from Milvus's test framework)
        collection_w = ApiCollectionWrapper()
        field_w = ApiFieldSchemaWrapper()
        schema_w = ApiCollectionSchemaWrapper()
        fields = [field_w.init_field_schema(name="id", dtype=DataType.INT64, is_primary=True)[0],
                  field_w.init_field_schema(name="vec", dtype=DataType.FLOAT_VECTOR, dim=128)[0]]
        schema = schema_w.init_collection_schema(fields, auto_id=True)[0]

        # create collection
        collection_w.init_collection(name=cf.gen_unique_str("disk"), schema=schema, shards_num=1, timeout=20)
        log.info(collection_w.num_entities)

        dataset = h5py.File(hdf5_source_file)

        # insert
        vectors = np.array(dataset['train'])  # .astype(np.float32)
        for i in range(10):
            s = time.time()
            # use a distinct inner variable so the per-round log below
            # reports the right round index
            for j in range(nb // ni):
                start = j * ni
                end = (j + 1) * ni
                # int_values = np.arange(start, end, dtype='int64')
                print(f'start insert {start}:{end}')
                data = [vectors[start: end]]
                collection_w.insert(data)
            log.info(collection_w.num_entities)
            log.info(f'{i} insert cost: {time.time() - s}')
  3. In one run, create collection hung even though timeout=20 was set. The collection list afterwards:

        utility.list_collections()
        ['disk_3Y7MChs5', 'disk_r2NHiyMw', 'disk_pem1gMUE', 'disk_WjNQUIyW', 'disk_hZbh5k6Q']
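For reference, the batched-insert arithmetic in the script above can be checked in isolation (pure Python, no Milvus connection needed); it confirms that each round covers 1,000,000 rows in 20 batches of 50,000, so 10 rounds insert the 10 million vectors mentioned:

```python
def batch_ranges(nb, ni):
    """Yield (start, end) row ranges covering nb rows in chunks of ni."""
    for j in range(nb // ni):
        yield j * ni, (j + 1) * ni

ranges = list(batch_ranges(nb=1_000_000, ni=50_000))
print(len(ranges), ranges[0], ranges[-1])  # → 20 (0, 50000) (950000, 1000000)
```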

Milvus logs: milvus_logs.tar.gz

Expected Behavior

Either the collection is created successfully within the timeout, or a timeout exception is raised; the call should not hang.
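The client-side contract I would expect can be sketched with a generic deadline wrapper (pure Python, not pymilvus internals; `stuck_rpc` is a hypothetical stand-in for a create-collection call that never returns):

```python
import concurrent.futures
import time

def call_with_deadline(fn, timeout_s, *args, **kwargs):
    """Run fn, raising TimeoutError instead of blocking past timeout_s.

    The worker thread may keep running in the background if fn never
    returns; the point is that the caller gets control back.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    finally:
        pool.shutdown(wait=False)

def stuck_rpc():
    # stand-in for a create-collection RPC that hangs
    time.sleep(2)

try:
    call_with_deadline(stuck_rpc, 0.2)
    outcome = "completed"
except concurrent.futures.TimeoutError:
    outcome = "timed out"
print(outcome)  # → timed out
```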

Steps To Reproduce

No response

Anything else?

Pulsar topics stats: https://zilliverse.feishu.cn/docs/doccn34iE0kVvTinUFLqo7ZfNdc
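The "backlog quota exceeded" in the title can be cross-checked against the namespace quota; an ops sketch (not from the original report), assuming the DML topics live in the default `public/default` namespace:

```shell
# show the backlog quota configured on the namespace
./pulsar-admin namespaces get-backlog-quotas public/default

# per-topic stats; a steadily growing backlog on a subscription that no
# live datanode owns points at stale subscriptions
./pulsar-admin topics stats "persistent://public/default/by-dev-rootcoord-dml_0"
```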

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 32 (32 by maintainers)

Most upvoted comments

/assign @XuanYang-cn
/unassign

Deployed a fresh cluster with image `master-20220520-d525e955`. After datanode and datacoord restarted 3 times under the chaos disk-fill attack (`chaosd attack disk fill -c 87 -p /tmp`), some subscriptions of the offline datanodes were not successfully unsubscribed. Datanode 21 has clearly gone offline, yet its subscriptions remain:

# ./pulsar-admin topics subscriptions "persistent://public/default/by-dev-rootcoord-dml_0"
"by-dev-dataNode-42-433407184247455745"
"by-dev-dataNode-42-433406364697231361"
"by-dev-dataNode-21-433407184247455745"
"by-dev-dataNode-21-433406364697231361"
#
# ./pulsar-admin topics subscriptions "persistent://public/default/by-dev-rootcoord-dml_1"
"by-dev-dataNode-42-433407184247455745"
"by-dev-dataNode-42-433406364697231361"
#
# ./pulsar-admin topics subscriptions "persistent://public/default/by-dev-rootcoord-dml_2"
"by-dev-dataNode-42-433406364697231361"
#
# ./pulsar-admin topics subscriptions "persistent://public/default/by-dev-rootcoord-dml_3"
"by-dev-dataNode-42-433406364697231361"
#
# ./pulsar-admin topics subscriptions "persistent://public/default/by-dev-rootcoord-dml_4"
#
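The stale subscriptions in the listing above can be picked out mechanically. A hypothetical helper (not part of Milvus), assuming the `by-dev-dataNode-<nodeID>-<subID>` naming seen in the output:

```python
import re

# subscription names as printed by `pulsar-admin topics subscriptions`
SUB_RE = re.compile(r"by-dev-dataNode-(\d+)-\d+")

def stale_subscriptions(subs, live_node_ids):
    """Return subscriptions whose datanode ID is not among the live nodes."""
    stale = []
    for name in subs:
        m = SUB_RE.fullmatch(name)
        if m and int(m.group(1)) not in live_node_ids:
            stale.append(name)
    return stale

subs = [
    "by-dev-dataNode-42-433407184247455745",
    "by-dev-dataNode-42-433406364697231361",
    "by-dev-dataNode-21-433407184247455745",
    "by-dev-dataNode-21-433406364697231361",
]
# datanode 21 is offline in the scenario above; only node 42 is live
print(stale_subscriptions(subs, live_node_ids={42}))
# → ['by-dev-dataNode-21-433407184247455745', 'by-dev-dataNode-21-433406364697231361']
```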