milvus: [Bug]: Create collection hangs and cannot create producer on topic with backlog quota exceeded
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: master-20220316-d4ad785b
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.0.2.dev5
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory:
- GPU:
- Others:
Current Behavior
- deploy milvus cluster by operator, and all replicas are 1
- Run the following script several times: it creates a collection and inserts 10 million vectors in batches. During these runs the node's disk filled up repeatedly, causing many pods to be evicted and restarted. I am not sure whether the collection was ever dropped.
host = "10.100.32.xxx"
port = 31746
hdf5_source_file = "/Users/nausicca/Downloads/vectors/sift-128-euclidean.hdf5"
ni = 50000
nb = 1000000
# define field and schema
collection_w = ApiCollectionWrapper()
filed_w = ApiFieldSchemaWrapper()
schema_w = ApiCollectionSchemaWrapper()
fields = [filed_w.init_field_schema(name="id", dtype=DataType.INT64, is_primary=True)[0],
filed_w.init_field_schema(name="vec", dtype=DataType.FLOAT_VECTOR, dim=128)[0]]
schema = schema_w.init_collection_schema(fields, auto_id=True)[0]
# create collection
collection_w.init_collection(name=cf.gen_unique_str("disk"), schema=schema, shards_num=1, timeout=20)
log.info(collection_w.num_entities)
dataset = h5py.File(hdf5_source_file)
# insert
vectors = np.array(dataset['train'])
# .astype(np.float(32))
for i in range(10):
s = time.time()
for i in range(nb // ni):
start = i * ni
end = (i + 1) * ni
# int_values = np.arange(start, end, dtype='int64')
print(f'start insert {start}:{end}')
data = [vectors[start: end]]
collection_w.insert(data)
log.info(collection_w.num_entities)
log.info(f'{i} insert cost: {time.time() - s}')
- On one run, create collection hung even though timeout=20 was set, and no timeout exception was raised:
utility.list_collections()
['disk_3Y7MChs5', 'disk_r2NHiyMw', 'disk_pem1gMUE', 'disk_WjNQUIyW', 'disk_hZbh5k6Q']
Milvus logs: milvus_logs.tar.gz
Expected Behavior
Either the collection should be created successfully, or a timeout exception should be raised (see the sketch below).
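For reference, this is roughly the behavior the test expects, assuming the wrapper forwards timeout to pymilvus's create path. This is a sketch reusing the names from the script above, not a claim about the wrapper's internals:

from pymilvus import utility

try:
    # expected: returns within ~20 s or raises; it must not hang indefinitely
    collection_w.init_collection(name=cf.gen_unique_str("disk"),
                                 schema=schema, shards_num=1, timeout=20)
except Exception as e:  # pymilvus 2.x surfaces timeouts as MilvusException subclasses
    log.error(f"create collection failed or timed out: {e}")
else:
    # on success the new collection should be visible
    assert collection_w.name in utility.list_collections()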
Steps To Reproduce
No response
Anything else?
Pulsar topics stats: https://zilliverse.feishu.cn/docs/doccn34iE0kVvTinUFLqo7ZfNdc
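The "cannot create producer ... backlog quota exceeded" error in the title is Pulsar refusing new producers once the namespace backlog quota is exhausted (this happens under the producer_exception enforcement policy). A minimal sketch for inspecting that quota over Pulsar's admin REST API; the admin URL and the public/default tenant/namespace are assumptions about this deployment:

import requests

PULSAR_ADMIN = "http://pulsar-broker:8080"  # assumed Pulsar admin service URL

# backlog quota map for the namespace: storage limit plus the enforcement policy
# (producer_request_hold / producer_exception / consumer_backlog_eviction)
resp = requests.get(f"{PULSAR_ADMIN}/admin/v2/namespaces/public/default/backlogQuotaMap")
print(resp.json())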
About this issue
- State: closed
- Created 2 years ago
- Comments: 32 (32 by maintainers)
Commits related to this issue
- Make DataNode release rather than delete when reassign 1. Reassgin now will assign to the original Node if no other nodes avaliable 2. Make AddNode balance async: ToRealse + Reassign See also: #1... — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Make DataNode release rather than delete when reassign (#17293) 1. Reassgin now will assign to the original Node if no other nodes avaliable 2. Make AddNode balance async: ToRealse + Reassign See... — committed to milvus-io/milvus by XuanYang-cn 2 years ago
/assign @XuanYang-cn
/unassign
Deployed a fresh cluster with image master-20220520-d525e955. After the datanode and datacoord restarted 3 times because of a chaos disk-fill attack:
chaosd attack disk fill -c 87 -p /tmp
some subscriptions of the offline datanodes were not successfully unsubscribed; datanode 21 has obviously gone offline.
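To confirm which subscriptions are still attached to a channel topic after a datanode goes offline, one could query the Pulsar topic stats endpoint. A sketch; the admin URL and topic name are illustrative, not taken from this deployment:

import requests

PULSAR_ADMIN = "http://pulsar-broker:8080"                  # assumed admin URL
TOPIC = "persistent/public/default/by-dev-rootcoord-dml_0"  # illustrative Milvus channel topic

stats = requests.get(f"{PULSAR_ADMIN}/admin/v2/{TOPIC}/stats").json()
# every key is a subscription; any subscription belonging to the offline
# datanode should have been removed when its channels were reassigned
for name, sub in stats.get("subscriptions", {}).items():
    print(name, "backlog:", sub.get("msgBacklog"), "consumers:", len(sub.get("consumers", [])))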