milvus: [Bug]: Datanode panics after OOM and several restarts

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.2.3
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

With 1000 partitions, inserting 5k rows into each partition repeatedly, the datanode gets OOMKilled with a memory limit of 1GB. After it OOMs several times, reset the memory limit to 16GB, restart the datanode, and restart the datacoord. The datanode then logs the following warning and panics at flush_manager.go (line 852):
[2023/02/17 08:54:43.026 +00:00] [WARN] [datanode/flush_manager.go:852] ["failed to SaveBinlogPaths"] ["segment ID"=439516335357283168] [error="All attempts results:\nattempt #1:segment 439516335357283168 not found\nattempt #2:segment 439516335357283168 not found\nattempt #3:segment 439516335357283168 not found\nattempt #4:segment 439516335357283168 not found\nattempt #5:segment 439516335357283168 not found\nattempt #6:segment 439516335357283168 not found\nattempt #7:segment 439516335357283168 not found\nattempt #8:segment 439516335357283168 not found\nattempt #9:segment 439516335357283168 not found\nattempt #10:segment 439516335357283168 not found\nattempt #11:segment 439516335357283168 not found\n"]

Expected Behavior

The datanode should be healthy after resetting the memory limit to 16GB.

Steps To Reproduce

1. use docker-compose to launch a cluster with this yaml
https://github.com/milvus-io/milvus/files/10766154/deployment_yaml.zip
`docker-compose --compatibility up -d`
2. create a collection with two fields: id (int64, auto_id=true) and vector (512 dim); a pymilvus sketch covering steps 2-4 and step 12 follows this list
3. create 1000 partitions with names "part_1", "part_2", "part_3", ...
4. insert 5k rows into each partition repeatedly, in this order:
insert(5k, part_1) ---> insert(5k, part_2) ---> insert(5k, part_3) ---> ...
5. the datanode will run out of memory quickly
6. restart the datanode
`docker restart [datanode_container_id]`
7. insert more data
8. the datanode will be OOMKilled by Docker
9. restart datacoord manually (I think this step is required; you might need to restart it several times)
`docker restart [datacoord_container_id]`
10. after the datanode starts, you will see it run out of memory again
11. use command `docker update -m 20G [datanode_container_id]` to update the memory limit of the data node
12. in the client side, call flush() to flush the data
13. restart the datanode manually again (maybe several times)
14. observe the datanode: it will panic by itself; check the log and you will see the "failed to SaveBinlogPaths" message
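
As referenced in step 2, here is a minimal pymilvus sketch of steps 2-4 and 12, assuming pymilvus 2.2.x and a Milvus instance reachable at localhost:19530; the collection name "oom_repro", the random test data, and the pass count are illustrative assumptions, not taken from the original report:

```python
# Minimal reproduction sketch for steps 2-4 and 12.
# Assumptions: pymilvus 2.2.x, Milvus at localhost:19530; the collection name
# "oom_repro" and the random test data are illustrative, not from the report.
import random

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

DIM = 512
NI = 5000            # rows per insert, as in the report
NUM_PARTITIONS = 1000

connections.connect(host="localhost", port="19530")

# Step 2: two fields, an auto-generated int64 primary key and a 512-dim vector.
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("vector", DataType.FLOAT_VECTOR, dim=DIM),
])
collection = Collection("oom_repro", schema)

# Step 3: create 1000 partitions named part_1 .. part_1000.
for i in range(1, NUM_PARTITIONS + 1):
    collection.create_partition(f"part_{i}")

# Step 4: insert 5k rows into each partition in order; repeat the whole pass
# until the datanode runs out of memory.
for _ in range(100):  # arbitrary number of passes
    for i in range(1, NUM_PARTITIONS + 1):
        vectors = [[random.random() for _ in range(DIM)] for _ in range(NI)]
        collection.insert([vectors], partition_name=f"part_{i}")

# Step 12: after raising the container memory limit, flush from the client.
collection.flush()
```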

Milvus Log

No response

Anything else?

No response

About this issue

  • State: closed
  • Created a year ago
  • Comments: 18 (16 by maintainers)

Most upvoted comments

After verification and discussion, the conclusion is that the memory-sync policy can alleviate datanode OOMKills but cannot completely prevent them. The following two scenarios were tested:

  1. There are many (1000) partitions in the collection, but the amount of data inserted each time is small (ni=5000, dim=512); the memory-sync policy can sync segments when datanode memory usage is high (65%)

  2. The collection has very few partitions (1), and 6 clients insert in parallel, each inserting a large amount of data (ni=50,000, dim=128); the memory-sync policy takes effect, but the datanode is still OOMKilled
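
To make scenario 2 concrete, here is a minimal client-side sketch, assuming pymilvus 2.2.x and an existing collection with a 128-dim FLOAT_VECTOR field; the collection name "oom_repro", the host address, and the use of threads to simulate 6 parallel clients are illustrative assumptions:

```python
# Scenario 2 driver sketch. Assumptions: pymilvus 2.2.x, an existing collection
# "oom_repro" with a 128-dim FLOAT_VECTOR field; names are illustrative.
import random
import threading

from pymilvus import Collection, connections

DIM = 128
NI = 50_000        # rows per insert ("5w" in the comment above)
NUM_CLIENTS = 6

connections.connect(host="localhost", port="19530")
collection = Collection("oom_repro")

def insert_forever() -> None:
    # Each "client" keeps inserting large batches into the default partition,
    # growing the datanode's in-memory buffers faster than the memory-sync
    # policy can flush them.
    while True:
        vectors = [[random.random() for _ in range(DIM)] for _ in range(NI)]
        collection.insert([vectors])

threads = [
    threading.Thread(target=insert_forever, daemon=True)
    for _ in range(NUM_CLIENTS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # runs until interrupted or the datanode is OOMKilled
```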