lmdeploy: [Bug] Memory leak for api_server
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
Describe the bug
When the api_server is started and a client sends requests, the memory usage of the api_server keeps increasing. If the client is killed, the server's memory usage does not drop.
In our test environment, memory usage grows by roughly 0.1% per 1000 prompts (on a machine with 116 GB of RAM in total).
Reproduction
- start server:
lmdeploy serve api_server ./workspace --server-name 0.0.0.0 --server-port 23333 --tp 1
- start profiling script:
python benchmark/profile_restful_api.py --server_addr 0.0.0.0:23333 --tokenizer_path /path/to/tokenizer --dataset /path/to/ShareGPT_V3_unfiltered_cleaned_split.json --concurrency 128 --num_prompts 50000
- observe the memory usage of the api_server process with htop (or with a small monitoring script like the one sketched below)
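For a more repeatable record than watching htop by eye, a small script can log the server's resident set size over time. This is only a sketch and assumes psutil is installed; the script name and the 10-second interval are arbitrary choices, and the api_server PID is passed as the first argument:

```python
# monitor_rss.py -- sketch: periodically log the RSS of the api_server process
import sys
import time

import psutil  # third-party: pip install psutil


def main(pid: int, interval_s: float = 10.0) -> None:
    proc = psutil.Process(pid)
    while True:
        rss_gib = proc.memory_info().rss / 1024 ** 3
        print(f"{time.strftime('%H:%M:%S')} RSS = {rss_gib:.3f} GiB", flush=True)
        time.sleep(interval_s)


if __name__ == "__main__":
    main(int(sys.argv[1]))
```

Run it as, for example, python monitor_rss.py <pid of the api_server process> while the profiling script is sending requests.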
Environment
sys.platform: linux
Python: 3.9.16 (main, Aug 15 2023, 19:38:56) [GCC 8.3.1 20190311 (Red Hat 8.3.1-3)]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11)
PyTorch: 2.1.2+cu118
TorchVision: 0.16.2+cu121
LMDeploy: 0.2.5+
transformers: 4.37.1
gradio: 3.50.2
fastapi: 0.104.1
pydantic: 2.6.0
About this issue
- Original URL
- State: closed
- Created 3 months ago
- Comments: 22 (12 by maintainers)
Yeah, I've got time for this now. After some research and testing, it seems to me that we have to trigger garbage collection on the server at a fixed interval.
I will open a pull request ASAP.
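For illustration only, a minimal sketch of what interval-based garbage collection on the server could look like; this is not the actual patch, and the interval, task name, and FastAPI startup wiring are assumptions:

```python
import asyncio
import gc

GC_INTERVAL_S = 60  # assumed interval; tune for the workload


async def periodic_gc() -> None:
    """Trigger a full garbage collection at a fixed interval."""
    while True:
        await asyncio.sleep(GC_INTERVAL_S)
        collected = gc.collect()
        print(f"periodic gc.collect() reclaimed {collected} objects", flush=True)


# Hypothetical wiring into the FastAPI app used by api_server:
# @app.on_event("startup")
# async def _start_periodic_gc():
#     asyncio.create_task(periodic_gc())
```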
Thanks, I can reproduce it!
Maybe we could call show_memory inside the while loop at intervals, with a snippet like this:
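The snippet referenced here was not preserved in this extract; the following is a sketch of the idea, with show_memory implemented via psutil purely as an assumption, and the check interval chosen arbitrarily:

```python
import time

import psutil


def show_memory(tag: str = "") -> None:
    """Assumed helper for illustration: print the current process RSS."""
    rss_gib = psutil.Process().memory_info().rss / 1024 ** 3
    print(f"[{tag}] RSS = {rss_gib:.3f} GiB", flush=True)


CHECK_EVERY = 100  # assumed: report once every 100 loop iterations
step = 0
while True:
    time.sleep(0.01)  # placeholder for the server's real per-iteration work
    step += 1
    if step % CHECK_EVERY == 0:
        show_memory(f"step {step}")
```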
@AllentDan I tried to add the following code to debug:
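The exact debug code was not included in this extract; one common way to locate such growth is a tracemalloc snapshot diff, sketched below. The use of tracemalloc, the batch boundaries, and the top-10 cutoff are assumptions, not the reporter's actual code:

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 stack frames per allocation
before = tracemalloc.take_snapshot()

# ... serve a batch of requests here (e.g. another 1000 prompts) ...

after = tracemalloc.take_snapshot()
# Rank allocation sites by how much their total size grew between snapshots.
for stat in after.compare_to(before, "traceback")[:10]:
    print(stat)
    for line in stat.traceback.format():
        print("   ", line)
```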
And found that the memory diff is mainly caused by dlpack: