accelerate: Possible memory leak when inferencing BLOOM 176B
System Info
- `Accelerate` version: 0.11.0
- Platform: Linux-4.18.0-305.25.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.13
- Numpy version: 1.22.3
- PyTorch version (GPU?): 1.11.0a0+gitbc2c6ed (True)
- `Accelerate` default config:
Not found
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Usage: `python scripts/inference/bloom-accelerate-server.py --model_name bigscience/bloom --dtype bf16 --log_file data.log --host $ADDRESS --port $PORT`
Memory blowup over time discussed here: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/308#issuecomment-1205757494
Expected behavior
I don't think this memory leak should occur.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 20 (6 by maintainers)
This is not an issue anymore. Thanks for helping guys. Closing this 😃
What @ydshieh said and more: to track real memory usage / debug potential leaks, always run `gc.collect()` first, since Python's GC is scheduled and without it you might miss an object and its associated memory release. But don't do any of the above in production.
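A minimal sketch of that kind of debugging measurement (the helper name and the MB units are mine, not from the thread):

```python
import gc
import torch

def report_cuda_memory(tag: str) -> None:
    """Print GPU memory after forcing Python GC and clearing PyTorch's cache."""
    # Debugging only: empty_cache() makes later allocations slower, so keep it
    # out of production code, as noted above.
    gc.collect()                       # drop Python objects that still reference tensors
    torch.cuda.empty_cache()           # return cached but unused blocks to the driver
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={allocated:.0f} MB, reserved={reserved:.0f} MB")
```

Called before and after an inference, a steadily growing `allocated` figure (not just `reserved`) is the real sign of a leak.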
Hi, I have something for you @mayank31398. Below is an example with `t5-large` (GPT2 is quite small to see the difference). `model = model.to("cuda")`: 3662 MB. So emptying the cache can bring the GPU memory usage back to the point where the model is loaded to GPU.
So I believe there is no issue when we do things locally. However, when combining with web frameworks, things get more complicated, and the measurement depends on your needs. In any case, I would suggest you check the initial GPU memory usage (after the model loaded to GPU), and monitor its usage once inferences are performed.
Let us know if you have further questions.
Here is the code (the `FRANCE_ARTICLE`, `SHORTER_ARTICLE`, `IRAN_ARTICLE`, `ARTICLE_SUBWAY` inputs are copied from https://github.com/huggingface/transformers/blob/0f257a87749e0a72bda260c6f319a45dae1e7c4d/tests/models/t5/test_modeling_t5.py#L924).
I converted my server to Flask and ran it with gunicorn with 1 worker. However, this serializes all requests.
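A minimal sketch of that kind of Flask setup (endpoint name, model, and generation parameters are placeholders, not the original `bloom-accelerate-server.py`):

```python
# app.py - hypothetical Flask wrapper around a generation model
import torch
from flask import Flask, jsonify, request
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

app = Flask(__name__)

# Loaded once per worker; with a single gunicorn worker this happens exactly once.
tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large").to("cuda").eval()

@app.route("/generate", methods=["POST"])
def generate():
    text = request.json["text"]
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to("cuda")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=64)
    return jsonify({"output": tokenizer.decode(output_ids[0], skip_special_tokens=True)})
```

Run with e.g. `gunicorn -w 1 -b $ADDRESS:$PORT app:app`; a single sync worker handles one request at a time, which is what serializes the requests.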
Yes, thanks. I think I'll try tracking the memory usage over time by running inference in a for loop or something, to see how the memory changes (in both server and non-server settings).
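A minimal sketch of that loop-based check (model and prompt are illustrative, not the BLOOM setup):

```python
import gc
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large").to("cuda").eval()
print(f"after load: {torch.cuda.memory_allocated() / 2**20:.0f} MB allocated")

inputs = tokenizer("summarize: " + "some long article text " * 200,
                   return_tensors="pt", truncation=True).to("cuda")

for step in range(50):
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=64)
    if (step + 1) % 10 == 0:
        gc.collect()                   # measure only what is really still referenced
        torch.cuda.empty_cache()
        print(f"after {step + 1} runs: "
              f"{torch.cuda.memory_allocated() / 2**20:.0f} MB allocated")
```

If the per-iteration number keeps climbing outside any web framework, the leak is in the inference code; if it only climbs under Flask, the framework (or how requests/responses are held) is the suspect.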
Deleting the model is not an option for me. I am trying to use the model in a server setting for a lot of folks. Related PR: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/328
This is correct. I also had an issue previously, see here. In short, PyTorch won't always release GPU memory right away - it may re-use it later (for faster allocations), so a high reading doesn't necessarily mean there is a memory issue.
But after `empty_cache`, we should see the usage drop, although perhaps only partially. So from your screenshot (interactive shell), it's strange that nothing is released.
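A small illustration of that allocator behaviour (tensor size is arbitrary):

```python
import torch

x = torch.empty(1024, 1024, 256, device="cuda")               # ~1 GiB of float32
print(torch.cuda.memory_allocated() / 2**20, "MB allocated")  # ~1024
print(torch.cuda.memory_reserved() / 2**20, "MB reserved")    # >= allocated

del x  # the tensor is gone, but PyTorch keeps the block in its cache
print(torch.cuda.memory_allocated() / 2**20, "MB allocated")  # ~0
print(torch.cuda.memory_reserved() / 2**20, "MB reserved")    # still ~1024; nvidia-smi agrees

torch.cuda.empty_cache()  # hand cached blocks back to the driver
print(torch.cuda.memory_reserved() / 2**20, "MB reserved")    # drops, at least partially
```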
Hi @mayank31398, could you try to run your code snippet as a script and measure the memory usage? You can set a breakpoint in the script to do so.
Thanks, I can confirm that this issue is not occurring with Starlette or FastAPI (which is built on top of Starlette). Not sure why this happens with Flask. Closing this ❤️
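For reference, a minimal FastAPI version of the same kind of endpoint (again a sketch with placeholder names, not the actual server):

```python
# app_fastapi.py - hypothetical FastAPI equivalent of the Flask sketch above
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large").to("cuda").eval()

class GenerateRequest(BaseModel):
    text: str

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.text, return_tensors="pt", truncation=True).to("cuda")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=64)
    return {"output": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```

Serve with e.g. `uvicorn app_fastapi:app --host $ADDRESS --port $PORT`.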