accelerate: Possible memory leak when inferencing BLOOM 176B

System Info

- `Accelerate` version: 0.11.0
- Platform: Linux-4.18.0-305.25.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.13
- Numpy version: 1.22.3
- PyTorch version (GPU?): 1.11.0a0+gitbc2c6ed (True)
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Script: https://github.com/mayank31398/Megatron-DeepSpeed/blob/add-generation-server/scripts/inference/bloom-accelerate-server.py

Usage: python scripts/inference/bloom-accelerate-server.py --model_name bigscience/bloom --dtype bf16 --log_file data.log --host $ADDRESS --port $PORT

Memory blowup over time discussed here: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/308#issuecomment-1205757494

Expected behavior

I don't think this memory leak should occur.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 20 (6 by maintainers)

Most upvoted comments

This is not an issue anymore. Thanks for helping guys. Closing this 😃

What @ydshieh said, and more:

To track real memory usage / debug potential leaks, always:

  1. call gc.collect() first - since Python's GC is scheduled, without it you might miss an object and the release of its associated memory
  2. then clear the CUDA cache
  3. measure

But don't do any of the above in production.

Hi, I have something for you @mayank31398. Below is an example with t5-large (GPT-2 is too small to show the difference clearly).

  • model = model.to("cuda"): 3662 MB
  • After the generation is done, but before returning: 7746 MB
  • After the generation is done and control has returned to main: 7746 MB
  • After empty cache: 3732 MB

So emptying the cache brings GPU memory usage back down to roughly the level right after the model was loaded onto the GPU.

So I believe there is no issue when we do things locally. However, when combining with web frameworks, things get more complicated, and what to measure depends on your needs. In any case, I would suggest you check the initial GPU memory usage (after the model is loaded onto the GPU) and monitor how it evolves as inferences are performed.

Let us know if you have further questions.

Here is the code (FRANCE_ARTICLE, SHORTER_ARTICLE, IRAN_ARTICLE, and ARTICLE_SUBWAY are copied from https://github.com/huggingface/transformers/blob/0f257a87749e0a72bda260c6f319a45dae1e7c4d/tests/models/t5/test_modeling_t5.py#L924):

import pdb

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch


def run(model, dct):

    hypotheses_batch = model.generate(
        **dct,
        num_beams=4,
        length_penalty=2.0,
        max_length=142,
        min_length=56,
        no_repeat_ngram_size=3,
        do_sample=False,
        early_stopping=True,
    )

    print("gen. done")
    pdb.set_trace()


if __name__ == "__main__":

    ckpt = "t5-large"
    tokenizer = AutoTokenizer.from_pretrained(ckpt)

    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
    model.config.update(model.config.task_specific_params["summarization"])

    dct = tokenizer(
        [model.config.prefix + x for x in [FRANCE_ARTICLE, SHORTER_ARTICLE, IRAN_ARTICLE, ARTICLE_SUBWAY]],
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    print("input / model on cpu")
    pdb.set_trace()

    dct = dct.to("cuda")
    print("input to gpu")
    pdb.set_trace()

    model = model.to("cuda")
    print("model to gpu")
    pdb.set_trace()

    run(model, dct)
    print("model run done")
    pdb.set_trace()

    torch.cuda.empty_cache()
    print("clear done")
    pdb.set_trace()

I converted my server to Flask and ran it with Gunicorn with 1 worker. This serializes all requests, however.
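
For context, a minimal sketch of that setup (the module name, endpoint, and generation call are hypothetical, not taken from the linked server script):

# app.py - minimal Flask wrapper around generation (all names here are hypothetical)
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/generate", methods=["POST"])
def generate():
    text = request.json["text"]
    # output = run_model_generate(text)  # whatever the real server does
    output = text
    return jsonify({"output": output})

# Launched with a single worker, so requests are handled one at a time:
#   gunicorn --workers 1 --bind $ADDRESS:$PORT app:app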

Yes, thanks. I think I'll try to look at the memory usage over time by running in a for loop or something, to see how memory changes (both in server and non-server settings).

You may also be able to reclaim a bit more by running garbage collection after deleting the model in Python.

E.g.:

import gc
del model
gc.collect()

(Also, sorry for accidentally closing - I'm on mobile and hit the wrong button!)

Deleting the model is not an option for me. I am trying to use the model in a server setting for a lot of folks. Related PR: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/328

The PyTorch memory allocator reserves some memory for tensors up front, and that shows up in nvidia-smi.

This is correct. I also had an issue with this previously, see here. In short, PyTorch won't always release GPU memory - it can reuse it later (for faster operations), so high usage doesn't necessarily mean there is a memory issue.

But after empty_cache, we should see the usage drop, although perhaps only partially. So from your screenshot (interactive shell), it's strange that nothing is released.
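
A short sketch of the allocated-vs-reserved distinction behind this (standard PyTorch memory stats, assuming a CUDA device is available):

import torch

x = torch.randn(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated())  # memory held by live tensors
print(torch.cuda.memory_reserved())   # memory kept by the caching allocator (roughly what nvidia-smi shows)

del x
print(torch.cuda.memory_allocated())  # drops: the tensor is gone
print(torch.cuda.memory_reserved())   # stays: the blocks are cached for reuse

torch.cuda.empty_cache()
print(torch.cuda.memory_reserved())   # drops (at least partially) once the cache is released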

Hi @mayank31398

Could you try to run your code snippet as a script and measure the memory usage

  • after the model is loaded
  • after a single model forward pass
  • after model.generate

You can set breakpoints in the script to do so.
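
The three measurements could also be taken programmatically instead of reading nvidia-smi at each breakpoint; a sketch reusing t5-large from the snippet above (the short input text is just a placeholder):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def report(stage):
    print(f"{stage}: {torch.cuda.memory_allocated() / 2**20:.0f} MB allocated")


tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large").to("cuda")
report("after model load")

inputs = tokenizer("summarize: a short placeholder article", return_tensors="pt").to("cuda")
with torch.no_grad():
    # decoder_input_ids is only passed to make a bare forward pass valid for T5
    model(**inputs, decoder_input_ids=inputs.input_ids)
report("after a single forward pass")

model.generate(**inputs, max_length=64)
report("after generate")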

Thanks, I can confirm that this issue is not occurring with Starlette and FastAPI (which is built on top of Starlette). Not sure why this happens with Flask. Closing this ❤️