accelerate: Possible memory leak when inferencing BLOOM 176B

System Info

- `Accelerate` version: 0.11.0
- Platform: Linux-4.18.0-305.25.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.13
- Numpy version: 1.22.3
- PyTorch version (GPU?): 1.11.0a0+gitbc2c6ed (True)
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Script: https://github.com/mayank31398/Megatron-DeepSpeed/blob/add-generation-server/scripts/inference/bloom-accelerate-server.py

Usage: python scripts/inference/bloom-accelerate-server.py --model_name bigscience/bloom --dtype bf16 --log_file data.log --host $ADDRESS --port $PORT

Memory blowup over time discussed here: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/308#issuecomment-1205757494

Expected behavior

I don't think this memory leak should occur.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 20 (6 by maintainers)

Most upvoted comments

This is not an issue anymore. Thanks for helping guys. Closing this 😃

What @ydshieh said, and more:

To track real memory usage / debug potential leaks, always:

  1. call gc.collect() first - since Python's GC is scheduled, without it you might miss an object and the release of its associated memory
  2. then clear the CUDA cache
  3. measure

But don't do any of the above in production.

Hi, I have something for you @mayank31398. Below is an example with t5-large (GPT-2 is too small to show the difference clearly).

  • model = model.to("cuda"): 3662 MB
  • After the generation is done, but before returning: 7746 MB
  • After the generation is done and control has returned to main: 7746 MB
  • After empty cache: 3732 MB

So emptying the cache brings GPU memory usage back down to roughly the level right after the model was loaded onto the GPU.

So I believe there is no issue when we do things locally. However, when combining with web frameworks, things get more complicated, and what to measure depends on your needs. In any case, I would suggest you check the initial GPU memory usage (after the model is loaded onto the GPU) and monitor how it evolves as inferences are performed.

Let us know if you have further questions.

Here is the code (FRANCE_ARTICLE, SHORTER_ARTICLE, IRAN_ARTICLE, and ARTICLE_SUBWAY are copied from https://github.com/huggingface/transformers/blob/0f257a87749e0a72bda260c6f319a45dae1e7c4d/tests/models/t5/test_modeling_t5.py#L924):

import pdb

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch


def run(model, dct):

    hypotheses_batch = model.generate(
        **dct,
        num_beams=4,
        length_penalty=2.0,
        max_length=142,
        min_length=56,
        no_repeat_ngram_size=3,
        do_sample=False,
        early_stopping=True,
    )

    print("gen. done")
    pdb.set_trace()


if __name__ == "__main__":

    ckpt = "t5-large"
    tokenizer = AutoTokenizer.from_pretrained(ckpt)

    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
    model.config.update(model.config.task_specific_params["summarization"])

    dct = tokenizer(
        [model.config.prefix + x for x in [FRANCE_ARTICLE, SHORTER_ARTICLE, IRAN_ARTICLE, ARTICLE_SUBWAY]],
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    print("input / model on cpu")
    pdb.set_trace()

    dct = dct.to("cuda")
    print("input to gpu")
    pdb.set_trace()

    model = model.to("cuda")
    print("model to gpu")
    pdb.set_trace()

    run(model, dct)
    print("model run done")
    pdb.set_trace()

    torch.cuda.empty_cache()
    print("clear done")
    pdb.set_trace()

I converted my server to Flask and ran it with Gunicorn with 1 worker. This serializes all requests, however.
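
For context, a minimal sketch of that setup (the module name, endpoint, and generation call are hypothetical, not taken from the linked server script):

# app.py - minimal Flask wrapper around generation (all names here are hypothetical)
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/generate", methods=["POST"])
def generate():
    text = request.json["text"]
    # output = run_model_generate(text)  # whatever the real server does
    output = text
    return jsonify({"output": output})

# Launched with a single worker, so requests are handled one at a time:
#   gunicorn --workers 1 --bind $ADDRESS:$PORT app:app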

Yes, thanks. I think I'll try to look at the memory usage over time by running in a for loop or something, to see how memory changes (both in server and non-server settings).

You may also be able to reclaim a bit more by running garbage collection after deleting the model in Python.

E.g.:

import gc
del model
gc.collect()

(Also, sorry for accidentally closing - I'm on mobile and hit the wrong button!)

Deleting the model is not an option for me. I am trying to use the model in a server setting for a lot of folks. Related PR: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/328

The PyTorch memory allocator reserves some memory for tensors up front, and that shows up in nvidia-smi.

This is correct. I also had an issue with this previously, see here. In short, PyTorch won't always release GPU memory - it can reuse it later (for faster operations), so high usage doesn't necessarily mean there is a memory issue.

But after empty_cache, we should see the usage drop, although perhaps only partially. So from your screenshot (interactive shell), it's strange that nothing is released.
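
A short sketch of the allocated-vs-reserved distinction behind this (standard PyTorch memory stats, assuming a CUDA device is available):

import torch

x = torch.randn(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated())  # memory held by live tensors
print(torch.cuda.memory_reserved())   # memory kept by the caching allocator (roughly what nvidia-smi shows)

del x
print(torch.cuda.memory_allocated())  # drops: the tensor is gone
print(torch.cuda.memory_reserved())   # stays: the blocks are cached for reuse

torch.cuda.empty_cache()
print(torch.cuda.memory_reserved())   # drops (at least partially) once the cache is released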

Hi @mayank31398

Could you try to run your code snippet as a script and measure the memory usage

  • after the model is loaded
  • after a single model forward pass
  • after model.generate

You can set breakpoints in the script to do so.
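
The three measurements could also be taken programmatically instead of reading nvidia-smi at each breakpoint; a sketch reusing t5-large from the snippet above (the short input text is just a placeholder):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def report(stage):
    print(f"{stage}: {torch.cuda.memory_allocated() / 2**20:.0f} MB allocated")


tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large").to("cuda")
report("after model load")

inputs = tokenizer("summarize: a short placeholder article", return_tensors="pt").to("cuda")
with torch.no_grad():
    # decoder_input_ids is only passed to make a bare forward pass valid for T5
    model(**inputs, decoder_input_ids=inputs.input_ids)
report("after a single forward pass")

model.generate(**inputs, max_length=64)
report("after generate")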

Thanks, I can confirm that this issue is not occurring with Starlette and FastAPI (which is built on top of Starlette). Not sure why this happens with Flask. Closing this ❤️