DeepSpeed: [BUG] Host Memory Efficiency

Describe the bug

I am not sure whether this is a bug or a consequence of the current architecture.

During inference, the current implementation consumes more host memory than the size of the model checkpoint. In my experiments, consumption is roughly double the expected amount.

GPT-Neo 2.7B, 10 GPUs:

  • Estimated host memory consumption: 99 GB (10 GPUs * 9.9 GB checkpoint)
  • Measured host memory consumption: 221 GB

Also, because consumption scales with num_gpus, using more GPUs requires proportionally more host memory.

For small models, host memory is not a problem, but large models come with large checkpoints, so loading them across many GPUs will trigger a host-memory OOM.

It seems the model checkpoint should be shared across processes rather than loaded separately by each one. Do you have any comments or plans to improve this?
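One possible direction, sketched below, is to let local ranks map the same weight data instead of each keeping a private copy. This is not an existing DeepSpeed mechanism; it is only a minimal illustration using PyTorch's file-backed shared storage, and the helpers export_raw / load_shared are hypothetical names. It assumes the weights have first been exported once as raw float32 files:

import torch

def export_raw(param, path):
    # One process (or an offline step) writes the weight once as raw float32 bytes.
    param.detach().cpu().float().contiguous().numpy().tofile(path)

def load_shared(path, shape):
    # Every local rank memory-maps the same file with shared storage, so the OS
    # page cache backs all ranks with a single physical copy of the data.
    numel = 1
    for d in shape:
        numel *= d
    storage = torch.FloatStorage.from_file(path, shared=True, size=numel)
    return torch.FloatTensor(storage).view(*shape)

With something along these lines, ten ranks reading the same 9.9 GB of weights would touch roughly 9.9 GB of physical host memory instead of ~99 GB, at the cost of an extra export step and read-only sharing.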

To Reproduce

Steps to reproduce the behavior:

import os
import deepspeed
import torch
from transformers import pipeline
import time
import datetime

def init():
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    world_size = int(os.getenv('WORLD_SIZE', '1'))
    # Each rank builds the full Hugging Face pipeline, i.e. loads the entire
    # checkpoint into host memory before the model is placed on its GPU.
    generator = pipeline(
        'text-generation', model='EleutherAI/gpt-neo-2.7B', device=local_rank)
    # Wrap the model for tensor-parallel inference across all ranks.
    generator.model = deepspeed.init_inference(generator.model,
                                               mp_size=world_size,
                                               dtype=torch.float,
                                               replace_method='auto')
    return generator

def predict(text, max_len):
    # Synchronize ranks before generation.
    torch.distributed.barrier()
    with torch.no_grad():
        string = generator(text, do_sample=True,
                           min_length=max_len,
                           max_length=max_len,
                           top_k=50,
                           temperature=1.0,
                           top_p=1.0,
                           num_return_sequences=1,
                           pad_token_id=3)
    return string

if __name__ == '__main__':
    generator = init()
    torch.cuda.empty_cache()
    text = 'a'
    seq = [50, 100, 300, 1000, 2048]
    for i in seq:
        avg_time = 0
        for j in range(5):
            start_time = time.time()
            string = predict(text, i)
            torch.distributed.barrier()
            avg_time += (time.time() - start_time)
        spend_time = str(datetime.timedelta(seconds=avg_time / 5))
        print(f'[{torch.distributed.get_rank()}] ##### seq: {i}, avg_spend_time: {spend_time}')
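To pin down where the extra host memory appears on each rank, a small check like the sketch below can be added to init(), once right after pipeline() and once after deepspeed.init_inference(). psutil is assumed to be installed; it is not required by the repro itself.

import os
import psutil  # assumed available: pip install psutil

def host_rss_gb():
    # Resident set size of the current rank's process, in GB.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3

# Example (hypothetical placement inside init()):
#   print(f'[{local_rank}] RSS after pipeline():       {host_rss_gb():.1f} GB')
#   print(f'[{local_rank}] RSS after init_inference(): {host_rss_gb():.1f} GB')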

Expected behavior

  • Host memory consumption should be close to the size of the checkpoint.

ds_report output:

JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
/tmp/io_submithubcvos0.c: In function ‘main’:
/tmp/io_submithubcvos0.c:2:5: warning: implicit declaration of function ‘io_submit’ [-Wimplicit-function-declaration]
    2 |     io_submit();
      |     ^~~~~~~~~
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.9.0a0+c3d40fd
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.4.4+3e4dd96, 3e4dd96, reyazda/large-model-inference
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.3

System info:

  • OS: Ubuntu 18.04
  • GPU count and types: 16x A100
  • Python version: 3.8

Launcher context:

  • deepspeed --num_gpus [2,4,5,10] test.py

Docker context:

  • NGC docker 21.6


About this issue

  • State: open
  • Created 3 years ago
  • Comments: 16 (13 by maintainers)

Most upvoted comments

@hyunwoongko

Cool! Awesome.

It’s a much-needed feature, and I really hope it gets merged.

I also experimented with your amazing open-source project Parallelformer a while ago.

There was a small issue in the Docker environment, so I did not proceed further, but if it is included in DeepSpeed, a large open-source project, it will shine even more.

I look forward to that day.

Thank you.

@hyunwoongko, I will try to repro this again and open an issue on your side. Thanks, Reza