transformers: `bigscience/T0` multi-gpu inference exits with return code -9

Environment info

  • transformers version: 4.17.0.dev0
  • Platform: Linux-5.13.0-37-generic-x86_64-with-glibc2.10
  • Python version: 3.8.0
  • PyTorch version (GPU?): 1.10.1 (True)
  • Tensorflow version (GPU?): 2.8.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes (deepspeed)

Information

Model I am using: T0

The problem arises when using:

  • my own modified scripts: the script below, adapted from the example in issue #15399

The task I am working on is:

  • my own task or dataset: running inference with T0 (details below)

To reproduce

I want to load T0 across two 24GB GPUs with DeepSpeed in order to run inference. I followed the example code given in issue #15399.

When running the code below, after the model prints "finished initializing model with 11.14B parameters", it quits without producing a model response. There is no error or traceback, just a return code of -9:

[2022-04-05 16:18:09,845] [WARNING] [runner.py:155:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-04-05 16:18:09,912] [INFO] [runner.py:438:main] cmd = /home/aadelucia/miniconda3/envs/fda_cersi_tobacco/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 multi_gpu_T0.py
[2022-04-05 16:18:10,635] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2022-04-05 16:18:10,635] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=2, node_rank=0
[2022-04-05 16:18:10,635] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2022-04-05 16:18:10,635] [INFO] [launch.py:123:main] dist_world_size=2
[2022-04-05 16:18:10,635] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2022-04-05 16:18:11,702] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2022-04-05 16:18:56,295] [INFO] [partition_parameters.py:456:__exit__] finished initializing model with 11.14B parameters
[2022-04-05 16:19:40,754] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 406939
[2022-04-05 16:19:40,754] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 406940
[2022-04-05 16:19:40,754] [ERROR] [launch.py:184:sigkill_handler] ['/home/aadelucia/miniconda3/envs/fda_cersi_tobacco/bin/python', '-u', 'multi_gpu_T0.py', '--local_rank=1'] exits with return code = -9
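
For what it's worth, a return code of -9 means the processes were killed with SIGKILL rather than raising a Python exception; on Linux that usually points to the kernel OOM killer, so most likely the host ran out of CPU RAM while each process was loading the full checkpoint.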

Here is the code. Run with deepspeed --num_gpus 2 <script.py>

"""
Example code to load a PyTorch model across GPUs

Code from https://github.com/huggingface/transformers/issues/15399
"""
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import torch
import os

seed = 42
torch.manual_seed(seed)

###
# Deepspeed setup
###
# To avoid warnings about parallelism in tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# distributed setup
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

model_name = "bigscience/T0"
config = AutoConfig.from_pretrained(model_name)
model_hidden_size = config.d_model

ds_config = {
    "fp16": {
        "enabled": False,
    },
    "bf16": {
        "enabled": True,
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    # batch size has to be divisible by world_size, but can be bigger than world_size
    "train_batch_size": 1 * world_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}

# Initialize model
# must setup HfDeepSpeedConfig before instantiating the model
# ds_config is deepspeed config object or path to the file
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)  # should be 1024
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# we are ready to initialise deepspeed ZeRO now
ds_engine = deepspeed.initialize(model=model,
                                 config_params=ds_config,
                                 model_parameters=None,
                                 optimizer=None,
                                 lr_scheduler=None)[0]
ds_engine.module.eval()  # inference
rank = torch.distributed.get_rank()
text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"

inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)

# Generation options
# https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/model#transformers.generation_utils.GenerationMixin.generate
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, max_length=256)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

Expected behavior

T0 should load across 2 GPUs, generate an answer, and then quit.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 20 (13 by maintainers)

Most upvoted comments

@stas00, thank you so much for your help! I’m answering for @gportill since we were working on this issue together.

Summary of what worked:

  1. Install transformers and DeepSpeed from GitHub

    pip install git+https://github.com/huggingface/transformers.git#egg=transformers
    pip install git+https://github.com/microsoft/DeepSpeed.git#egg=deepspeed
    
  2. If using NVMe offload, install the Linux-native asynchronous I/O library:

    sudo apt install libaio-dev
    
  3. If using CPU offload, increase swap memory with Stas’ directions: https://github.com/huggingface/transformers/issues/16616#issuecomment-1102834737

  4. Load sharded model (only some models are available sharded, T0 and T0pp included)

    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, revision="sharded")
    

Full working example:

This example was modified from https://github.com/huggingface/transformers/issues/15399#issue-1117950014 and assumes all of the “summary of what worked” steps were taken.

#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it, or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster without CPU offload - in that case disable that section.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py

# Imports
from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os
# To avoid warnings about parallelism in tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import torch
from argparse import ArgumentParser


#################
# DeepSpeed Config
#################
def generate_ds_config(args):
    """
    ds_config notes

    - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
    faster.

    - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
    all official t5 models are bf16-pretrained

    - set offload_param.device to "none" or completely remove the `offload_param` section if you don't
      want CPU offload

    - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
      which params should remain on gpus - the larger the value the smaller the offload size

    For in-depth info on the Deepspeed config see
    https://huggingface.co/docs/transformers/main/main_classes/deepspeed

    The dict below mirrors the json config file format, except it uses Python booleans
    (True/False) instead of json's lower-case true/false.
    """

    config = AutoConfig.from_pretrained(args.model_name)
    world_size = int(os.getenv("WORLD_SIZE", "1"))
    model_hidden_size = config.d_model

    # batch size has to be divisible by world_size, but can be bigger than world_size
    train_batch_size = args.batch_size * world_size

    ds_config = {
        "fp16": {
            "enabled": False
        },
        "bf16": {
            "enabled": False
        },
        "zero_optimization": {
            "stage": 3,
            "offload_param": {
                "device": args.offload,
                "nvme_path": args.nvme_offload_path,
                "pin_memory": True,
                "buffer_count": 6,
                "buffer_size": 1e8,
                "max_in_cpu": 1e9
            },
            "aio": {
                "block_size": 262144,
                "queue_depth": 32,
                "thread_count": 1,
                "single_submit": False,
                "overlap_events": True
            },
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": model_hidden_size * model_hidden_size,
            "stage3_prefetch_bucket_size": 0.1 * model_hidden_size * model_hidden_size,
            "stage3_max_live_parameters": 1e8,
            "stage3_max_reuse_distance": 1e8,
            "stage3_param_persistence_threshold": 10 * model_hidden_size
        },
        "steps_per_print": 2000,
        "train_batch_size": train_batch_size,
        "train_micro_batch_size_per_gpu": 1,
        "wall_clock_breakdown": False
    }
    return ds_config


#################
# Helper Methods
#################
def parse_args():
    """Parse program options"""
    parser = ArgumentParser()
    parser.add_argument("--model-name", default="bigscience/T0", help="Name of model to load.")
    parser.add_argument("--offload", choices=["nvme", "cpu", "none"], default="none",
                        help="DeepSpeed parameter offload target for ZeRO stage 3.")
    parser.add_argument("--nvme-offload-path", default="/tmp/nvme-offload",
                        help="Path for NVMe offload. Ensure the path exists with correct write permissions.")
    parser.add_argument("--batch-size", type=int, default=1, help="Effective batch size is batch-size * # GPUs")
    return parser.parse_args()


#################
# Main
#################
# Distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()
args = parse_args()
ds_config = generate_ds_config(args)

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
# T0 and T0pp have a special "sharded" revision on the Hub, which needs much less CPU RAM to load
revision = None
if args.model_name in ["bigscience/T0", "bigscience/T0pp"]:
    revision = "sharded"
model = AutoModelForSeq2SeqLM.from_pretrained(args.model_name, revision=revision)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(args.model_name)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)

# synced_gpus (bool, optional, defaults to False) —
# Whether to continue running the while loop until max_length. Needed for ZeRO stage 3,
# otherwise a rank that finishes generating early would stop joining the collective ops
# the other ranks still need.
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}\n")

And the command to run it:

export CUDA_LAUNCH_BLOCKING=0
export OMP_NUM_THREADS=1
python -m torch.distributed.run --nproc_per_node=2 T0_inference.py 

OK, I managed to crash my system with the 11B version with 2 gpus.

Need to figure out cgroup v2 as I moved to Ubuntu 21.10 and my v1 setup no longer works.

Meanwhile I figured out how to run a shell that will not let any processes started from it use more memory than I told it to, and thus not kill the host:

systemd-run --user --scope -p MemoryHigh=100G -p MemoryMax=110G -p MemorySwapMax=60G bash
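
(If I read the systemd docs right, MemoryHigh is a soft limit above which the kernel throttles and aggressively reclaims memory, while MemoryMax is a hard cap beyond which processes inside the scope get OOM-killed - so only the experiment dies, not the host.)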

But since we have this huge checkpoint of 42GB I don't have enough RAM to load it twice in 2 processes. We have just added sharded checkpoints, so we need to switch T0 to them.

And meanwhile I’m trying to figure out how to get this to run with nvme offload.

I will update more once I have something running.

That’s a really neat summary and code parametrization, @AADeLucia - great work!

Just to add that with the sharded model it’s now possible to infer T0 (42GB) and other similar models in fp32 using just 2x 24GB gpus, w/ deepspeed w/o any offload.

But if you have smaller GPUs, or just one GPU, or larger models, then the above script lets you offload to CPU RAM if you have lots of it, or failing that to an NVMe device - each making the performance progressively slower.
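
For reference, here is a minimal sketch of that no-offload setup, condensed from the full scripts in this thread: sharded T0 in fp32 with ZeRO-3 and no offload_param section. It assumes the "sharded" revision is available as described above and reuses one of the earlier prompts; launch it with the same deepspeed/torchrun commands as before.

#!/usr/bin/env python
# Minimal ZeRO-3 inference sketch: T0 sharded checkpoint, fp32, no CPU/NVMe offload.
from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os
import torch

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

model_name = "bigscience/T0"
model_hidden_size = AutoConfig.from_pretrained(model_name).d_model

# No offload_param section: parameters stay sharded across the GPUs
# (the 11B model is roughly 22GB per GPU in fp32 on 2 GPUs).
ds_config = {
    "fp16": {"enabled": False},
    "bf16": {"enabled": False},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.1 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size,
    },
    "steps_per_print": 2000,
    "train_batch_size": 1 * world_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False,
}

dschf = HfDeepSpeedConfig(ds_config)  # must exist before from_pretrained is called
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, revision="sharded")
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()

tokenizer = AutoTokenizer.from_pretrained(model_name)
text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
print(f"rank{torch.distributed.get_rank()}: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")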

And once:

  1. transformers>4.18.0
  2. deepspeed>0.6.3

are released, you can install the released versions instead of the git versions.

Here is the NVMe offload version that I tested with. It works great even with 1 or 2 tiny GPUs - I didn't see more than 3GB used on each, but it's slow of course.

#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it, or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster without CPU offload - in that case disable that section.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py


from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

model_name = "bigscience/T0"
#model_name = "bigscience/T0_3B"

config = AutoConfig.from_pretrained(model_name)
model_hidden_size = config.d_model

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
#
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
#
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section if you don't
#   want CPU offload
#
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
#   which params should remain on gpus - the larger the value the smaller the offload size
#
# For in-depth info on the Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# XXX: modified this script to use nvme offload so need to explain the new configs, but the key is
# to change the path to `nvme_path`

# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/mnt/nvme0/offload",
            "pin_memory": True,
            "buffer_count": 6,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9
        },
        "aio": {
            "block_size": 262144,
            "queue_depth": 32,
            "thread_count": 1,
            "single_submit": False,
            "overlap_events": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.1 * model_hidden_size * model_hidden_size,
        "stage3_max_live_parameters": 1e8,
        "stage3_max_reuse_distance": 1e8,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)#, low_cpu_mem_usage=True)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
#from transformers.deepspeed import is_deepspeed_zero3_enabled
#print(f"Deepspeed 3 is enabled: {is_deepspeed_zero3_enabled()}")
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")