transformers: `bigscience/T0` multi-gpu inference exits with return code -9
Environment info
- `transformers` version: 4.17.0.dev0
- Platform: Linux-5.13.0-37-generic-x86_64-with-glibc2.10
- Python version: 3.8.0
- PyTorch version (GPU?): 1.10.1 (True)
- Tensorflow version (GPU?): 2.8.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes (deepspeed)
Who can help
Library:
- Deepspeed: @stas00
- Text generation: @patrickvonplaten @Narsil
Information
Model I am using: T0
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
I want to load T0 across two 24GB GPUs with DeepSpeed in order to run inference. I followed the example code given in issue #15399.
When running the code below, after the model reports `finished initializing model with 11.14B parameters`, it quits without producing a model response. It does not give an error or traceback, just a return code of -9:
[2022-04-05 16:18:09,845] [WARNING] [runner.py:155:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-04-05 16:18:09,912] [INFO] [runner.py:438:main] cmd = /home/aadelucia/miniconda3/envs/fda_cersi_tobacco/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 multi_gpu_T0.py
[2022-04-05 16:18:10,635] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2022-04-05 16:18:10,635] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=2, node_rank=0
[2022-04-05 16:18:10,635] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2022-04-05 16:18:10,635] [INFO] [launch.py:123:main] dist_world_size=2
[2022-04-05 16:18:10,635] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2022-04-05 16:18:11,702] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2022-04-05 16:18:56,295] [INFO] [partition_parameters.py:456:__exit__] finished initializing model with 11.14B parameters
[2022-04-05 16:19:40,754] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 406939
[2022-04-05 16:19:40,754] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 406940
[2022-04-05 16:19:40,754] [ERROR] [launch.py:184:sigkill_handler] ['/home/aadelucia/miniconda3/envs/fda_cersi_tobacco/bin/python', '-u', 'multi_gpu_T0.py', '--local_rank=1'] exits with return code = -9
Here is the code. Run with `deepspeed --num_gpus 2 <script.py>`:
"""
Example code to load a PyTorch model across GPUs
Code from https://github.com/huggingface/transformers/issues/15399
"""
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import torch
import pdb
import os
from tqdm import tqdm
import re
seed = 42
torch.manual_seed(seed)
###
# Deepspeed setup
###
# To avoid warnings about parallelism in tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# distributed setup
local_rank = int(os.getenv('LOCAL_RANK', '0')) # TODO use this
world_size = int(os.getenv('WORLD_SIZE', '1'))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()
model_name = "bigscience/T0"
config = AutoConfig.from_pretrained(model_name)
model_hidden_size = config.d_model
ds_config = {
    "fp16": {
        "enabled": False,
    },
    "bf16": {
        "enabled": True,
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    # batch size has to be divisible by world_size, but can be bigger than world_size
    "train_batch_size": 1 * world_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# Initialize model
# must setup HfDeepSpeedConfig before instantiating the model
# ds_config is deepspeed config object or path to the file
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024) # should be 1024
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# we are ready to initialise deepspeed ZeRO now
ds_engine = deepspeed.initialize(model=model,
                                 config_params=ds_config,
                                 model_parameters=None,
                                 optimizer=None,
                                 lr_scheduler=None)[0]
ds_engine.module.eval() # inference
rank = torch.distributed.get_rank()
text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
# Generation options
# https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/model#transformers.generation_utils.GenerationMixin.generate
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, max_length=256)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n in={text_in}\n out={text_out}")
Expected behavior
T0 should load across 2 GPUs, generate an answer, and then quit.
@stas00, thank you so much for your help! I’m answering for @gportill since we were working on this issue together.
Summary of what worked:
- Install transformers and DeepSpeed from GitHub
- If using NVMe offload, set up the Linux-native asynchronous I/O facility (libaio) that DeepSpeed's async I/O extension requires
- If using CPU offload, increase swap memory following Stas' directions: https://github.com/huggingface/transformers/issues/16616#issuecomment-1102834737
- Load a sharded model (only some models are available sharded; T0 and T0pp are included) - a quick way to check is sketched right after this list
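As a quick sanity check (illustrative only - the revision hosting the sharded T0 weights may differ from the default one), you can confirm that a Hub repo is sharded by looking for the shard index file in its file listing:
# Minimal sketch: check whether a Hub repo hosts a sharded PyTorch checkpoint.
# Sharded checkpoints ship a pytorch_model.bin.index.json plus multiple
# pytorch_model-XXXXX-of-XXXXX.bin shard files.
from huggingface_hub import HfApi

files = HfApi().list_repo_files("bigscience/T0")  # optionally pass revision="..."
print("sharded:", "pytorch_model.bin.index.json" in files)
print([f for f in files if f.startswith("pytorch_model")])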
Full working example:
This example was modified from https://github.com/huggingface/transformers/issues/15399#issue-1117950014 and assumes all of the “summary of what worked” steps were taken.
And the following code to run:
OK, I managed to crash my system with the 11B version with 2 gpus.
Need to figure out cgroup v2 as I moved to Ubuntu 21.10 and my v1 setup no longer works.
Meanwhile I figured out how to run a shell that will not let any process started from it use more memory than I told it to, and thus not kill the host:
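As a rough stand-in only (this is a plain Python illustration, not the cgroup-based shell setup described above; the 64GB figure is a made-up placeholder), a single process can cap its own address space so oversized allocations fail with MemoryError instead of pushing the host into the kernel's OOM killer:
# Illustration only: cap this process's own address space (Linux).
import resource

# Hypothetical 64GB ceiling - pick whatever your host can actually spare.
MAX_BYTES = 64 * 1024**3

soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (MAX_BYTES, hard))
# From here on, allocations past the cap raise MemoryError in this process
# instead of triggering the host-wide OOM killer.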
But since we have this huge 42GB checkpoint, I don't have enough RAM to load it twice in 2 processes. We have just added sharded checkpoints, so we need to switch T0 over to that.
And meanwhile I’m trying to figure out how to get this to run with nvme offload.
I will update more once I have something running.
That’s a really neat summary and code parametrization, @AADeLucia - great work!
Just to add that with the sharded model it's now possible to run inference on T0 (42GB) and other similar models in fp32 using just 2x 24GB GPUs, w/ deepspeed and w/o any offload.
But if you have smaller GPUs, or just one GPU, or larger models, then the above script lets you offload to CPU RAM if you have lots of it, and if not, to an NVMe device - each making the performance progressively slower.
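If you are unsure which setup your hardware can handle, DeepSpeed's ZeRO-3 memory estimator gives a rough per-GPU and per-node CPU breakdown for the model states under the different offload choices. A minimal sketch (the 2-GPU/1-node layout is just the one from this issue):
# Estimate ZeRO-3 memory needs for T0 before picking an offload strategy.
# This instantiates the full fp32 model once, so it needs enough CPU RAM for that.
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0")
# Prints estimated per-GPU and per-node CPU memory for the model states,
# with and without offload, for the given hardware layout.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=2, num_nodes=1)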
And once transformers>4.18.0 and deepspeed>0.6.3 are available, you can install the released versions instead of the git versions.

Here is the nvme offload version that I tested with. It works great even with 1x or 2x tiny GPUs - I didn't see more than 3GB used on each, but it's slow of course.
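The tested NVMe script itself isn't reproduced above. Roughly, relative to the CPU-offload config in the original report, `offload_param` points at an NVMe mount instead; the path, buffer sizes, and aio tuning values below are placeholders to adapt to your setup (drop this into the ds_config from the script above):
# Sketch only: ZeRO-3 parameter offload to NVMe instead of CPU.
# "/local_nvme" is a placeholder path on a fast NVMe drive, and DeepSpeed's
# async I/O extension (libaio) must be installed for this to work.
ds_config["zero_optimization"]["offload_param"] = {
    "device": "nvme",
    "nvme_path": "/local_nvme",
    "pin_memory": True,
    "buffer_count": 5,
    "buffer_size": 1e8,
}
# Optional: tune the asynchronous I/O engine used for the NVMe reads/writes.
ds_config["aio"] = {
    "block_size": 262144,
    "queue_depth": 32,
    "thread_count": 1,
    "single_submit": False,
    "overlap_events": True,
}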