DeepSpeed: [BUG] Can't load OPT-30B and OPT-66B through checkpoints.json
Describe the bug
I can’t load OPT-30B or OPT-66B through checkpoints.json. If I load them with Hugging Face from_pretrained, everything works fine. This bug is troublesome because my production nodes have far less memory than my dev node, so they don’t have enough CPU memory to load OPT-30B and OPT-66B.
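For context, the --use_checkpoints_json path builds the model on the meta device (so no CPU RAM is spent on weights) and hands deepspeed.init_inference a JSON manifest of the checkpoint shards to load directly. Roughly, the script does something like the sketch below; this is a minimal sketch using the DeepSpeed 0.7.x API, and the details in bloom-ds-inference.py may differ:

```python
import json
from pathlib import Path

import torch
import deepspeed
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "facebook/opt-30b"

# Build the model skeleton on the meta device: shapes/dtypes only, no weight data.
config = AutoConfig.from_pretrained(model_name)
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

# Write a manifest pointing DeepSpeed at the raw checkpoint shards.
# The "type": "BLOOM" value is what the original BLOOM script writes;
# the OPT fork may use a different value.
repo_path = snapshot_download(model_name)
shards = [str(p) for p in sorted(Path(repo_path).glob("*.bin"))]
with open("checkpoints.json", "w") as f:
    json.dump({"type": "BLOOM", "checkpoints": shards, "version": 1.0}, f)

# DeepSpeed injects its inference kernels and loads the weights itself,
# so a full CPU-resident copy of the model is never needed.
model = deepspeed.init_inference(
    model,
    mp_size=4,                        # tensor-parallel degree, matches --num_gpus 4
    dtype=torch.float16,
    checkpoint="checkpoints.json",
    replace_with_kernel_inject=True,
)
```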
To Reproduce
Python 3.7.7
git clone https://github.com/anselmwang/transformers-bloom-inference/
cd transformers-bloom-inference
git checkout explore_ds
pip install --upgrade pip
pip install "transformers>=4.21.3" "accelerate>=0.12.0"
pip install "deepspeed>=0.7.3"
Without checkpoints_json, this command works:
date; deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name facebook/opt-30b; date
Below is the stack trace when using checkpoints.json:
date; deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name facebook/opt-30b --use_checkpoints_json; date
Traceback (most recent call last):
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 192, in <module>
model = deepspeed.init_inference(
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 311, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 127, in __init__
self.module.to(device)
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1682, in to
return super().to(*args, **kwargs)
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 987, in to
return self._apply(convert)
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 639, in _apply
module._apply(fn)
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 639, in _apply
module._apply(fn)
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 639, in _apply
module._apply(fn)
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 662, in _apply
param_applied = fn(param)
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 985, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
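The error itself just means that some parameters are still meta tensors (shape-and-dtype placeholders with no storage) when the engine calls self.module.to(device), i.e. they were never materialized from the checkpoint. The failure is easy to reproduce in isolation:

```python
import torch

# A meta tensor carries shape and dtype but allocates no storage.
t = torch.empty(4, 4, device="meta")
print(t.shape, t.device)  # torch.Size([4, 4]) meta

# Any attempt to copy it to a real device fails with exactly the error above:
# NotImplementedError: Cannot copy out of meta tensor; no data!
t.to("cpu")
```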
For OPT-66B, this command works:
date; deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name facebook/opt-66b; date
But when checkpoints.json is turned on,
date; deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name facebook/opt-66b --use_checkpoints_json; date
the command fails with the stack trace below:
Traceback (most recent call last):
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 190, in <module>
model = deepspeed.init_inference(
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 311, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 124, in __init__
self._apply_injection_policy(config)
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 349, in _apply_injection_policy replace_transformer_layer(client_module,
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 926, in replace_transformer_layer
load_model_with_checkpoint(
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/deepspeed/module_inject/load_checkpoint.py", line 349, in load_model_with_checkpoin
t
load_module_recursive(r_module)
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/deepspeed/module_inject/load_checkpoint.py", line 343, in load_module_recursive
load_module_recursive(
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/deepspeed/module_inject/load_checkpoint.py", line 343, in load_module_recursive
load_module_recursive(
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/deepspeed/module_inject/load_checkpoint.py", line 343, in load_module_recursive
load_module_recursive(
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/deepspeed/module_inject/load_checkpoint.py", line 341, in load_module_recursive
layer_policies[child.__class__](child, prefix + name + '.')
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/deepspeed/module_inject/load_checkpoint.py", line 258, in load_transformer_layer
maybe_copy_qkv(module.attention,
File "/home/yuwan/GitRoot/opt_pipeline/transformers-bloom-inference/venv/lib/python3.9/site-packages/deepspeed/module_inject/load_checkpoint.py", line 203, in maybe_copy_qkv
k = sd[0][src_names[1]]
KeyError: 'model.decoder.layers.28.self_attn.k_proj.weight'
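The KeyError hints that maybe_copy_qkv looks the key up in a single shard's state dict (sd[0]), while OPT-66B's checkpoint is split across many pytorch_model-*-of-*.bin files, so layer 28's weights live in a later shard. Which shard holds the missing key can be checked against the index file Hugging Face writes next to the shards (the local path below is illustrative):

```python
import json

# Sharded HF checkpoints ship a pytorch_model.bin.index.json whose
# weight_map maps every parameter name to the shard file that contains it.
with open("pytorch_model.bin.index.json") as f:
    weight_map = json.load(f)["weight_map"]

key = "model.decoder.layers.28.self_attn.k_proj.weight"
print(weight_map[key])  # e.g. "pytorch_model-00007-of-00014.bin" (name illustrative)
```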
Expected behavior
Loading OPT-30B and OPT-66B through checkpoints.json should succeed, just as loading them with from_pretrained does.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/tmp/code/transformers-bloom-inference/venv/lib/python3.7/site-packages/torch']
torch version .................... 1.13.0+cu117
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed install path ........... ['/tmp/code/transformers-bloom-inference/venv/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.7.6, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
System info (please complete the following information):
- OS: Ubuntu 18.04
- GPU count and types: 1 node with 4 A6000, 46GB memory per GPU
- Hugging Face Transformers/Accelerate/etc. versions: transformers 4.25.1, deepspeed 0.7.7, torch 1.13.0
- Python version: 3.7.7
Docker context
Not using Docker.
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 19 (4 by maintainers)
@RezaYazdaniAminabadi I can confirm that version 0.8.0 fixed the issue for me.
I can confirm that I’m able to replicate this. Interestingly, I’m finding that smaller OPT models work when loaded with meta tensors. It appears that models whose HuggingFace checkpoints are split are causing this error (i.e., they have multiple pytorch_model-*-of-*.bin files). @RezaYazdaniAminabadi any idea of the cause? I’m guessing we don’t catch this in our unit tests because we use small versions of these larger models to save time.
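One way to check the split-checkpoint theory without downloading any weights, using huggingface_hub's list_repo_files (model names below are the public OPT checkpoints):

```python
from huggingface_hub import list_repo_files

# Per the observation above: single-file checkpoints (pytorch_model.bin) load
# fine with meta tensors; multi-shard ones (pytorch_model-*-of-*.bin) hit the
# KeyError during load_checkpoint.
for name in ["facebook/opt-1.3b", "facebook/opt-13b", "facebook/opt-30b", "facebook/opt-66b"]:
    bins = [f for f in list_repo_files(name) if f.endswith(".bin")]
    print(f"{name}: {len(bins)} checkpoint file(s)")
```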