DeepSpeed: [BUG] Incorrect Model Outputs When Using Beam Search

Describe the bug
When I use kernel injection, I get worse generation results than when running transformers without DeepSpeed. I don’t know whether the outputs are expected to match exactly, but they are not just different, they are clearly worse.

I saw these two issues marked as closed: https://github.com/microsoft/DeepSpeed/issues/2048 https://github.com/microsoft/DeepSpeed/issues/2230

But I am using a version that already includes this fix (https://github.com/microsoft/DeepSpeed/pull/2489) and still have the problem.

To Reproduce

Hugging Face (without DeepSpeed):

import random
import os
import numpy as np
import torch

def set_random_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:2"
    os.environ["PL_GLOBAL_SEED"] = str(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

set_random_seed(42)

params = {
 'num_beams': 2,
 'do_sample': False,
 'max_new_tokens': 65,
 'use_cache': True,
 'no_repeat_ngram_size': 5,
 'num_return_sequences': 1}

from transformers import AutoTokenizer, AutoModelForCausalLM

DEVICE = torch.device("cuda:0")
name = "EleutherAI/gpt-j-6B"
model = AutoModelForCausalLM.from_pretrained(name).to(DEVICE).eval().half()
tokenizer = AutoTokenizer.from_pretrained(name)

prompt = "Quantum computers are"

inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        **params
    )

print(prompt)
print()
print(tokenizer.decode(outputs[0])[len(prompt):].strip())

output:


Quantum computers are

the holy grail of computing. They promise to solve problems that are intractable on today’s supercomputers, and they could be the key to solving some of the world’s most pressing problems.

But they’re not quite there yet.

Quantum computers are still in

DeepSpeed (with kernel injection):

import random
import os
import numpy as np
import torch

def set_random_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:2"
    os.environ["PL_GLOBAL_SEED"] = str(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

set_random_seed(42)

params = {
 'num_beams': 2,
 'do_sample': False,
 'max_new_tokens': 65,
 'use_cache': True,
 'no_repeat_ngram_size': 5,
 'num_return_sequences': 1}

from transformers import AutoTokenizer, AutoModelForCausalLM
import deepspeed

DEVICE = torch.device("cuda:0")
name = "EleutherAI/gpt-j-6B"
model = AutoModelForCausalLM.from_pretrained(name).to(DEVICE).eval().half()
tokenizer = AutoTokenizer.from_pretrained(name)

model = deepspeed.init_inference(
    model=model,
    mp_size=1,                        # tensor-parallel degree
    dtype=torch.float16,              # inference dtype
    replace_method="auto",            # let DeepSpeed pick modules to replace
    replace_with_kernel_inject=True,  # use fused inference kernels
)

prompt = "Quantum computers are"

inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        **params
    )

print(prompt)
print()
print(tokenizer.decode(outputs[0])[len(prompt):].strip())

output:



Quantum computers are

the holy grail of modern science. Physics World War II, but they’s.


Quantum computers are a holy grail of quantum computers are a holy

of modern.
quantum-computers are a holy gra
of.

and

are.

.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/mikhail/venv/lib/python3.7/site-packages/torch']
torch version .................... 1.13.0+cu116
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/home/mikhail/venv/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.7.6+a4ceabb6, a4ceabb6, master
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6

System info (please complete the following information):

  • OS: Debian GNU/Linux 10 (buster) (GNU/Linux 4.19.0-21-cloud-amd64 x86_64)
  • CUDA Version: 11.6
  • x1 A100 40Gb
  • DeepSpeed 0.7.6+a4ceabb6
  • Hugging Face Transformers 4.18.0
  • Python 3.7

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 21 (3 by maintainers)

Most upvoted comments

We also need num_beams > 1 to work correctly before we can actually use DeepSpeed.

@mallorbc I’ve done some benchmarks using gpt2 with fp16 precision on my own data (of course ymmv).

System info

  • cuda version 11.7
  • A10G instance 24G
  • DeepSpeed 0.7.7
  • Transformers 4.25.1
  • Python 3.7
  • Torch 1.13.1

In summary, comparing with and without DeepSpeed:

Top-p sampling (top_p = 0.6, temperature = 0.6)

  • Score: ~1% degradation
  • Latency: ~2x speedup

Beam search (beam = 3)

  • Score: ~14% degradation (with some poor generations mixed in)
  • Latency: ~2.5x speedup

Contrastive search (top_k = 4, penalty_alpha = 0.6)

  • Score: ~62% degradation
  • Latency: ~2.8x speedup (partly due to shorter generations)

Eta sampling (eta_cutoff = 0.0005)

  • Score: 0.05% degradation
  • Latency: ~2.2x speedup

So top-p and eta sampling work great, while beam search and contrastive search degrade significantly.
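For reference, the four decoding setups above map onto Hugging Face generate() keyword arguments roughly as follows (a sketch only; max_new_tokens is a placeholder, and eta_cutoff needs a newer transformers release than the 4.25.1 listed above):

# Decoding configurations from the benchmark, expressed as generate() kwargs.
common = {"max_new_tokens": 64}  # placeholder generation length

top_p_sampling     = {**common, "do_sample": True, "top_p": 0.6, "temperature": 0.6}
beam_search        = {**common, "do_sample": False, "num_beams": 3}
contrastive_search = {**common, "top_k": 4, "penalty_alpha": 0.6}
eta_sampling       = {**common, "do_sample": True, "eta_cutoff": 0.0005}

# e.g. outputs = model.generate(**inputs, **beam_search)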

Is there any update, or at least a way to disable KV-cache kernel injection? I don’t need the speedup, just DeepSpeed’s ability to split my simple Hugging Face GPT-2 model over several GPUs.
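For the second part of that question, a minimal sketch of initializing DeepSpeed inference without kernel injection (the model name and mp_size are placeholders; whether the non-injected path actually shards a plain GPT-2 depends on the DeepSpeed release, since automatic tensor parallelism without injection landed later):

import torch
import deepspeed
from transformers import AutoModelForCausalLM

name = "gpt2"  # placeholder; any Hugging Face causal LM
model = AutoModelForCausalLM.from_pretrained(name).eval()

# replace_with_kernel_inject=False skips the fused kernels (and their
# KV-cache handling); mp_size sets how many GPUs the model is split across.
# Launch with: deepspeed --num_gpus 2 this_script.py
model = deepspeed.init_inference(
    model=model,
    mp_size=2,
    dtype=torch.float16,
    replace_with_kernel_inject=False,
)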

Any update on num_beams > 1?

Hi @zelcookie, thanks for reporting this. I am able to reproduce with your scripts and will work on determining a root cause of this.