DeepSpeed: [BUG] DeepSpeed Inference with GPT-J using batches with padding gives wrong outputs

Describe the bug

Using DeepSpeed Inference (via deepspeed.init_inference) gives wrong outputs when using batch size > 1 with padded inputs.

I’ll first state the problem with more detail and then explain what I tried in order to narrow it down.

The problem: I'm trying to run inference with GPT-J (EleutherAI/gpt-j-6B) on a very large dataset, so I want to achieve the highest throughput possible for my setup. I'm using a p3.16xlarge instance with 8 V100 GPUs, so in theory I can fit a batch size larger than 1, since DeepSpeed shards the model tensors across the GPUs. Since the inputs have different lengths, I have to use padding. This is how I pad (let's assume batch_size=4, so len(input_texts) = 4):

# Left padding with the EOS token, since GPT-J has no dedicated pad token
# and decoder-only models should be padded on the left for generation.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

tokenized_inputs = tokenizer(
    list(input_texts),
    return_tensors='pt',
    padding=True,
    max_length=tokenizer.model_max_length - args.max_new_tokens,
    truncation=True,
)
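
For illustration (not part of my actual script), this is roughly what the left padding produces for a toy batch: the shorter input gets pad tokens and attention_mask zeros on the left, which is exactly the region that needs to be masked out.

# Illustration only: inspect the batch produced by the tokenizer configured above.
# The two prompts here are hypothetical; any two inputs of different length work.
toy = tokenizer(
    ["a short prompt", "a noticeably longer prompt with several more tokens"],
    return_tensors='pt',
    padding=True,
)
print(toy['input_ids'].shape)    # (2, length_of_longest_prompt)
print(toy['attention_mask'][0])  # leading zeros mark the left padding of the shorter input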

Now to the problem, assuming these are the sequence lengths of each input (in number of tokens, of course):

  • idx0: 1452
  • idx1: 1588
  • idx2: 1055
  • idx3: 650

The outputs I get from the model are exactly what I expect for idx1 (since it's the longest and has no padding), very close to what I expect for idx0, but terrible for idx3. What I "expect" is what I get when I run the exact same code with DeepSpeed with batch_size=1, or when I run the same code without DeepSpeed on CPU with batch_size=4. In both of these cases (DeepSpeed bsz=1 and CPU bsz=4) the outputs are identical, and they also make sense (it's an extraction task, so I can tell whether an output makes sense or not).
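
To make the comparison concrete, here is a rough sketch of the kind of check I mean (a hypothetical helper, not code from my actual script): generate once with the full padded batch and once per example, then compare the decoded outputs.

import torch

def compare_batched_vs_single(model, tokenizer, input_texts, device, max_new_tokens=50):
    """Hypothetical helper: compare greedy outputs for a full batch vs. one-by-one."""
    def generate(texts):
        enc = tokenizer(list(texts), return_tensors='pt', padding=True).to(device)
        with torch.inference_mode():
            out = model.generate(
                input_ids=enc['input_ids'],
                attention_mask=enc['attention_mask'],
                do_sample=False,
                max_new_tokens=max_new_tokens,
                pad_token_id=tokenizer.eos_token_id,
            )
        return tokenizer.batch_decode(out, skip_special_tokens=True)

    batched = generate(input_texts)                    # batch_size = len(input_texts), padded
    singles = [generate([t])[0] for t in input_texts]  # batch_size = 1, no padding
    for i, (b, s) in enumerate(zip(batched, singles)):
        print(f"idx{i}: {'MATCH' if b == s else 'MISMATCH'}")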

I tried to figure out what exactly causes this problem, and based on the evidence I've gathered I think that the sequences with a lot of padding on the left side somehow accumulate a huge attention weight that is not correctly masked out by the attention mask. My evidence:

  1. If I run with DeepSpeed bsz=4 and torch.float16, the outputs I get are just "!!!" (no matter the prompt). But if I run with torch.float32 I get "normal" outputs, though, as I said, they differ from what I expect (defined above). This makes me think some tensor overflows with fp16 but not with fp32. I should also mention that running DeepSpeed with fp16 and bsz=1 works perfectly.
  2. The longest input in the batch (which has no padding at all) gives the expected result. Those that are close to it in length have only a slightly weird output (small amount of padding tokens). Those that are much shorter (many padding tokens) have highly unrelated output.
  3. The way the GPT-J attention mechanism works (at least in the HuggingFace implementation) is that you add -10,000 to the attention weight wherever the attention mask is 0 (see the sketch after this list). This might not be enough if the many padding tokens accumulate a large attention weight. Although when I run it on CPU with the HuggingFace implementation everything is OK, so that might not be the reason.
  4. I'm pretty sure the culprit is this function: https://github.com/microsoft/DeepSpeed/blob/a10e4811fe78b707289132c9695bade4715fe59b/csrc/transformer/inference/csrc/softmax.cu#L203. Unfortunately I don't speak CUDA, so it's very hard for me to follow the code and pinpoint exactly what the problem is. As far as I can tell, the HuggingFace implementation of attention works (https://github.com/huggingface/transformers/blob/2c2a31ffbcfe03339b1721348781aac4fc05bc5e/src/transformers/models/gptj/modeling_gptj.py#L72).
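
For reference, here is a minimal sketch of the additive masking I mean in point 3 (my own illustration, not the actual HuggingFace or DeepSpeed code):

import torch

def masked_softmax(scores, attention_mask):
    """Additive masking sketch: scores is [batch, heads, q_len, k_len],
    attention_mask is [batch, k_len] with 1 = real token, 0 = padding."""
    bias = (1.0 - attention_mask[:, None, None, :].to(scores.dtype)) * -10000.0
    # Padding positions get a large negative bias, so after the softmax their
    # probability is numerically zero and they cannot accumulate weight.
    return torch.softmax(scores + bias, dim=-1)

fp16 tops out around 65504, so if unmasked padding scores pile up somewhere before or inside the softmax, an overflow there would be consistent with the fp16 behavior in point 1.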

To Reproduce

import os

import torch
import deepspeed
from transformers import AutoTokenizer, GPTJForCausalLM

# input_texts and args (with args.max_new_tokens) come from the rest of the script.

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 2048

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
device = torch.device(f'cuda:{local_rank}')

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
model.config.pad_token_id = model.config.eos_token_id

# Shard the model across the GPUs and inject the DeepSpeed inference kernels.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float32,
    replace_method='auto',
    replace_with_kernel_inject=True,
)
model.device = device

# Left-pad the batch, reserving room for the generated tokens.
tokenized_inputs = tokenizer(
    list(input_texts),
    return_tensors='pt',
    padding=True,
    max_length=tokenizer.model_max_length - args.max_new_tokens,
    truncation=True,
).to(device)

# Greedy generation; with batch_size > 1 this is where the outputs go wrong.
with torch.inference_mode():
    batch_output_tokens = model.generate(
        input_ids=tokenized_inputs['input_ids'],
        attention_mask=tokenized_inputs['attention_mask'],
        do_sample=False,
        max_new_tokens=args.max_new_tokens,
        min_length=tokenized_inputs.input_ids.shape[1] + args.max_new_tokens,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

batch_output_text = tokenizer.batch_decode(batch_output_tokens, skip_special_tokens=True)

Expected behavior

Running DeepSpeed with batch_size=1 or batch_size=4 (or larger) should give the same outputs. Running DeepSpeed with fp16 and batch size > 1 should work and not give "!!!".

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.6/site-packages/torch']
torch version .................... 1.10.2+cu111
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/opt/conda/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.6.0+2151c78, 2151c78, master
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.1

System info

SageMaker instance p3.16xlarge with SageMaker container 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04

Launcher context

Launching with deepspeed --num_gpus 8 run_inference.py

Docker context

See above.


Most upvoted comments

Hi guys,

Sorry for my delay here! @codertimo Yes, you are right that the padding is not handled correctly for this model at the softmax kernel. This has been fixed very recently for the BLOOM model, and I am going to work on fixing it for the rest of the models too. I am going to focus on this more and send a PR with a fix soon. Thanks, Reza

Thanks @RezaYazdaniAminabadi for fixing this!

Commit 4abd455521965930d0e921de8afc0073ea7df9d1 from the PR you mentioned fixes the problem when I tested it using a Huggingface gpt2 model. By the way: The commit aafba00c81eaf29c0c2b209a94bc31f4de942936 before that still had the bug.

I wasn't able to test the PR on longer input sequences, though. The model seems to produce wrong/non-deterministic outputs there due to https://github.com/microsoft/DeepSpeed/issues/2243 . You mentioned that you might have a fix for that issue, too. Once you merge the fix for the latter issue, I will go ahead and test it on the longer input sequences as well.

Thanks @tomerip for looking back into this. I think this does appear to be the same underlying issue as https://github.com/microsoft/DeepSpeed/issues/2357. A fix for this will likely come from https://github.com/microsoft/DeepSpeed/pull/2433, but we are still seeing this issue on that PR currently. When we have an updated PR to test, I’ll update here.

Happy to help with testing any potential fixes!

If it will still take some time, it would be great to have a link to the BLOOM fix, so that we can create a fix ourselves.

@RezaYazdaniAminabadi Hi! I ran into the same issue with a GPT model that takes padded input_ids. Is there any update on this issue?

Hey @tomerip,

Sorry for the long delay here. We have a deadline at the end of the week, and I can spend more time on this issue next week. Hopefully, this should not take too much time. Thanks, Reza