transformers: Multi-GPU inference on RTX 4090 fails with RuntimeError: CUDA error: device-side assert triggered (Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.)

System Info

transformers version: 4.30.0.dev0
Platform: Linux 6.3.5-zen2-1-zen-x86_64-with-glibc2.37.3 on Arch
Python version: 3.10.9
PyTorch version (GPU): 2.0.1+cu118 (True)
peft version: 0.4.0.dev0
accelerate version: 0.20.0.dev0
bitsandbytes version: 0.39.0
NVIDIA driver version: nvidia-dkms-530.41.03-1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Run the following code:

from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline
import torch
import os

# Must be set before any CUDA work so the device-side assert is reported synchronously;
# in the original script this was set after the pipeline was built, which is too late.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

model_name = "/models/wizard-vicuna-13B-HF"

tokenizer = LlamaTokenizer.from_pretrained(model_name)

# device_map="auto" lets accelerate shard the model across all visible GPUs.
model = LlamaForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)

# get_prompt / parse_text are the reporter's own helpers (not shown in the issue);
# these are minimal stand-ins so the snippet runs end to end.
def get_prompt(text):
    return text

def parse_text(raw_output):
    print(raw_output[0]["generated_text"])

prompt = "What are the difference between Llamas, Alpacas and Vicunas?"
raw_output = pipe(get_prompt(prompt))
parse_text(raw_output)

This code works fine on a single 4090 GPU, but loading any model for inference across two or three RTX 4090s results in the following error:

/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [64,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [65,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
--------many such lines----------

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:227, in LlamaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
    222         raise ValueError(
    223             f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
    224         )
    225     attn_weights = attn_weights + attention_mask
    226     attn_weights = torch.max(
--> 227         attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
    228     )
    230 # upcast attention to fp32
    231 attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
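
Note that the assertion itself comes from an indexing kernel (IndexKernel.cu), while the Python traceback lands on the torch.max call in LlamaAttention.forward; with asynchronous CUDA execution the error often surfaces at an op later than the one that actually failed. As a quick sanity check, assuming the failing index op is the token embedding lookup, one can verify on the CPU side that the tokenizer never emits an id outside the model's vocabulary (a minimal sketch, reusing tokenizer, model, and prompt from the snippet above):

ids = tokenizer(prompt, return_tensors="pt").input_ids
vocab_size = model.config.vocab_size
# Any id outside [0, vocab_size) would trip exactly this kind of
# "index out of bounds" device-side assert in the embedding lookup.
if ids.min().item() < 0 or ids.max().item() >= vocab_size:
    print(f"out-of-range token id found; vocab size is {vocab_size}")
else:
    print("all token ids are in range; the inputs are not the problem")

If the check passes, the out-of-range index is more likely being produced inside the multi-GPU forward pass than by the inputs.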

Expected behavior

Code does inference successfully.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 22 (3 by maintainers)

Most upvoted comments

This issue is not happening after transformers update 4.30.2.

Hi @kunaldeo, when I run your code above it still reports the same error, and I checked that the transformers version is 4.30.2. It may be a multi-GPU issue; when I use a single GPU it works normally.

@Xnhyacinth were you able to solve this by any chance? I’m getting the same error with transformers 4.33.3 (my full case description is here)

@kerenganon @Xnhyacinth - Were either of you able to solve this? I'm getting the same error. It began when I upgraded the CUDA driver version from 11.? to 12.2 and the NVIDIA driver version to 535.113.01, so it may be related to driver versions in some way. Prior to upgrading the drivers I had no issues; after the upgrade, I get this error whenever I run inference with Llama models across multiple GPUs. The problem does not occur on a single GPU. Changing the tokenizer's eos or pad token ids (as suggested elsewhere) did not help. The problem seems related to using device_map="auto" (or similar). I'm using transformers 4.31.0, so for me it is not fixed after 4.30.2.
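
For readers landing here, two mitigations that come up in related threads are (1) giving the Llama tokenizer an explicit pad token, since it ships without one and a padding id outside the embedding table produces exactly this index-out-of-bounds assert, and (2) constraining device_map="auto" with an explicit per-GPU max_memory budget. Neither is confirmed as the fix for this issue (the comment above reports the token-id change did not help); the sketch below only shows what those suggestions look like, with the model path taken from the original report and the 20GiB budget an assumption for 24 GB cards:

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_name = "/models/wizard-vicuna-13B-HF"  # path from the original report

tokenizer = LlamaTokenizer.from_pretrained(model_name)
# Llama tokenizers have no pad token by default; reuse EOS so padded batches
# never contain an id outside the embedding table.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = LlamaForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # Cap per-GPU usage explicitly instead of letting accelerate decide alone;
    # the 20GiB figure is an assumption for 24 GB cards, adjust as needed.
    max_memory={0: "20GiB", 1: "20GiB"},
    torch_dtype=torch.float16,
)
model.config.pad_token_id = tokenizer.pad_token_id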