transformers: Multi-GPU inference on RTX 4090 fails with RuntimeError: CUDA error: device-side assert triggered (Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.)

System Info

transformers version: 4.30.0.dev0
Platform: Linux 6.3.5-zen2-1-zen-x86_64-with-glibc2.37.3 on Arch
Python version: 3.10.9
PyTorch version (GPU): 2.0.1+cu118 (True)
peft version: 0.4.0.dev0
accelerate version: 0.20.0.dev0
bitsandbytes version: 0.39.0
NVIDIA driver version: nvidia-dkms-530.41.03-1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Run the following code:

from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline
import torch
import os

# Must be set before any CUDA work so the device-side assert is reported synchronously;
# in the original script this was set after the pipeline was built, which is too late.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

model_name = "/models/wizard-vicuna-13B-HF"

tokenizer = LlamaTokenizer.from_pretrained(model_name)

# device_map="auto" lets accelerate shard the model across all visible GPUs.
model = LlamaForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)

# get_prompt / parse_text are the reporter's own helpers (not shown in the issue);
# these are minimal stand-ins so the snippet runs end to end.
def get_prompt(text):
    return text

def parse_text(raw_output):
    print(raw_output[0]["generated_text"])

prompt = "What are the difference between Llamas, Alpacas and Vicunas?"
raw_output = pipe(get_prompt(prompt))
parse_text(raw_output)

This code works fine on a single 4090 GPU, but loading any model for inference across two or three RTX 4090s results in the following error:

/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [64,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [65,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
--------many such lines----------

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:227, in LlamaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
    222         raise ValueError(
    223             f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
    224         )
    225     attn_weights = attn_weights + attention_mask
    226     attn_weights = torch.max(
--> 227         attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
    228     )
    230 # upcast attention to fp32
    231 attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
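
Note that the assertion itself comes from an indexing kernel (IndexKernel.cu), while the Python traceback lands on the torch.max call in LlamaAttention.forward; with asynchronous CUDA execution the error often surfaces at an op later than the one that actually failed. As a quick sanity check, assuming the failing index op is the token embedding lookup, one can verify on the CPU side that the tokenizer never emits an id outside the model's vocabulary (a minimal sketch, reusing tokenizer, model, and prompt from the snippet above):

ids = tokenizer(prompt, return_tensors="pt").input_ids
vocab_size = model.config.vocab_size
# Any id outside [0, vocab_size) would trip exactly this kind of
# "index out of bounds" device-side assert in the embedding lookup.
if ids.min().item() < 0 or ids.max().item() >= vocab_size:
    print(f"out-of-range token id found; vocab size is {vocab_size}")
else:
    print("all token ids are in range; the inputs are not the problem")

If the check passes, the out-of-range index is more likely being produced inside the multi-GPU forward pass than by the inputs.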

Expected behavior

Code does inference successfully.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 22 (3 by maintainers)

Most upvoted comments

This issue is not happening after transformers update 4.30.2.

Hi @kunaldeo, when I run your code above it still reports the same error, and I checked that the transformers version is 4.30.2. It may be a multi-GPU issue; when I use a single GPU it works normally.

@Xnhyacinth were you able to solve this by any chance? I’m getting the same error with transformers 4.33.3 (my full case description is here)

@kerenganon @Xnhyacinth - Were either of you able to solve this? I'm getting the same error. It began when I upgraded the CUDA driver version from 11.? to 12.2 and the NVIDIA driver version to 535.113.01, so it may be related to driver versions in some way. Prior to upgrading the drivers I had no issues; after the upgrade, I get this error whenever I run inference with Llama models across multiple GPUs. The problem does not occur on a single GPU. Changing the tokenizer's eos or pad token ids (as suggested elsewhere) did not help. The problem seems related to using device_map="auto" (or similar). I'm using transformers 4.31.0, so for me it is not fixed after 4.30.2.
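
For readers landing here, two mitigations that come up in related threads are (1) giving the Llama tokenizer an explicit pad token, since it ships without one and a padding id outside the embedding table produces exactly this index-out-of-bounds assert, and (2) constraining device_map="auto" with an explicit per-GPU max_memory budget. Neither is confirmed as the fix for this issue (the comment above reports the token-id change did not help); the sketch below only shows what those suggestions look like, with the model path taken from the original report and the 20GiB budget an assumption for 24 GB cards:

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_name = "/models/wizard-vicuna-13B-HF"  # path from the original report

tokenizer = LlamaTokenizer.from_pretrained(model_name)
# Llama tokenizers have no pad token by default; reuse EOS so padded batches
# never contain an id outside the embedding table.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = LlamaForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # Cap per-GPU usage explicitly instead of letting accelerate decide alone;
    # the 20GiB figure is an assumption for 24 GB cards, adjust as needed.
    max_memory={0: "20GiB", 1: "20GiB"},
    torch_dtype=torch.float16,
)
model.config.pad_token_id = tokenizer.pad_token_id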