transformers: Multi GPU inference on RTX 4090 fails with RuntimeError: CUDA error: device-side assert triggered (Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.)
System Info
transformers version: 4.30.0.dev0
Platform: Linux 6.3.5-zen2-1-zen-x86_64-with-glibc2.37.3 on Arch
Python version: 3.10.9
PyTorch version (GPU): 2.0.1+cu118 (True)
peft version: 0.4.0.dev0
accelerate version: 0.20.0.dev0
bitsandbytes version: 0.39.0
nvidia driver version: nvidia-dkms-530.41.03-1
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Run the following code:
from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline
import torch
import os

# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
model_name = "/models/wizard-vicuna-13B-HF"

tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # shard the model across all visible GPUs
    torch_dtype=torch.float16,
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)

# Note: set here this has no effect on work that has already been launched;
# it needs to be in the environment before CUDA is initialized (see below).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

prompt = 'What are the difference between Llamas, Alpacas and Vicunas?'
# get_prompt() and parse_text() are my own prompt-formatting and
# output-parsing helpers (not shown).
raw_output = pipe(get_prompt(prompt))
parse_text(raw_output)
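(Aside, not from the original report: CUDA_LAUNCH_BLOCKING only takes effect if it is set before the process initializes CUDA, so to get a synchronous traceback for the assert it is safer to export it in the shell before launching the script, or set it at the very top of the script before anything touches CUDA, roughly like this:)

# Sketch: set CUDA_LAUNCH_BLOCKING before any CUDA work so kernels run
# synchronously and the failing call shows up directly in the traceback.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is in place
from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline
# ... rest of the reproduction script as above ...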
This code works fine on a single 4090 GPU, but loading any model for inference across 2 or 3 RTX 4090s results in the following error:
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [64,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [65,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [66,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
-------- many such lines, one per thread ----------
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:227, in LlamaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
222 raise ValueError(
223 f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
224 )
225 attn_weights = attn_weights + attention_mask
226 attn_weights = torch.max(
--> 227 attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
228 )
230 # upcast attention to fp32
231 attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Expected behavior
The code runs inference successfully on multiple GPUs.
About this issue
- State: closed
- Created a year ago
- Comments: 22 (3 by maintainers)
Hi @kunaldeo. When I run your code above it still reports the same error, and I checked that my transformers version is 4.30.2, so it may be a multi-GPU issue; when I use a single GPU it works normally.
This issue is not happening after the transformers 4.30.2 update.
Finally solved this by disabling ACS in the BIOS, ref https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs
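(For anyone else checking this: a quick way to see whether the PCI bridges have ACS enabled, a rough sketch based on the lspci check described in the NCCL troubleshooting page linked above; it needs root to show the ACSCtl capability:)

# Sketch: report PCI bridges whose ACS source validation is enabled.
import subprocess

out = subprocess.run(
    ["lspci", "-vvv"], capture_output=True, text=True, check=True
).stdout

for line in out.splitlines():
    if "ACSCtl" in line and "SrcValid+" in line:
        # "SrcValid+" means ACS is redirecting peer-to-peer traffic through
        # the root complex, which can break or slow direct GPU<->GPU transfers.
        print(line.strip())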
This test is very helpful. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-to-gpu-communication
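(A rough in-PyTorch counterpart of that kind of GPU-to-GPU check, my own sketch rather than anything from the linked docs, is to query peer access between every pair of devices and try a direct cross-device copy:)

# Sketch: verify that each pair of GPUs can access each other directly (P2P).
import torch

n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src == dst:
            continue
        ok = torch.cuda.can_device_access_peer(src, dst)
        print(f"GPU {src} -> GPU {dst}: peer access {'OK' if ok else 'NOT available'}")

# A simple cross-device copy; with a broken P2P/ACS setup this is typically
# where hangs or corrupted data show up.
if n >= 2:
    x = torch.randn(1024, 1024, device="cuda:0")
    y = x.to("cuda:1")
    print("copy matches:", torch.equal(x.cpu(), y.cpu()))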
@kerenganon @Xnhyacinth - Were either of you able to solve this? I'm getting the same error. It began when I upgraded the CUDA driver version from 11.? to 12.2 and the NVIDIA driver to 535.113.01, so it may be related to driver versions in some way. Prior to upgrading the drivers I had no issues. After the upgrade, I get this error when I attempt to run inference with Llama models across multiple GPUs; the problem does not occur on a single GPU. Changing the tokenizer eos or pad token_ids (as suggested elsewhere) did not help. The problem seems related to using device_map="auto" (or similar). I'm using transformers 4.31.0, so it doesn't seem to be fixed after 4.30.2 for me.
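(Until the multi-GPU path works, the workaround implied by the thread is to keep the model on a single GPU instead of sharding it. A minimal sketch, with the device index as an example:)

# Sketch: load the model entirely on one GPU instead of device_map="auto",
# avoiding the cross-GPU path that triggers the device-side assert.
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_name = "/models/wizard-vicuna-13B-HF"  # path from the reproduction above
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(
    model_name,
    device_map={"": 0},         # place every module on cuda:0
    torch_dtype=torch.float16,
)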