FastChat: RuntimeError: CUDA error: device-side assert triggered when running Llama on multiple GPUs
I’m getting the following error when using more than one GPU:
python3 -m fastchat.serve.cli --model-name /tmp/cache/vicuna-13b/ --num-gpus 2
I am unsure whether this is a problem on my end or something that can be fixed in FastChat. Can you please confirm whether using multiple GPUs is supported by FastChat and whether there are any specific requirements that must be met? Thank you.
I’m using 4x V100 (32 GB), and yes, I’ve already tried both the 2-GPU and 4-GPU combinations.
About this issue
- State: closed
- Created a year ago
- Comments: 17 (1 by maintainers)
@zhisbug Hi, I have updated the fschat and transformers packages to the latest versions and re-converted the model to the Hugging Face format, but the error mentioned before still occurs when running the client on two RTX 4090 GPUs. I don't think this issue has been solved; could you reopen it?
In case it helps anybody else: The problem posted by @starphantom666 may be a different problem from the original post (OP) above. A problem consistent with the @starphantom666 report can occur because of the recently changed handling of BOS/EOS tokens in the Hugging Face (“HF”) Llama implementation.
"<s>"
) is wrongly represented by ID#32000
, whereas the embeddings now expect1
.export CUDA_LAUNCH_BLOCKING=1
, then one will see an assertion message of the form... Indexing.cu: ... indexSelectLargeIndex: ... Assertion srcIndex < srcSelectDimSize failed.
from a CUDA kernel launched by the torch embedding function – presumably because the 32000 is past the end of the embedding table.A solution is to update both HF transformers and FastChat repos to the latest and re-convert weights from the original weights to HF weights, and then from HF weights to vicuna weights.
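For anyone who wants to confirm this on their own converted weights, here is a minimal sketch (my own, not part of FastChat; the model path is just the placeholder from the command at the top of the thread) that compares the tokenizer's special-token IDs with the size of the model's embedding table:

```python
# Sanity-check sketch: do the tokenizer's IDs fit inside the model's embedding table?
# model_path is a placeholder; point it at your converted Vicuna weights.
from transformers import AutoConfig, AutoTokenizer

model_path = "/tmp/cache/vicuna-13b/"

config = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

print("embedding rows (config.vocab_size):", config.vocab_size)
print("bos_token_id:", tokenizer.bos_token_id)
print("eos_token_id:", tokenizer.eos_token_id)

# Any encoded ID >= config.vocab_size (e.g. a BOS of 32000 against a
# 32000-row table) is past the end of the embedding table and will trip
# the indexSelectLargeIndex device-side assert once the lookup runs on GPU.
ids = tokenizer("Hello, world", return_tensors="pt").input_ids[0].tolist()
print("out-of-range ids:", [i for i in ids if i >= config.vocab_size] or "none")
```

If any out-of-range IDs show up, re-converting the weights as described above should resolve it.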
I also have the same problem. I can use the Vicuna 13B model properly with the --load-8bit option on a single 4090 GPU, but when I use multiple GPUs (--num-gpus 2), this problem occurs. I'm still seeing the same traceback message and I can't figure out why.
I got the same problem on a dual-4090 machine. I tried the same command with two 3090s and it worked well. I guessed it was a driver/CUDA version problem, but then I did some searching and found the following post: https://discuss.pytorch.org/t/ddp-training-on-rtx-4090-ada-cu118/168366
It seems the 4090 does not support peer-to-peer communication between multiple cards at all. I am not 100% sure this is the root cause, since I am not an expert in this domain. Can someone double-check it? Thanks.
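One way to check this locally is to ask PyTorch whether the driver reports peer-to-peer access between the cards; a quick sketch (not from the thread):

```python
# Print whether the driver reports P2P access for each ordered pair of GPUs.
import itertools
import torch

num_gpus = torch.cuda.device_count()
for src, dst in itertools.permutations(range(num_gpus), 2):
    ok = torch.cuda.can_device_access_peer(src, dst)
    print(f"GPU{src} -> GPU{dst}: peer access = {ok}")
```

If this prints False for every pair, that would be consistent with the linked forum post.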
=== Updated below ===
I tried setting NCCL_P2P_DISABLE=1 and ran other code that trains LoRA with two 4090s. Now it works (it used to get stuck).
But when I try running Vicuna with P2P disabled, it quits and reports another error.
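For reference, NCCL_P2P_DISABLE has to be in the environment before NCCL initializes. The sketch below shows one way to apply it from inside a Python script (illustrative only; the single-rank group and the address/port values are placeholders, and prefixing the launch command in the shell works the same way):

```python
# Sketch: NCCL_P2P_DISABLE must be set before NCCL is initialized.
# Shell equivalent:
#   NCCL_P2P_DISABLE=1 python3 -m fastchat.serve.cli --model-name /tmp/cache/vicuna-13b/ --num-gpus 2
import os

os.environ["NCCL_P2P_DISABLE"] = "1"  # set before torch.distributed starts NCCL

import torch.distributed as dist  # noqa: E402

# Minimal single-process NCCL group (requires a CUDA GPU) just to confirm
# initialization succeeds with P2P disabled; a real LoRA training run would
# use the normal multi-process launcher instead.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=0, world_size=1)
print("NCCL initialized with NCCL_P2P_DISABLE =", os.environ["NCCL_P2P_DISABLE"])
dist.destroy_process_group()
```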
No, loading the model with dual 4090s still doesn't work. I currently use a single card with the --load-8bit quantization option as a workaround. It should have very little performance degradation. Hope this helps.
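For anyone who wants the same single-GPU 8-bit workaround outside the FastChat CLI, here is a rough sketch using transformers' bitsandbytes integration (this is not FastChat's own loading code; it assumes bitsandbytes is installed, and the model path is the placeholder from the original command):

```python
# Sketch: load the converted weights in 8-bit on a single GPU and run a test prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/tmp/cache/vicuna-13b/"  # placeholder path from the OP's command

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,      # requires bitsandbytes
    device_map={"": 0},     # keep every layer on a single GPU
)

inputs = tokenizer("Hello! How are you?", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```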