vllm: CUDA error: an illegal memory access with Falcon 40B
Hi,
I am testing different models with vLLM. I see

`CUDA error: an illegal memory access`

when I use Falcon 40B. The code I use is:

```python
from vllm import LLM, SamplingParams

# ckpt_dir points to the local Falcon-40B checkpoint; prompts is a list of strings
llm = LLM(model=ckpt_dir, tensor_parallel_size=4, trust_remote_code=True, gpu_memory_utilization=0.8)
sampling_params = SamplingParams(temperature=0, top_p=1.0, max_tokens=300)
results = llm.generate(prompts, sampling_params)
```

I am running on a machine with 4 A100 GPUs. Please let me know if you have any questions.
About this issue
- State: closed
- Created a year ago
- Comments: 15 (5 by maintainers)
That’s interesting! I followed the bug and found some additional information:

1. When the config is loaded, the `from_pretrained` method is used, where the parent class `PretrainedConfig` initializer is called. This initializer overwrites the `model_type` class variable with the value from `config.json`. This explains why `hf_config.model_type` equals `RefinedWeb`.
2. When the config is converted back to a dict, the `PretrainedConfig.to_dict()` method is used. Inside this method, `model_type` is set back to `__class__.model_type`. This can be seen here. In this case, `__class__` refers to `RWConfig`, whose `model_type` is `falcon`.
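A minimal, self-contained sketch of that mechanism, assuming the `transformers` behavior described above; `RWConfigSketch` is a made-up stand-in for the real `RWConfig`, so nothing needs to be downloaded:

```python
from transformers import PretrainedConfig

class RWConfigSketch(PretrainedConfig):
    # Class variable, mirroring RWConfig.model_type.
    model_type = "falcon"

# PretrainedConfig.__init__ copies unrecognized kwargs onto the instance,
# just as from_pretrained() copies the values found in config.json, so the
# instance-level model_type shadows the class variable.
cfg = RWConfigSketch(model_type="RefinedWeb")
print(cfg.model_type)               # RefinedWeb

# PretrainedConfig.to_dict() resets model_type from __class__.model_type,
# so the serialized config reports "falcon" again.
print(cfg.to_dict()["model_type"])  # falcon
```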
@LopezGG @lw921014 @tju01 @JasmondL We just merged #992, which fixed this issue. Please install vLLM from source and try again. Sorry for the inconvenience.
@chu-tianxiang Thanks for the very detailed explanation! It helps a lot!
Using commit aa84c92ef636e689b506b9842c712e5c615cc73a makes it work for me, while 621980bdc0d5a41e224febf962a6e0474e2b14ef gives me the same error, so it seems the error was introduced in 621980bdc0d5a41e224febf962a6e0474e2b14ef. I am using `OpenAssistant/falcon-40b-sft-top1-560` on 4x A6000 and tried it with both `tensor_parallel_size=4` and `tensor_parallel_size=2`.

Hi,
I am facing the same issue with 4 A10G GPUs.