vllm: CUDA error: an illegal memory access with Falcon 40B

Hi, I am testing different models with vLLM. I see "CUDA error: an illegal memory access" when I use Falcon 40B. The code I use is:

from vllm import LLM, SamplingParams

llm = LLM(model=ckpt_dir, tensor_parallel_size=4, trust_remote_code=True, gpu_memory_utilization=0.8)
sampling_params = SamplingParams(temperature=0, top_p=1.0, max_tokens=300)
results = llm.generate(prompts, sampling_params)

I am using a machine with 4 A100 GPUs. Please let me know if you have any questions.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

I used the following code to find out where the issue is:

import vllm
vllm.LLM('OpenAssistant/falcon-40b-sft-mix-1226', trust_remote_code=True, tensor_parallel_size=2)

The logic of commit 621980b itself seems to be correct. However, there appears to be a pre-existing issue with the check for whether a model is of type falcon, which only became relevant through this commit. The check is on the following line:

https://github.com/vllm-project/vllm/blob/d2b2eed67c49cdda3c1d6fa09ee2ec128b318138/vllm/config.py#L102
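For illustration, here is a minimal hypothetical sketch (not the actual vLLM code) of that kind of check, assuming it boils down to a plain string comparison against "falcon":

def looks_like_falcon(model_type):
    # Hypothetical check: only the canonical "falcon" string matches.
    return model_type == "falcon"

print(looks_like_falcon("falcon"))      # True
print(looks_like_falcon("RefinedWeb"))  # False: older Falcon configs slip past the check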

If I add the following code before this check:

print('self.hf_config.model_type =', self.hf_config.model_type)

Then I get the following result:

(RayWorker pid=186681) self.hf_config.model_type = RefinedWeb

However, if I add the following:

print('self.hf_config =', self.hf_config)

I get this result:

(RayWorker pid=186681) self.hf_config = RWConfig {
[...]
(RayWorker pid=186681)   "model_type": "falcon",
(RayWorker pid=186681)   "multi_query": true,
[...]
(RayWorker pid=186681) }

This clearly states "model_type": "falcon" and not RefinedWeb.

I don’t understand this. I also get the warning "You are using a model of type RefinedWeb to instantiate a model of type falcon. This is not supported for all configurations of models and can yield errors." at startup, which is probably significant here.

That’s interesting! I dug into the bug and found some additional information:

  1. The bug occurs in the from_pretrained method, where the parent class PretrainedConfig initializer is called. This initializer shadows the model_type class attribute with an instance attribute taken from config.json. This explains why hf_config.model_type equals RefinedWeb.
  2. However, when printing out the config, the PretrainedConfig.to_dict() method is used. Inside this method, the model_type is set back to __class__.model_type. This can be seen here. In this case, __class__ refers to RWConfig, whose model_type is falcon. (A minimal sketch of this behavior follows below.)
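To make both observations concrete, here is a small self-contained sketch in plain Python (not the actual transformers code, and simplified under the assumption that the base initializer stores config.json keys as instance attributes, as described in point 1):

class BaseConfig:
    # Stand-in for transformers' PretrainedConfig in this sketch.
    model_type = ""

    def __init__(self, **kwargs):
        # Keys loaded from config.json become instance attributes, so
        # "model_type": "RefinedWeb" shadows the class attribute.
        for key, value in kwargs.items():
            setattr(self, key, value)

    def to_dict(self):
        # When serializing, the class-level model_type is written back,
        # mirroring the behavior described in point 2 above.
        output = dict(self.__dict__)
        output["model_type"] = self.__class__.model_type
        return output

class RWConfig(BaseConfig):
    model_type = "falcon"

config = RWConfig(model_type="RefinedWeb", multi_query=True)
print(config.model_type)               # RefinedWeb (shadowing instance attribute)
print(config.to_dict()["model_type"])  # falcon (class attribute written back)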

@LopezGG @lw921014 @tju01 @JasmondL We just merged #992, which fixed this issue. Please install vLLM from source and try again. Sorry for the inconvenience.

@chu-tianxiang Thanks for the very detailed explanation! It helps a lot!

Using commit aa84c92ef636e689b506b9842c712e5c615cc73a works for me, while 621980bdc0d5a41e224febf962a6e0474e2b14ef gives me the same error, so it seems the error was introduced in 621980bdc0d5a41e224febf962a6e0474e2b14ef. I am using OpenAssistant/falcon-40b-sft-top1-560 on 4x A6000 and tried it both with tensor_parallel_size=4 and tensor_parallel_size=2.

Hi,

I am facing the same issue with 4 A10G GPUs.