vllm: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Hello everyone, I always get this error with Baichuan and LLaMA models. I found that it's caused by the single_query_cached_kv_attention method in vllm\model_executor\layers\attention.py. After this method is called, the hidden output has some rows of “nan”. How can I fix this? Thanks!
I still get the error even after installing xformers from source.
This is my code:
from vllm import LLM, SamplingParams
# from vllm.transformers_utils.configs.baichuan import BaiChuanConfig

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=1, top_p=0.95)
llm = LLM(...)  # arguments were cut off in the original post
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
and this is my python environment:
and my GPU info:
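+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100S-32Q      On   | 00000000:02:01.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+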
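For anyone trying to reproduce the "rows of nan" observation from the question, a quick per-row check after each layer can help narrow down where the NaNs first show up. This is only a minimal sketch on plain Python lists with made-up data; `nan_rows` is a hypothetical helper name, not something from vLLM. On a real PyTorch tensor, `torch.isnan(x).any(dim=-1)` gives the same per-row flag.

```python
import math

def nan_rows(matrix):
    """Return the indices of rows that contain at least one NaN."""
    return [i for i, row in enumerate(matrix) if any(math.isnan(x) for x in row)]

# Made-up stand-in for a hidden-state matrix (rows = tokens, columns = features).
hidden = [
    [0.1, -0.3, 0.7],
    [math.nan, math.nan, math.nan],  # a fully corrupted row
    [0.5, 0.2, -0.1],
    [0.0, math.nan, 1.2],            # a partially corrupted row
]

bad = nan_rows(hidden)
print(bad)  # [1, 3]
```

Running the check before and after the attention call shows whether that call is the first place the NaNs appear.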
🌡 Have you tried increasing the temperature?
Try increasing the `temperature` value. I had a very low temperature along with other parameters such as `top_k`, which made the next-token distribution too steep. Beam search needs multiple candidate tokens to be available, and with a low temperature I couldn't get them (because we know how temperature works, right?). Try increasing the temperature and it should just work, provided no other complications are involved.
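To make the point about temperature concrete, here is a rough sketch of how dividing logits by the temperature changes the softmax distribution. The logits are made up and `softmax_with_temperature` is a hypothetical helper, not vLLM's sampler code; it only illustrates why a very low temperature leaves almost no probability mass for alternative tokens.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up next-token logits

sharp = softmax_with_temperature(logits, 0.05)  # very low temperature
flat = softmax_with_temperature(logits, 1.0)    # temperature of 1

print([round(p, 4) for p in sharp])
print([round(p, 4) for p in flat])
```

With the low temperature nearly all of the mass lands on the first token, while at temperature 1 the other tokens keep a meaningful share.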
We masked out the values where the token index is larger than the context length, which avoids corrupted `logits` due to `nan` from the uninitialized `k_cache`, which is good. https://github.com/vllm-project/vllm/blob/d1744376ae9fdbfa6a2dc763e1c67309e138fa3d/csrc/attention/attention_kernels.cu#L186-L189
However, we did not mask out the values in `v_vec` where the token index is larger than the context length. As a result, the following `dot` call is incorrect: https://github.com/vllm-project/vllm/blob/d1744376ae9fdbfa6a2dc763e1c67309e138fa3d/csrc/attention/attention_kernels.cu#L264
The reason is that `0 (from logits_vec) * nan (from v_vec)` is `nan`, unfortunately.
I get a similar problem with llama2-70B, with the tensor parallel size set to 8 on 8xA100; changing `torch.empty` to `torch.zeros` does not help either. But when I use the same code and only change the model to gpt-neox/llama2-7B, it works. Can anyone offer any ideas for llama2-70B?
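The `0 * nan` problem described above is easy to reproduce outside the kernel. Below is a small sketch in plain Python, not the actual CUDA code: the weights and values are made up, and `weighted_sum` / `masked_weighted_sum` are hypothetical helpers standing in for the `dot` over `logits_vec` and `v_vec`. It shows that zeroing the weights for padding positions is not enough when the values at those positions are NaN, while skipping those positions gives a finite result.

```python
import math

def weighted_sum(weights, values):
    """Plain dot product over all positions, including padding."""
    return sum(w * v for w, v in zip(weights, values))

def masked_weighted_sum(weights, values, context_len):
    """Dot product that skips positions at or beyond context_len."""
    return sum(
        w * v
        for i, (w, v) in enumerate(zip(weights, values))
        if i < context_len
    )

context_len = 2
weights = [0.25, 0.75, 0.0, 0.0]         # padding weights already zeroed
values = [1.0, 2.0, math.nan, math.nan]  # padding values are garbage (NaN here)

unmasked = weighted_sum(weights, values)
masked = masked_weighted_sum(weights, values, context_len)

print(unmasked)  # nan, because 0.0 * nan is nan
print(masked)    # 1.75
```

The same reasoning explains why a single NaN in uninitialized memory can poison a whole output row even though its attention weight is exactly zero.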