vllm: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Hello everyone, I always get this error with the Baichuan and LLaMA models. I traced it to the single_query_cached_kv_attention method in vllm/model_executor/layers/attention.py: after this method is called, the hidden output contains some rows of nan. How can I fix this? Thanks!
I still get these errors even after installing xformers from source.
This is my code:
from vllm import LLM, SamplingParams
# from vllm.transformers_utils.configs.baichuan import BaiChuanConfig

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=1, top_p=0.95)
llm = LLM(
    model="/.../Baichuan-7b",  # path elided in the original post
    trust_remote_code=True,
    dtype='float16',
    gpu_memory_utilization=0.85,
    tokenizer_mode="slow",
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
and this is my Python environment:
accelerate 0.21.0
aiofiles 23.1.0
aiohttp 3.8.5
aiosignal 1.3.1
altair 5.0.1
annotated-types 0.5.0
anyio 3.7.1
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
asttokens 2.2.1
async-lru 2.0.3
async-timeout 4.0.2
attrs 23.1.0
Babel 2.12.1
backcall 0.2.0
beautifulsoup4 4.12.2
bleach 6.0.0
blinker 1.6.2
boltons 23.0.0
brotlipy 0.7.0
certifi 2022.12.7
cffi 1.15.1
charset-normalizer 2.0.4
click 8.1.6
cmake 3.27.0
comm 0.1.3
conda 23.3.1
conda-content-trust 0.1.3
conda-package-handling 2.0.2
conda_package_streaming 0.7.0
contourpy 1.1.0
cryptography 39.0.1
cycler 0.11.0
datasets 2.14.0
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.7
distlib 0.3.7
docker-pycreds 0.4.0
editables 0.5
exceptiongroup 1.1.2
executing 1.2.0
fastapi 0.100.0
fastjsonschema 2.18.0
ffmpy 0.3.1
filelock 3.12.2
Flask 2.3.2
fonttools 4.41.1
fqdn 1.5.1
frozenlist 1.4.0
fsspec 2023.6.0
gitdb 4.0.10
GitPython 3.1.32
gradio 3.35.2
gradio_client 0.2.10
grpcio 1.56.2
h11 0.14.0
hatchling 1.18.0
httpcore 0.17.3
httpx 0.24.1
huggingface-hub 0.16.4
idna 3.4
ipykernel 6.24.0
ipython 8.14.0
ipython-genutils 0.2.0
ipywidgets 8.0.7
isoduration 20.11.0
itsdangerous 2.1.2
jedi 0.18.2
jieba 0.42.1
Jinja2 3.1.2
joblib 1.3.1
json5 0.9.14
jsonpatch 1.32
jsonpointer 2.1
jsonschema 4.18.4
jsonschema-specifications 2023.7.1
jupyter 1.0.0
jupyter_client 8.3.0
jupyter-console 6.6.3
jupyter_core 5.3.1
jupyter-events 0.6.3
jupyter-lsp 2.2.0
jupyter_server 2.7.0
jupyter_server_terminals 0.4.4
jupyterlab 4.0.3
jupyterlab-pygments 0.2.2
jupyterlab_server 2.24.0
jupyterlab-widgets 3.0.8
kiwisolver 1.4.4
linkify-it-py 2.0.2
lit 16.0.6
markdown-it-py 2.2.0
markdown2 2.4.10
MarkupSafe 2.1.3
matplotlib 3.7.2
matplotlib-inline 0.1.6
mdit-py-plugins 0.3.3
mdurl 0.1.2
mistune 3.0.1
mpmath 1.3.0
msgpack 1.0.5
multidict 6.0.4
multiprocess 0.70.15
mypy-extensions 1.0.0
nbclient 0.8.0
nbconvert 7.7.2
nbformat 5.9.1
nest-asyncio 1.5.6
networkx 3.1
nh3 0.2.14
ninja 1.11.1
nltk 3.8.1
notebook 7.0.0
notebook_shim 0.2.3
numpy 1.25.1
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
orjson 3.9.2
overrides 7.3.1
packaging 23.0
pandas 2.0.3
pandocfilters 1.5.0
parso 0.8.3
pathspec 0.11.1
pathtools 0.1.2
peft 0.4.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.0.0
pip 23.0.1
platformdirs 3.9.1
pluggy 1.0.0
prometheus-client 0.17.1
prompt-toolkit 3.0.39
protobuf 4.23.4
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 12.0.1
pycosat 0.6.4
pycparser 2.21
pydantic 1.10.12
pydantic_core 2.3.0
pydub 0.25.1
Pygments 2.15.1
pyOpenSSL 23.0.0
pyparsing 3.0.9
pyre-extensions 0.0.29
PySocks 1.7.1
python-dateutil 2.8.2
python-json-logger 2.0.7
python-multipart 0.0.6
pytz 2023.3
PyYAML 6.0.1
pyzmq 25.1.0
qtconsole 5.4.3
QtPy 2.3.1
ray 2.6.1
referencing 0.30.0
regex 2023.6.3
requests 2.28.1
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 13.4.2
rouge-chinese 1.0.3
rpds-py 0.9.2
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.6
safetensors 0.3.1
semantic-version 2.10.0
Send2Trash 1.8.2
sentencepiece 0.1.99
sentry-sdk 1.28.1
setproctitle 1.3.2
setuptools 65.6.3
shortuuid 1.0.11
six 1.16.0
smmap 5.0.0
sniffio 1.3.0
soupsieve 2.4.1
stack-data 0.6.2
starlette 0.27.0
svgwrite 1.4.3
sympy 1.12
terminado 0.17.1
tinycss2 1.2.1
tokenizers 0.13.3
tomli 2.0.1
toolz 0.12.0
torch 2.0.1
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformers 4.31.0
triton 2.0.0
trl 0.4.7
trove-classifiers 2023.7.6
typing_extensions 4.7.1
typing-inspect 0.9.0
tzdata 2023.3
uc-micro-py 1.0.2
uri-template 1.3.0
urllib3 1.26.15
uvicorn 0.23.1
virtualenv 20.24.2
vllm 0.1.2 /.../feng/OpenSource/vllm
wandb 0.15.7
wavedrom 2.0.3.post3
wcwidth 0.2.6
webcolors 1.13
webencodings 0.5.1
websocket-client 1.6.1
websockets 11.0.3
Werkzeug 2.3.6
wheel 0.38.4
widgetsnbextension 4.0.8
xformers 0.0.20
xxhash 3.2.0
yarl 1.9.2
zstandard 0.19.0
and my GPU info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100S-32Q       On  | 00000000:02:01.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
About this issue
- State: closed
- Created a year ago
- Comments: 17
🌡 Have you tried increasing the temperature?
Well, try increasing the temperature value. I had a very low temperature along with other parameters such as top_k and top_p, which made the next-token distribution too steep. By beam search's logic you need multiple candidate tokens available, and in the low-temperature case I didn't have them (because we know how temperature works, right?). Try increasing the temperature value and it should just work, if there is no other complexity involved.
We masked out values in logits where the token index is larger than the context length, which avoids corrupting logits with nan from the uninitialized k_cache, which is good.
https://github.com/vllm-project/vllm/blob/d1744376ae9fdbfa6a2dc763e1c67309e138fa3d/csrc/attention/attention_kernels.cu#L186-L189
However, we did not mask out values in v_vec where the token index is larger than the context length. As a result, the following dot call is incorrect.
https://github.com/vllm-project/vllm/blob/d1744376ae9fdbfa6a2dc763e1c67309e138fa3d/csrc/attention/attention_kernels.cu#L264
0 (from logits_vec) * nan (from v_vec) is nan, unfortunately.
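To illustrate in plain PyTorch (a sketch of the arithmetic only; the real code is the CUDA kernel linked above):

import torch

context_len = 2                                 # only the first two tokens are valid
logits_vec = torch.tensor([0.7, 0.3, 0.0])      # position 2 already masked to 0
v_vec = torch.tensor([1.0, 2.0, float("nan")])  # nan from an uninitialized cache slot

# 0 * nan == nan under IEEE 754, so zeroing the logit alone does not help:
print(torch.dot(logits_vec, v_vec))             # tensor(nan)

# Also masking v_vec by token index (the missing step) fixes the dot product:
mask = torch.arange(v_vec.numel()) < context_len
v_masked = torch.where(mask, v_vec, torch.zeros_like(v_vec))
print(torch.dot(logits_vec, v_masked))          # tensor(1.3000)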
I get similar problems with llama2-70B when I set the tensor parallel size to 8 on 8x A100, and changing torch.empty to torch.zeros also does not work. But with the same code and only the model changed to gpt-neox or llama2-7B, it works. Can someone offer me any ideas for llama2-70B?