vllm: Decode error while inferencing a batch of prompts
I’m trying to benchmark the performance of vLLM with OPT. When I pass a relatively large batch of prompts to vLLM, it raises a decode error once the sequence length reaches a threshold (which makes the problem look like an OOM).
A minimal reproduction for this issue:
from vllm import LLM, SamplingParams

def make_input(bs):
    return ["Hello!" for _ in range(bs)]

bs = 128
generate_length = 200

# Create a sampling params object.
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=generate_length)

# Create an LLM.
llm = LLM(
    model="facebook/opt-125m",
    use_dummy_weights=True,
)

input = make_input(bs)
out = llm.generate(input, sampling_params)
When bs=128, the error happens at approximately the 108th token. The error looks like:
Traceback (most recent call last):
  File "vllm-none-problem-repro.py", line 21, in <module>
    out = llm.generate(input, sampling_params)
  File "/llm-bench/vllm-src/vllm/entrypoints/llm.py", line 127, in generate
    return self._run_engine(use_tqdm)
  File "/llm-bench/vllm-src/vllm/entrypoints/llm.py", line 147, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/llm-bench/vllm-src/vllm/engine/llm_engine.py", line 246, in step
    self._decode_sequences(seq_groups)
  File "/llm-bench/vllm-src/vllm/engine/llm_engine.py", line 263, in _decode_sequences
    new_token, new_output_text = detokenize_incrementally(
  File "/llm-bench/vllm-src/vllm/transformers_utils/tokenizer.py", line 73, in detokenize_incrementally
    output_text = tokenizer.convert_tokens_to_string(output_tokens)
  File "/opt/conda/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 533, in convert_tokens_to_string
    return self.backend_tokenizer.decoder.decode(tokens)
TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'
If I use a smaller bs, the “threshold” also increases (>108). For example, it’s around 210 when bs=64. It seems that there is a limit on bs * length.
About this issue
- State: closed
- Created a year ago
- Reactions: 7
- Comments: 19 (4 by maintainers)
I found that the batch size is only indirectly the cause and that this has nothing to do with OOM or anything similar. For example, if I just change the random seed to the following and keep the sequence length and batch size the same, then the bug no longer happens for this specific batch size, but it does happen for another, larger one:
The reason is that for some models there can be a mismatch between config.vocab_size and len(tokenizer). The model outputs a distribution over tokens in the range vocab_size, but only tokens in the range len(tokenizer) should actually be sampled. The remaining tokens are just padding, and when one of them is sampled and decoded, the result is None instead of a string, so the exception is thrown.
To fix this, if I change the config.vocab_size (which is 50272 for facebook/opt-125m) in the following line to 50265, i.e. len(tokenizer), then the bug doesn’t happen anymore for any seed and batch size.
https://github.com/vllm-project/vllm/blob/58a072be15a4e36bee006d1c9a962e527819cf18/vllm/model_executor/models/opt.py#L276
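The mismatch described above can be sketched without vLLM at all. The snippet below is only an illustration under stated assumptions: it uses toy sizes (16 and 13) in place of 50272 and 50265, and the helper names `mask_padding_logits` and `sample` are made up for this example, not part of vLLM or transformers. It masks the padding ids before sampling so that only ids below the tokenizer length can ever be drawn.

```python
import math
import random

def mask_padding_logits(logits, tokenizer_len):
    """Return a copy of logits where ids >= tokenizer_len are set to -inf."""
    return [x if i < tokenizer_len else -math.inf for i, x in enumerate(logits)]

def sample(logits, rng):
    """Sample one token id from softmax(logits)."""
    m = max(logits)
    # exp(-inf) is 0.0, so masked ids get zero weight.
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Toy sizes standing in for config.vocab_size and len(tokenizer) (assumption).
VOCAB_SIZE = 16
TOKENIZER_LEN = 13

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]  # random "dummy" logits
masked = mask_padding_logits(logits, TOKENIZER_LEN)
sampled = [sample(masked, rng) for _ in range(10_000)]
```

Without the mask, the three trailing ids would eventually be drawn; with it, every sampled id stays inside the tokenizer's range.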
I have also observed this for LLaMA & LLaMA-2, where it seems that for some models on Hugging Face the vocab_size does correspond to the actual number of tokens that should be sampled, while for others it doesn’t. It depends on whether the number of tokens is already a multiple of 16 or whether padding is needed. There might also be models other than OPT and LLaMA where this happens.
A fix in vLLM could be to obtain the number of tokens from the tokenizer instead of the config.json file.
This is still an issue…
@esmeetu any suggestion here to have a good default way to support fine-tuned model?
I took a look at how transformers deals with this problem. Their idea is simple: if we get a token id beyond the tokenizer’s length, the decode step just treats the token as an empty string.
Here is a demo:
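A minimal sketch of that fallback idea, using a toy vocabulary rather than a real transformers tokenizer. Everything here (`ID_TO_TOKEN`, `convert_ids_to_tokens`, `safe_tokens_to_string`) is hypothetical and only mimics the behaviour seen in the traceback, where an out-of-range id ends up as None in the token list.

```python
# Toy stand-in for a tokenizer vocabulary (assumption: real ones are much larger).
ID_TO_TOKEN = {0: "Hello", 1: "!", 2: " world"}

def convert_ids_to_tokens(ids):
    """Mimic the failure mode: ids outside the vocabulary map to None."""
    return [ID_TO_TOKEN.get(i) for i in ids]

def safe_tokens_to_string(tokens):
    """Join tokens, treating None (an out-of-vocabulary id) as an empty string."""
    return "".join(t if t is not None else "" for t in tokens)

tokens = convert_ids_to_tokens([0, 2, 7, 1])  # id 7 is outside the toy vocabulary
text = safe_tokens_to_string(tokens)
```

The point of the design is that a stray padding id degrades to missing text instead of raising a TypeError in the decoder.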
I think the issue is that the vocab_size is expanded to be a nice multiple for the GPU during training. These tokens will not be trained (since there is nothing in the dataset), so they are very unlikely to be sampled.
@simon-mo I think an approach we could take is to expand the tokenizer to have more pad tokens in this scenario. This will allow vocab_size to be a nice multiple for the GPUs. Thoughts?
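To make the padding idea concrete, here is a rough sketch of the arithmetic. The multiple of 16 comes from the earlier comment; the helper names and the `<extra_pad_i>` token strings are invented for illustration. With a real transformers tokenizer, such strings could presumably be registered via `tokenizer.add_tokens(...)`, but that step is not shown or tested here.

```python
def padded_vocab_size(tokenizer_len, multiple):
    """Round tokenizer_len up to the next multiple (ceiling division)."""
    return -(-tokenizer_len // multiple) * multiple

def extra_pad_tokens(tokenizer_len, vocab_size):
    """Hypothetical placeholder names for the ids between the two sizes."""
    return [f"<extra_pad_{i}>" for i in range(vocab_size - tokenizer_len)]

# Numbers quoted in this thread for facebook/opt-125m.
TOKENIZER_LEN = 50265
vocab_size = padded_vocab_size(TOKENIZER_LEN, 16)
extra = extra_pad_tokens(TOKENIZER_LEN, vocab_size)
```

For OPT this reproduces the gap discussed above: rounding 50265 up to a multiple of 16 gives 50272, leaving 7 ids that have no tokenizer entry.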
… vocab_size in config.json because I am using CodeLLaMA, which uses vocab_parallel_embedding, and there is an assertion in it when loading weights. In order to use vocab parallel, it is also necessary for the model’s vocab size to be of the form 2^n*k.
… self.sampler = Sampler(1111111) # for example
does solve the problem.
… tokenizer_vocab_size into model’s config when initializing LLM) then use self.sampler = Sampler(config.tokenizer_vocab_size) instead.
Hope this information helps.
@tju01 Thank you! Now it looks like a larger batch size and sequence length just increase the probability that this error happens.
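A back-of-the-envelope check of that conclusion, as a sketch only. It assumes every sampled token is drawn uniformly from all vocab_size ids (roughly what dummy weights might give) and ignores temperature and top_p, so the numbers are indicative at best. The function name is made up for this example.

```python
def prob_at_least_one_padding_token(vocab_size, tokenizer_len, batch_size, gen_len):
    """P(at least one padding id is sampled) under uniform, independent sampling."""
    p_pad = (vocab_size - tokenizer_len) / vocab_size  # chance per sampled token
    n_samples = batch_size * gen_len                   # total tokens sampled
    return 1.0 - (1.0 - p_pad) ** n_samples

# The two (batch size, approximate failing length) pairs reported in the issue.
p_bs128 = prob_at_least_one_padding_token(50272, 50265, 128, 108)
p_bs64 = prob_at_least_one_padding_token(50272, 50265, 64, 210)
print(f"bs=128, len=108: {p_bs128:.2f}")
print(f"bs=64,  len=210: {p_bs64:.2f}")
```

Under this assumption both reported configurations sample a similar total number of tokens (13824 vs. 13440), which fits the observation that the failure point looked like a limit on bs * length.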