vllm: Decode error while running inference on a batch of prompts

I’m trying to benchmark the performance of vLLM with OPT. But I find that when I pass a relatively large batch of prompts to vLLM, it raises a decode error once the sequence length reaches a certain threshold (which makes the problem look like an OOM).

A minimal reproduction for this issue:

from vllm import LLM, SamplingParams

def make_input(bs):
    return ["Hello!" for _ in range(bs)]

bs = 128
generate_length = 200

# Create a sampling params object.
sampling_params = SamplingParams(
    temperature=0.8, 
    top_p=0.95, 
    max_tokens=generate_length)

# Create an LLM.
llm = LLM(
    model="facebook/opt-125m",
    use_dummy_weights=True,
)
input = make_input(bs)
out = llm.generate(input, sampling_params)

When bs=128, the error happens at approximately the 108th generated token. The error looks like:

Traceback (most recent call last):
  File "vllm-none-problem-repro.py", line 21, in <module>
    out = llm.generate(input, sampling_params)
  File "/llm-bench/vllm-src/vllm/entrypoints/llm.py", line 127, in generate
    return self._run_engine(use_tqdm)
  File "/llm-bench/vllm-src/vllm/entrypoints/llm.py", line 147, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/llm-bench/vllm-src/vllm/engine/llm_engine.py", line 246, in step
    self._decode_sequences(seq_groups)
  File "/llm-bench/vllm-src/vllm/engine/llm_engine.py", line 263, in _decode_sequences
    new_token, new_output_text = detokenize_incrementally(
  File "/llm-bench/vllm-src/vllm/transformers_utils/tokenizer.py", line 73, in detokenize_incrementally
    output_text = tokenizer.convert_tokens_to_string(output_tokens)
  File "/opt/conda/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 533, in convert_tokens_to_string
    return self.backend_tokenizer.decoder.decode(tokens)
TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'

If I use a smaller bs, the “threshold” also increases (>108); for example, it is around 210 when bs=64. It seems there is a limit on bs * length.

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 7
  • Comments: 19 (4 by maintainers)

Most upvoted comments

I found that the batch size is only indirectly the cause, and it doesn’t have anything to do with OOM or similar things. For example, if I just change the random seed as follows and keep the sequence length and batch size the same, the bug no longer happens for this specific batch size, but it does happen for another, larger one:

# Create an LLM.
llm = LLM(
    model="facebook/opt-125m",
    use_dummy_weights=True,
    seed=2,
)

The reason is that for some models there can be a mismatch between config.vocab_size and len(tokenizer). The model outputs a distribution over token ids in the range vocab_size, but only ids in the range len(tokenizer) should actually be sampled. The remaining ids are just padding; when one of these tokens is sampled and then decoded, the result is None instead of a string, and that is where the exception is thrown.

To fix this, if I change config.vocab_size (which is 50272 for facebook/opt-125m) in the following line to 50265, i.e. len(tokenizer), then the bug no longer happens for any seed or batch size.

https://github.com/vllm-project/vllm/blob/58a072be15a4e36bee006d1c9a962e527819cf18/vllm/model_executor/models/opt.py#L276

I have also observed this for LLaMA & LLaMA-2, where for some models on Hugging Face vocab_size does correspond to the actual number of tokens that should be sampled, while for others it doesn’t; it depends on whether the number of tokens is already a multiple of 16 or whether padding is needed. There might also be models other than OPT and LLaMA where this happens.

from transformers import AutoTokenizer, PretrainedConfig

print(len(AutoTokenizer.from_pretrained('meta-llama/Llama-2-13b-hf'))) # 32000
print(PretrainedConfig.from_pretrained('meta-llama/Llama-2-13b-hf').vocab_size) # 32000

print(len(AutoTokenizer.from_pretrained('NousResearch/Nous-Hermes-Llama2-13b'))) # 32001
print(PretrainedConfig.from_pretrained('NousResearch/Nous-Hermes-Llama2-13b').vocab_size) # 32032

A fix in vLLM could be to obtain the number of tokens from the tokenizer instead of from the config.json file.
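
For illustration, here is a minimal sketch (plain PyTorch, not vLLM’s actual sampler code) of what masking the padded vocabulary entries before sampling could look like; the logits tensor and shapes here are made up:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
real_vocab_size = len(tokenizer)  # 50265

# Dummy logits over the padded vocabulary (config.vocab_size == 50272 for this model).
logits = torch.randn(4, 50272)

# Mask the padding slots so they can never be sampled.
logits[:, real_vocab_size:] = float("-inf")

probs = torch.softmax(logits, dim=-1)
next_tokens = torch.multinomial(probs, num_samples=1)
assert next_tokens.max().item() < real_vocab_size  # every sampled id is decodable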

This is still an issue…

@esmeetu any suggestion here to have a good default way to support fine-tuned model?

I took a look at how transformers deals with this problem. Their idea is simple: if a token id is larger than the tokenizer’s vocabulary size, the decode step just treats that token as an empty string.

Here is a demo:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
print(len(tokenizer)) # 50265
tokenizer.decode([340]) # a token id that exists, i.e. ' news'
tokenizer.decode([34000000]) # a token id that does not exist, returns empty string ''
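
Along the same lines, here is a minimal sketch (not vLLM’s actual code) of guarding the incremental detokenization: ids the tokenizer cannot map come back as None from convert_ids_to_tokens, and dropping them avoids the TypeError above. The id 50270 below is assumed to be one of the padding ids of facebook/opt-125m (len(tokenizer) == 50265, config.vocab_size == 50272):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

token_ids = [340, 50270]  # 50270 is outside len(tokenizer), so it maps to None
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(tokens)

# Drop the None entries before convert_tokens_to_string so it cannot crash.
safe_tokens = [t for t in tokens if t is not None]
print(tokenizer.convert_tokens_to_string(safe_tokens))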

I think the issue is that the vocab_size is padded to a GPU-friendly multiple during training. These extra tokens are never trained (since nothing in the dataset maps to them), so they are very unlikely to be sampled.

@simon-mo

I think an approach we could take is to expand the tokenizer with more pad tokens in this scenario (see the sketch below). This would allow:

  • vocab_size to remain a nice multiple for the GPUs
  • gracefully handling the case where one of the “fake” tokens is predicted

Thoughts?
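
For illustration, a hypothetical sketch of that idea (the <extra_pad_i> token names are made up): pad the tokenizer with dummy tokens until it matches config.vocab_size, so every id the model can emit is decodable:

from transformers import AutoConfig, AutoTokenizer

model_name = "facebook/opt-125m"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add dummy tokens until the tokenizer covers the padded model vocabulary.
n_missing = config.vocab_size - len(tokenizer)  # 50272 - 50265 = 7 here
if n_missing > 0:
    tokenizer.add_tokens([f"<extra_pad_{i}>" for i in range(n_missing)])

assert len(tokenizer) == config.vocab_size
# A formerly out-of-range id now decodes to a placeholder string instead of crashing.
print(tokenizer.decode([config.vocab_size - 1]))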

  1. I met the same problem. I cannot fix it by changing vocab_size in config.json, because I am using CodeLLaMA, which uses vocab_parallel_embedding, and there is an assertion in it when loading weights. To use vocab parallelism, the model’s vocab size also needs to have a 2^n * k form.
  2. Hard-coding the size in LLaMAForCausalLM, e.g. self.sampler = Sampler(1111111), does solve the problem.
  3. A simple way to fix this bug: write another attribute whose value is len(tokenizer) (e.g., write tokenizer_vocab_size into the model’s config when initializing LLM), then use self.sampler = Sampler(config.tokenizer_vocab_size) instead (a sketch follows below).

Hope this information helps.
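
A hypothetical sketch of option 3 above; tokenizer_vocab_size is not a real HF or vLLM config field, just an extra attribute written onto the config so the model could size its Sampler from it:

from transformers import AutoConfig, AutoTokenizer

model_name = "facebook/opt-125m"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Record the real number of decodable tokens alongside the (possibly padded) vocab_size.
config.tokenizer_vocab_size = len(tokenizer)  # 50265, vs. config.vocab_size == 50272

# Inside the model, the sampler would then be built as
#     self.sampler = Sampler(config.tokenizer_vocab_size)
# instead of Sampler(config.vocab_size).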

@tju01 Thank you! Now it looks like a larger batch size and sequence length just increase the probability that this error happens.
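
To make that concrete, a rough back-of-the-envelope sketch; the per-token probability p of sampling a padding id is unknown (it depends on the untrained logits of those slots), so the value below is purely illustrative:

# Chance of sampling at least one out-of-range token over a whole run,
# assuming (illustratively) each sampled token hits a padding id independently
# with probability p.
p = 1e-4                      # illustrative guess, not a measured value
bs, generate_length = 128, 200
n_samples = bs * generate_length

p_at_least_one = 1 - (1 - p) ** n_samples
print(f"{p_at_least_one:.2%}")  # grows quickly as bs * length grows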