exllamav2: Generation never stops
I'm testing it with TheBloke's airoboros-70b-2.1 on 2x3090. Generating with the simple generator works great; the output is the same quality as exllama-v1 but much faster. However, I cannot make it stop generating. It just continues until it hits max_new_len tokens. The line

```python
settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])
```

makes no difference.
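For reference, a minimal sketch of this kind of setup. The model path, GPU split and sampler values are placeholders, and the calls follow the exllamav2 example scripts as I understand them, so treat exact signatures as assumptions:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/airoboros-70b-2.1-exl2"   # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load([20, 24])                                   # placeholder 2x3090 split (GB per GPU)

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
generator.warmup()

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.05

# With or without this line, generation runs all the way to the token limit.
# settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])

output = generator.generate_simple("Once upon a time,", settings, 512)
print(output)
```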
UPDATE: this fixes it:
```diff
diff --git a/exllamav2/generator/base.py b/exllamav2/generator/base.py
index 63db5f5..c6ed9d6 100644
--- a/exllamav2/generator/base.py
+++ b/exllamav2/generator/base.py
@@ -58,6 +58,8 @@ class ExLlamaV2BaseGenerator:
             logits = self.model.forward(self.sequence_ids[:, -1:], self.cache, input_mask = mask).float().cpu()
             token, _ = ExLlamaV2Sampler.sample(logits, gen_settings, self.sequence_ids, random.random())
+            if token == self.tokenizer.eos_token_id:
+                break
             self.sequence_ids = torch.cat([self.sequence_ids, token], dim = 1)
         text = self.tokenizer.decode(self.sequence_ids)
```
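For comparison, the same stop check can be written as a stand-alone, single-sequence loop outside the library. This is only a sketch: `generate_until_eos` is a hypothetical helper, and the prompt-ingestion call (`preprocess_only`) is assumed to match the generator source that the patch modifies.

```python
import random
import torch
from exllamav2.generator import ExLlamaV2Sampler

def generate_until_eos(model, cache, tokenizer, prompt, gen_settings, max_new_tokens):
    # Hypothetical stand-alone loop mirroring ExLlamaV2BaseGenerator.generate_simple,
    # with an explicit stop check against the tokenizer's EOS id (single sequence only).
    sequence_ids = tokenizer.encode(prompt)

    # Feed the prompt (all but the last token) into the cache, then generate.
    cache.current_seq_len = 0
    model.forward(sequence_ids[:, :-1], cache, preprocess_only = True)

    for _ in range(max_new_tokens):
        logits = model.forward(sequence_ids[:, -1:], cache).float().cpu()
        token, _ = ExLlamaV2Sampler.sample(logits, gen_settings, sequence_ids, random.random())

        # token is a (1, 1) tensor; .item() makes the comparison unambiguous.
        if token.item() == tokenizer.eos_token_id:
            break

        sequence_ids = torch.cat([sequence_ids, token], dim = 1)

    return tokenizer.decode(sequence_ids)
```

Breaking before the `torch.cat` means the EOS token itself never lands in `sequence_ids`, so it never shows up in the decoded text, same as in the patch above.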
About this issue
- State: closed
- Created 10 months ago
- Comments: 21 (9 by maintainers)
Well, this was an oversight on my part. I was too focused on the streaming generator when updating the sampling logic and messing around with grammar and stuff, so `generate_simple` was overlooked in all that. Indeed, the `eos` signal from the sampler has a different meaning and relates to sampling filters (full implementation still in the works), with the actual EOS handling being delegated to the generator loop. To make matters more confusing, SentencePiece was masking the issue by only decoding up to the EOS token anyway.

The latest commit should fix it, though. You do need to handle batches, yes: when batching, it will keep generating until every sequence in the batch has reached an EOS token, replacing any tokens after EOS with padding tokens that then decode to nothing.
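A rough sketch of that batched behaviour. The helper `mask_after_eos` and its variable names are illustrative only, not the actual generator code; it just shows the padding-after-EOS logic described above:

```python
import torch

def mask_after_eos(new_tokens, finished, eos_token_id, pad_token_id):
    # new_tokens: (batch, 1) tensor of tokens sampled this step
    # finished:   (batch,) bool tensor, True once a sequence has emitted EOS
    # Tokens for already-finished sequences are replaced with padding, then the
    # finished mask picks up any sequence that emitted EOS on this step.
    new_tokens = new_tokens.clone()
    new_tokens[finished] = pad_token_id
    finished = finished | (new_tokens.squeeze(1) == eos_token_id)
    return new_tokens, finished

# Usage inside a batched generation loop (sketch):
#   finished = torch.zeros(batch_size, dtype = torch.bool)
#   for _ in range(max_new_tokens):
#       token, _ = ExLlamaV2Sampler.sample(...)
#       token, finished = mask_after_eos(token, finished, eos_id, pad_id)
#       sequence_ids = torch.cat([sequence_ids, token], dim = 1)
#       if finished.all(): break
```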
@turboderp I observed the same issue when using Llama2-70b-chat, Exl2 format, 2.3bpw. The eos token id for llama2 is 2. I used the correct prompt format and verified that the bos token is added to my prompt by printing out the encoded token_ids. The issue is that when an eos token is generated by the model, `ExLlamaV2Sampler.sample` doesn't return `eos` as `True`. I printed the tokens generated and the `eos` value returned by the sampler and found that this is the case: the returned `eos` is always false, so the generation loop never stops until the max output token length is reached.

In `ExLlamaV2Sampler.sample`, this return value is only related to the filter, and I don't know what a filter is. If the correct behavior for the generator is to stop generating when an eos token is generated, then perhaps the stop condition `if eos: break` is wrong, since the value of this `eos` has nothing to do with the eos token.
@ortegaalfredo We fixed the never-ending generation in v2 by adding/prepending the BOS token `<s>` to the prompt in code: the new v2 tokenizer doesn't have the toggle to encode special tokens, so it throws away the BOS from our prompt at this point, causing the never-ending generation for our specific model. FYI, our model is trained with prompts that always start with `<s>`, so this may not apply to your situation. In any case, print out the tokens after encode to make sure everything matches the v1 code.
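A sketch of that kind of check: encode the prompt, print the ids, and prepend the BOS id manually if the tokenizer dropped it. `encode_with_bos` is a hypothetical helper, and because the special-token flags differed between versions, nothing here relies on them:

```python
import torch

def encode_with_bos(tokenizer, prompt):
    # Assumes an ExLlamaV2Tokenizer-like object exposing encode() returning a
    # (1, seq_len) tensor, plus bos_token_id; both are assumptions here.
    ids = tokenizer.encode(prompt)
    print("encoded ids:", ids.tolist())                  # inspect what was kept/dropped

    if ids[0, 0].item() != tokenizer.bos_token_id:
        bos = torch.tensor([[tokenizer.bos_token_id]], dtype = ids.dtype)
        ids = torch.cat([bos, ids], dim = 1)
        print("prepended BOS, new ids:", ids.tolist())

    return ids
```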
But I think there is a bug in the generator code. It never detects the eos_token at all, even if the LLM emits it. I updated the issue with a patch that works. Still, the model sometimes gets out of control with some prompts. It seems to be very sensitive to temperature, unlike exllama v1.