exllamav2: Generation never stops
I'm testing it with TheBloke's airoboros-70b-2.1 on 2x3090. Generating with the simple generator works great; the output is the same quality as exllama-v1 but much faster. However, I cannot make it stop generating. It just continues until it hits max_new_len tokens. The line

```python
settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])
```

makes no difference.
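For reference, a minimal sketch of this kind of setup. The model path, GPU split and sampler values are placeholders, and the calls follow the exllamav2 example scripts as I understand them, so treat exact signatures as assumptions:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/airoboros-70b-2.1-exl2"   # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load([20, 24])                                   # placeholder 2x3090 split (GB per GPU)

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
generator.warmup()

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.05

# With or without this line, generation runs all the way to the token limit.
# settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])

output = generator.generate_simple("Once upon a time,", settings, 512)
print(output)
```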
UPDATE: this fixes it:
```diff
diff --git a/exllamav2/generator/base.py b/exllamav2/generator/base.py
index 63db5f5..c6ed9d6 100644
--- a/exllamav2/generator/base.py
+++ b/exllamav2/generator/base.py
@@ -58,6 +58,8 @@ class ExLlamaV2BaseGenerator:
             logits = self.model.forward(self.sequence_ids[:, -1:], self.cache, input_mask = mask).float().cpu()
             token, _ = ExLlamaV2Sampler.sample(logits, gen_settings, self.sequence_ids, random.random())
+            if token == self.tokenizer.eos_token_id:
+                break
             self.sequence_ids = torch.cat([self.sequence_ids, token], dim = 1)
         text = self.tokenizer.decode(self.sequence_ids)
```
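For comparison, the same stop check can be written as a stand-alone, single-sequence loop outside the library. This is only a sketch: `generate_until_eos` is a hypothetical helper, and the prompt-ingestion call (`preprocess_only`) is assumed to match the generator source that the patch modifies.

```python
import random
import torch
from exllamav2.generator import ExLlamaV2Sampler

def generate_until_eos(model, cache, tokenizer, prompt, gen_settings, max_new_tokens):
    # Hypothetical stand-alone loop mirroring ExLlamaV2BaseGenerator.generate_simple,
    # with an explicit stop check against the tokenizer's EOS id (single sequence only).
    sequence_ids = tokenizer.encode(prompt)

    # Feed the prompt (all but the last token) into the cache, then generate.
    cache.current_seq_len = 0
    model.forward(sequence_ids[:, :-1], cache, preprocess_only = True)

    for _ in range(max_new_tokens):
        logits = model.forward(sequence_ids[:, -1:], cache).float().cpu()
        token, _ = ExLlamaV2Sampler.sample(logits, gen_settings, sequence_ids, random.random())

        # token is a (1, 1) tensor; .item() makes the comparison unambiguous.
        if token.item() == tokenizer.eos_token_id:
            break

        sequence_ids = torch.cat([sequence_ids, token], dim = 1)

    return tokenizer.decode(sequence_ids)
```

Breaking before the `torch.cat` means the EOS token itself never lands in `sequence_ids`, so it never shows up in the decoded text, same as in the patch above.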
About this issue
- State: closed
- Created 10 months ago
- Comments: 21 (9 by maintainers)
Well, this was an oversight on my part. I was too focused on the streaming generator when updating the sampling logic and messing around with grammar and stuff, so `generate_simple` was overlooked in all that. Indeed, the `eos` signal from the sampler has a different meaning and relates to sampling filters (full implementation still in the works), with the actual EOS handling being delegated to the generator loop. To make matters more confusing, SentencePiece was masking the issue by only decoding up to the EOS token anyway.

The latest commit should fix it, though. You do need to handle batches, yes: when batching, it will keep generating until every sequence in the batch has reached an EOS token, replacing any tokens after EOS with padding tokens that then decode to nothing.
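A rough sketch of that batched behaviour. The helper `mask_after_eos` and its variable names are illustrative only, not the actual generator code; it just shows the padding-after-EOS logic described above:

```python
import torch

def mask_after_eos(new_tokens, finished, eos_token_id, pad_token_id):
    # new_tokens: (batch, 1) tensor of tokens sampled this step
    # finished:   (batch,) bool tensor, True once a sequence has emitted EOS
    # Tokens for already-finished sequences are replaced with padding, then the
    # finished mask picks up any sequence that emitted EOS on this step.
    new_tokens = new_tokens.clone()
    new_tokens[finished] = pad_token_id
    finished = finished | (new_tokens.squeeze(1) == eos_token_id)
    return new_tokens, finished

# Usage inside a batched generation loop (sketch):
#   finished = torch.zeros(batch_size, dtype = torch.bool)
#   for _ in range(max_new_tokens):
#       token, _ = ExLlamaV2Sampler.sample(...)
#       token, finished = mask_after_eos(token, finished, eos_id, pad_id)
#       sequence_ids = torch.cat([sequence_ids, token], dim = 1)
#       if finished.all(): break
```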
@turboderp I observed the same issue when using Llama2-70b-chat, Exl2 format, 2.3bpw. The eos token id for llama2 is 2. I used the correct prompt format and verified that the bos token is added to my prompt by printing out the encoded token_ids. The issue is that when an eos token is generated by the model, `ExLlamaV2Sampler.sample` doesn't return `eos` as `True`. I printed the tokens generated and the `eos` value returned by the sampler and found that this is the case: the returned `eos` is always false, so the generation loop never stops until the max output token length is reached.

In `ExLlamaV2Sampler.sample`, this return value is only related to the filter, and I don't know what a filter is. If the correct behavior for the generator is to stop generating when an eos token is generated, then perhaps the stop condition `if eos: break` is wrong, since the value of this `eos` has nothing to do with the eos token.
@ortegaalfredo We fixed the never-ending generation in v2 by adding/prepending the BOS token `<s>` to the prompt in code: the new v2 tokenizer doesn't have the toggle to encode special tokens, so it throws away the BOS from our prompt at this point, causing the never-ending generation for our specific model. FYI, our model is trained with prompts that always start with `<s>`, so this may not apply to your situation. In any case, print out the tokens after encode to make sure everything matches the v1 code.
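A sketch of that kind of check: encode the prompt, print the ids, and prepend the BOS id manually if the tokenizer dropped it. `encode_with_bos` is a hypothetical helper, and because the special-token flags differed between versions, nothing here relies on them:

```python
import torch

def encode_with_bos(tokenizer, prompt):
    # Assumes an ExLlamaV2Tokenizer-like object exposing encode() returning a
    # (1, seq_len) tensor, plus bos_token_id; both are assumptions here.
    ids = tokenizer.encode(prompt)
    print("encoded ids:", ids.tolist())                  # inspect what was kept/dropped

    if ids[0, 0].item() != tokenizer.bos_token_id:
        bos = torch.tensor([[tokenizer.bos_token_id]], dtype = ids.dtype)
        ids = torch.cat([bos, ids], dim = 1)
        print("prepended BOS, new ids:", ids.tolist())

    return ids
```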
But I think there is a bug in the generator code. It never detects the eos_token at all, even if the LLM emits it. I updated the issue with a patch that works. Still, the model sometimes gets out of control with some prompts. It seems to be very sensitive to temperature, unlike exllama v1.