transformers: `stopping_criteria` not working with llama

System Info

I am generating text from the llama-13b model, but it keeps generating even after the stopping criteria are met. The same stopping criteria work fine with other models such as GPT-J 6B.

I loaded llama-13b with model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', load_in_8bit=True), and my stopping criteria list looks like this:

stopping_criteria_list = transformers.StoppingCriteriaList([
        _SentinelTokenStoppingCriteria(
            sentinel_token_ids=tokenizer(
                "\n",
                add_special_tokens=False,
                return_tensors="pt",
            ).input_ids.to("cuda"),
            starting_idx=tokenized_items.input_ids.shape[-1])
    ])
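For reference, a minimal sketch of the setup the snippet above assumes; model_name, prompt, and the use of AutoTokenizer here are placeholders/assumptions, not values taken from the issue:

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders -- the issue does not show these values.
model_name = "path/to/llama-13b"
prompt = "Some prompt text"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True
)

# tokenized_items is the tokenized prompt; starting_idx is its length in tokens,
# so the criteria only inspects newly generated tokens.
tokenized_items = tokenizer(prompt, return_tensors="pt").to("cuda")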

Thank you.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. load llama: model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', load_in_8bit=True)
  2. make stopping criteria
stopping_criteria_list = transformers.StoppingCriteriaList([
        _SentinelTokenStoppingCriteria(
            sentinel_token_ids=tokenizer(
                "\n",
                add_special_tokens=False,
                return_tensors="pt",
            ).input_ids.to("cuda"),
            starting_idx=tokenized_items.input_ids.shape[-1])
    ])
...
class _SentinelTokenStoppingCriteria(transformers.StoppingCriteria):

    def __init__(self, sentinel_token_ids: torch.LongTensor,
                 starting_idx: int):
        transformers.StoppingCriteria.__init__(self)
        self.sentinel_token_ids = sentinel_token_ids
        self.starting_idx = starting_idx

    def __call__(self, input_ids: torch.LongTensor,
                 _scores: torch.FloatTensor) -> bool:
        for sample in input_ids:
            trimmed_sample = sample[self.starting_idx:]
            # Can't unfold, output is still too tiny. Skip.
            if trimmed_sample.shape[-1] < self.sentinel_token_ids.shape[-1]:
                continue

            for window in trimmed_sample.unfold(
                    0, self.sentinel_token_ids.shape[-1], 1):
                if torch.all(torch.eq(self.sentinel_token_ids, window)):
                    return True
        return False
  3. generate
model_output = model.generate(stopping_criteria=stopping_criteria_list, 
                                **tokenized_items, **generation_settings,
                                pad_token_id=tokenizer.eos_token_id)
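As a quick sanity check (not part of the original report), the sliding-window matching in _SentinelTokenStoppingCriteria above can be exercised on its own with made-up token ids, independent of any model:

import torch

sentinel = torch.tensor([[13]])  # pretend id 13 is the stop token "\n"
criteria = _SentinelTokenStoppingCriteria(sentinel_token_ids=sentinel, starting_idx=3)

# The first 3 ids play the role of the prompt; the "generated" part [7, 13, 9]
# contains id 13, so the criteria should return True and generation would stop.
ids = torch.tensor([[5, 6, 8, 7, 13, 9]])
print(criteria(ids, None))  # True

# Without the sentinel id in the generated part, it returns False.
ids = torch.tensor([[5, 6, 8, 7, 12, 9]])
print(criteria(ids, None))  # False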

Expected behavior

Generation should stop once \n has been generated.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 23 (7 by maintainers)

Most upvoted comments

Hey @poohzaza166 👋

I had a look at your snippet, and the problem does not stem from the stopping criteria nor from the llama model itself, but rather from how the tokenizer works. It also doesn’t seem to be a bug. My recommendation would be to design the stopping criteria from the token ids, not from raw text 😃

See this example:

from transformers import LlamaTokenizer
import transformers
import torch


tokenizer = LlamaTokenizer.from_pretrained('huggyllama/llama-7b')


class _SentinelTokenStoppingCriteria(transformers.StoppingCriteria):

    def __init__(self, sentinel_token_ids: torch.LongTensor,
                 starting_idx: int):
        transformers.StoppingCriteria.__init__(self)
        self.sentinel_token_ids = sentinel_token_ids
        self.starting_idx = starting_idx

    def __call__(self, input_ids: torch.LongTensor, _scores: torch.FloatTensor) -> bool:
        for sample in input_ids:
            trimmed_sample = sample[self.starting_idx:]
            # Can't unfold, output is still too tiny. Skip.
            if trimmed_sample.shape[-1] < self.sentinel_token_ids.shape[-1]:
                continue

            for window in trimmed_sample.unfold(0, self.sentinel_token_ids.shape[-1], 1):
                if torch.all(torch.eq(self.sentinel_token_ids, window)):
                    return True
        return False


sentinel_token_ids = tokenizer("pooh:", add_special_tokens=False, return_tensors="pt").input_ids.to("cuda")
print(sentinel_token_ids)

stopping_criteria_list = transformers.StoppingCriteriaList([
    _SentinelTokenStoppingCriteria(sentinel_token_ids=sentinel_token_ids, starting_idx=0)
])

test_input_1 = """This is a test.\npooh: potato."""
test_input_ids = tokenizer(test_input_1, add_special_tokens=False, return_tensors="pt").input_ids.to("cuda")
print(stopping_criteria_list(test_input_ids, None))

test_input_2 = """This is a test. pooh: potato."""
test_input_ids = tokenizer(test_input_2, add_special_tokens=False, return_tensors="pt").input_ids.to("cuda")
print(stopping_criteria_list(test_input_ids, None))

@oobabooga Those issues will be fixed by #22402

I can reproduce the issue. Here is some additional code for testing:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('models/llama-7b/')

>>> tokenizer.encode('\nYou:', add_special_tokens=False)
[29871, 13, 3492, 29901]

>>> tokenizer.decode([29871, 13, 3492, 29901])
' \nYou:'

>>> tokenizer.decode([13, 3492, 29901])
' \nYou:'

The extra space token (29871) shows up everywhere. Also,

>>> tokenizer.encode(' ', add_special_tokens=False)
[259]

>>> tokenizer.decode([259])
'  ' # two spaces

>>> tokenizer.decode([29871]) 
' ' # one space

If you encode a space, it becomes id 259 instead of 29871. And if you decode [259], the result is two spaces.

Very confusing behavior overall.
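A common workaround (not mentioned in this thread) is to tokenize the stop string behind a throwaway prefix and then slice the prefix tokens off, so the sentinel ids match what actually appears inside a generated sequence instead of carrying the standalone 29871 prefix token. The helper below (stop_string_to_ids) is hypothetical, just to illustrate the idea:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('models/llama-7b/')

def stop_string_to_ids(tokenizer, text, prefix="x"):
    # Hypothetical helper: tokenize the stop string in context and drop the
    # tokens belonging to the throwaway prefix, so no leading 29871 is kept.
    prefix_ids = tokenizer.encode(prefix, add_special_tokens=False)
    full_ids = tokenizer.encode(prefix + text, add_special_tokens=False)
    return full_ids[len(prefix_ids):]

print(stop_string_to_ids(tokenizer, "\nYou:"))  # e.g. [13, 3492, 29901], no leading 29871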