transformers: `stopping_criteria` not working with llama

System Info

I am generating text from the llama-13b model, but it keeps generating even after the stopping criteria are met. The same stopping criteria work fine with other models such as GPT-J 6B.

I loaded llama-13b with model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', load_in_8bit=True), and my stopping criteria list looks like this:

stopping_criteria_list = transformers.StoppingCriteriaList([
        _SentinelTokenStoppingCriteria(
            sentinel_token_ids=tokenizer(
                "\n",
                add_special_tokens=False,
                return_tensors="pt",
            ).input_ids.to("cuda"),
            starting_idx=tokenized_items.input_ids.shape[-1])
    ])
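For reference, a minimal sketch of the setup the snippet above assumes; model_name, prompt, and the use of AutoTokenizer here are placeholders/assumptions, not values taken from the issue:

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders -- the issue does not show these values.
model_name = "path/to/llama-13b"
prompt = "Some prompt text"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True
)

# tokenized_items is the tokenized prompt; starting_idx is its length in tokens,
# so the criteria only inspects newly generated tokens.
tokenized_items = tokenizer(prompt, return_tensors="pt").to("cuda")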

Thank you.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. load llama: model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', load_in_8bit=True)
  2. make stopping criteria
stopping_criteria_list = transformers.StoppingCriteriaList([
        _SentinelTokenStoppingCriteria(
            sentinel_token_ids=tokenizer(
                "\n",
                add_special_tokens=False,
                return_tensors="pt",
            ).input_ids.to("cuda"),
            starting_idx=tokenized_items.input_ids.shape[-1])
    ])
...
class _SentinelTokenStoppingCriteria(transformers.StoppingCriteria):

    def __init__(self, sentinel_token_ids: torch.LongTensor,
                 starting_idx: int):
        transformers.StoppingCriteria.__init__(self)
        self.sentinel_token_ids = sentinel_token_ids
        self.starting_idx = starting_idx

    def __call__(self, input_ids: torch.LongTensor,
                 _scores: torch.FloatTensor) -> bool:
        for sample in input_ids:
            trimmed_sample = sample[self.starting_idx:]
            # Can't unfold, output is still too tiny. Skip.
            if trimmed_sample.shape[-1] < self.sentinel_token_ids.shape[-1]:
                continue

            for window in trimmed_sample.unfold(
                    0, self.sentinel_token_ids.shape[-1], 1):
                if torch.all(torch.eq(self.sentinel_token_ids, window)):
                    return True
        return False
  3. generate
model_output = model.generate(stopping_criteria=stopping_criteria_list, 
                                **tokenized_items, **generation_settings,
                                pad_token_id=tokenizer.eos_token_id)
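As a quick sanity check (not part of the original report), the sliding-window matching in _SentinelTokenStoppingCriteria above can be exercised on its own with made-up token ids, independent of any model:

import torch

sentinel = torch.tensor([[13]])  # pretend id 13 is the stop token "\n"
criteria = _SentinelTokenStoppingCriteria(sentinel_token_ids=sentinel, starting_idx=3)

# The first 3 ids play the role of the prompt; the "generated" part [7, 13, 9]
# contains id 13, so the criteria should return True and generation would stop.
ids = torch.tensor([[5, 6, 8, 7, 13, 9]])
print(criteria(ids, None))  # True

# Without the sentinel id in the generated part, it returns False.
ids = torch.tensor([[5, 6, 8, 7, 12, 9]])
print(criteria(ids, None))  # False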

Expected behavior

Generation should stop once \n has been generated.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 23 (7 by maintainers)

Most upvoted comments

Hey @poohzaza166 👋

I had a look at your snippet, and the problem does not stem from the stopping criteria nor from the llama model itself, but rather from how the tokenizer works. It also doesn’t seem to be a bug. My recommendation would be to design the stopping criteria from the token ids, not from raw text 😃

See this example:

from transformers import LlamaTokenizer
import transformers
import torch


tokenizer = LlamaTokenizer.from_pretrained('huggyllama/llama-7b')


class _SentinelTokenStoppingCriteria(transformers.StoppingCriteria):

    def __init__(self, sentinel_token_ids: torch.LongTensor,
                 starting_idx: int):
        transformers.StoppingCriteria.__init__(self)
        self.sentinel_token_ids = sentinel_token_ids
        self.starting_idx = starting_idx

    def __call__(self, input_ids: torch.LongTensor, _scores: torch.FloatTensor) -> bool:
        for sample in input_ids:
            trimmed_sample = sample[self.starting_idx:]
            # Can't unfold, output is still too tiny. Skip.
            if trimmed_sample.shape[-1] < self.sentinel_token_ids.shape[-1]:
                continue

            for window in trimmed_sample.unfold(0, self.sentinel_token_ids.shape[-1], 1):
                if torch.all(torch.eq(self.sentinel_token_ids, window)):
                    return True
        return False


sentinel_token_ids = tokenizer("pooh:", add_special_tokens=False, return_tensors="pt").input_ids.to("cuda")
print(sentinel_token_ids)

stopping_criteria_list = transformers.StoppingCriteriaList([
    _SentinelTokenStoppingCriteria(sentinel_token_ids=sentinel_token_ids, starting_idx=0)
])

test_input_1 = """This is a test.\npooh: potato."""
test_input_ids = tokenizer(test_input_1, add_special_tokens=False, return_tensors="pt").input_ids.to("cuda")
print(stopping_criteria_list(test_input_ids, None))

test_input_2 = """This is a test. pooh: potato."""
test_input_ids = tokenizer(test_input_2, add_special_tokens=False, return_tensors="pt").input_ids.to("cuda")
print(stopping_criteria_list(test_input_ids, None))

@oobabooga Those issues will be fixed by #22402

I can reproduce the issue. Here is some additional code for testing:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('models/llama-7b/')

>>> tokenizer.encode('\nYou:', add_special_tokens=False)
[29871, 13, 3492, 29901]

>>> tokenizer.decode([29871, 13, 3492, 29901])
' \nYou:'

>>> tokenizer.decode([13, 3492, 29901])
' \nYou:'

The extra space token (29871) shows up everywhere. Also,

>>> tokenizer.encode(' ', add_special_tokens=False)
[259]

>>> tokenizer.decode([259])
'  ' # two spaces

>>> tokenizer.decode([29871]) 
' ' # one space

If you encode a space, it becomes id 259 instead of 29871. And if you decode [259], the result is two spaces.

Very confusing behavior overall.
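A common workaround (not mentioned in this thread) is to tokenize the stop string behind a throwaway prefix and then slice the prefix tokens off, so the sentinel ids match what actually appears inside a generated sequence instead of carrying the standalone 29871 prefix token. The helper below (stop_string_to_ids) is hypothetical, just to illustrate the idea:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('models/llama-7b/')

def stop_string_to_ids(tokenizer, text, prefix="x"):
    # Hypothetical helper: tokenize the stop string in context and drop the
    # tokens belonging to the throwaway prefix, so no leading 29871 is kept.
    prefix_ids = tokenizer.encode(prefix, add_special_tokens=False)
    full_ids = tokenizer.encode(prefix + text, add_special_tokens=False)
    return full_ids[len(prefix_ids):]

print(stop_string_to_ids(tokenizer, "\nYou:"))  # e.g. [13, 3492, 29901], no leading 29871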