transformers: Gemma-7b is not working properly. There is a logical bug somewhere.

Reopening issue about gemma-7b prediction values.

This issue is still not solved: the perplexity values of gemma-2b and gemma-7b are very different, with gemma-7b much worse (near random). WikiText-2 token perplexity for gemma-2b is ~21, while for gemma-7b it is a very large value, ~1e13.

I am not sure of the reason; it could be a problem with the implementation, the weights, or some embedding/tokenizer mismatch.

_Originally posted by @alisafaya in https://github.com/huggingface/transformers/issues/29181#issuecomment-1961539845_


Most upvoted comments

This is not related to the context size. Perplexity values close to 1.0 mean that the loss value is close to 0. I checked the script you shared, and it has a small bug:

input_ids[0] = 2 # give a bos token

This converts the whole input into a sequence of:

<bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos>

This is the reason for the very low perplexity. It should instead be:

input_ids[:, 0] = 2 # give a bos token
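
To see the difference, here is a minimal standalone sketch on a toy batch (not part of the original script):

import torch

# toy batch of shape (1, 8): one sequence of eight token ids
input_ids = torch.arange(10, 18).unsqueeze(0)

wrong = input_ids.clone()
wrong[0] = 2      # assigns to the entire first row: every position becomes 2
fixed = input_ids.clone()
fixed[:, 0] = 2   # assigns to the first column only: only the first token becomes 2

print(wrong)   # tensor([[2, 2, 2, 2, 2, 2, 2, 2]])
print(fixed)   # tensor([[ 2, 11, 12, 13, 14, 15, 16, 17]])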

The main issue seems to be related to the bos token. I identified two problems:

  • The 2B version works fine regardless of the bos token, whereas the 7B version does not work unless the bos token is present (see the sketch below).
  • The 2B version does not support a context size of 8192; it works fine with 4096. I did not try other values.
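
As a side note, the bos id does not have to be hard-coded as 2; it can be read from the tokenizer. A minimal sketch, assuming the standard AutoTokenizer attributes and a placeholder input string:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-7b")
print(tok.bos_token, tok.bos_token_id)  # '<bos>' and its id (2 for the Gemma tokenizer)

# when encoding with add_special_tokens=False, prepend the bos id explicitly
ids = tok("placeholder text", add_special_tokens=False).input_ids
ids = [tok.bos_token_id] + ids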

I updated the script as follows:

import torch
from tqdm import tqdm

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model_id = "google/gemma-7b"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer(test["text"], add_special_tokens=False) # use tokenizer parallelism
encodings.input_ids = torch.tensor([sum(encodings.input_ids, [])])  # flatten all documents into one long token sequence

max_length = 4096 
stride = 2048

seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    input_ids[:, 0] = 2 # force a bos token at the start of each window
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask the overlapping context so loss covers only the new tokens

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)
    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())
print("Perplexity:", ppl)

Now the token perplexity values are:

  • gemma-7b = 6.1250
  • gemma-2b = 7.7500
  • no-bos: gemma-2b = 8.1250
  • no-bos: gemma-7b = 8.0111e+08

This should be added to the documentation or fixed somehow in the configuration files. After that we can close this issue.

No, I do not.

Btw, token perplexity is not directly comparable across models with different tokenizers.

I advise using bits-per-char or negative log likelihood per character: sum the total loss over the whole test set and average over the number of characters or bytes.

For reference check the appendix of the Megatron blog here: https://nv-adlr.github.io/MegatronLM
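
A minimal sketch of that conversion, assuming total_nll_nats is the loss summed over all predicted tokens of the test set (not the per-window mean):

import math

def nll_per_char(total_nll_nats: float, text: str) -> float:
    # average negative log likelihood (in nats) per character of the test text
    return total_nll_nats / len(text)

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    # convert the summed NLL from nats to bits, then normalize by the UTF-8 byte count
    return total_nll_nats / math.log(2) / len(text.encode("utf-8"))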

On Wed, Feb 28, 2024, 21:09 Vincent Nguyen wrote:

Using the exact same setup, do you have the numbers for mistral7b and llama2-7B?


It’s very specific to Gemma, and more so to gemma-7b. We can have the tokenizer warn users if the bos_token is not set; otherwise a tip / warning in gemma.md should be good.
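
A rough sketch of what such a check could look like, written here as a hypothetical user-side helper rather than actual tokenizer code:

import warnings

def encode_with_bos_check(tokenizer, text, add_special_tokens=True):
    # hypothetical helper: warn if a Gemma encoding does not start with <bos>
    ids = tokenizer(text, add_special_tokens=add_special_tokens).input_ids
    if not ids or ids[0] != tokenizer.bos_token_id:
        warnings.warn(
            "gemma-7b is known to give near-random perplexity without a leading <bos> "
            "token; use add_special_tokens=True or prepend tokenizer.bos_token_id."
        )
    return ids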

add_special_tokens=False is the user explicitly disabling something.

Going from 1e13 to 1 seems pretty good already, no?