transformers: ByT5: problem with tokenizer.decode()

Environment info

  • transformers version: 4.11.0
  • Platform: Google Colab
  • Python version: 3.7.12
  • Using GPU in script?: NO
  • Using distributed or parallel set-up in script?: NO

Who can help

ByT5: @patrickvonplaten Documentation: @sgugger

Information

Model I am using: google/byt5-small (the problem is the same with google/byt5-base).

To reproduce

See this notebook, which shows the problem when using google/byt5-small from the Hugging Face model hub together with the tokenizer.decode() method on transformers version 4.11.0.

The problem does not appear with transformers version 4.9.2, for example.

from transformers import T5ForConditionalGeneration, ByT5Tokenizer

model_checkpoint = 'google/byt5-small'
model = T5ForConditionalGeneration.from_pretrained(model_checkpoint)
tokenizer = ByT5Tokenizer.from_pretrained(model_checkpoint)

texts = ["Life is like a box of chocolates.", "Today is Monday."]

for text in texts:
  inputs = tokenizer(text, padding="longest", return_tensors="pt")
  output = model.generate(**inputs)
  print(tokenizer.decode(output[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))

Error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-8-6f8451a23561> in <module>()
      6       output[0],
      7       skip_special_tokens=True,
----> 8       clean_up_tokenization_spaces=True
      9       )
     10   )

2 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/byt5/tokenization_byt5.py in convert_tokens_to_string(self, tokens)
    238                 tok_string = bytes([ord(token)])
    239             bstring += tok_string
--> 240         string = bstring.decode("utf-8")
    241         return string
    242 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
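
For context, here is a minimal sketch (outside transformers) of why such a byte trips the decoder, assuming ByT5's mapping of ids ≥ 3 to the raw byte id − 3:

# A generated id of 258 corresponds to byte 0xFF, which is not a valid UTF-8 start byte.
raw = bytes([258 - 3])  # b'\xff'
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte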

Expected behavior

Two decoded strings as the output of the ByT5 model.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 18 (14 by maintainers)

Most upvoted comments

Thanks, @Narsil.

It seems to be the case! I re-trained the models and they work perfectly fine now, with good BLEU and CER scores 😃

@versae Is it possible that you fed the models Unicode codepoints during training instead of UTF-8 encoded bytes? It looks like it, but I can't be sure. Since I think most accented Spanish letters are still below 255, you might not have encountered any issue and been able to train your model just fine.
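
As a quick illustration of the difference (a sketch, not your training code):

# Codepoints vs. UTF-8 bytes for an accented character.
ch = "ó"
print(ord(ch))                   # 243 -> a single Unicode codepoint, still below 255
print(list(ch.encode("utf-8")))  # [195, 179] -> two UTF-8 bytes, which is what ByT5 expects
# A model trained on codepoints can emit byte values that do not form valid UTF-8,
# which would produce exactly the UnicodeDecodeError above.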

Just to make sure, I tested that the ByT5 tokenizer encodes presunción with the correct encoding:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')
tokenizer.encode('presunción')
>>> [115, 117, 104, 118, 120, 113, 102, 108, 198, 182, 113, 1]
>>> # 198 and 182 are the UTF-8 bytes 195 and 179 of "ó" shifted by the tokenizer's offset of 3, so it works

If that’s the case, then the good news is you don’t necessarily need to retrain the model, but you may need to override this function with your fix. Something along the lines of:

import types

def convert_tokens_to_string(self, tokens):
    """Converts a sequence of tokens (string) into a single string."""
    bstring = b""
    for token in tokens:
        if token in self.special_tokens_decoder:
            tok_string = self.special_tokens_decoder[token].encode("utf-8")
        elif token in self.added_tokens_decoder:
            tok_string = self.added_tokens_decoder[token].encode("utf-8")
        elif token in self.special_tokens_encoder:
            tok_string = token.encode("utf-8")
        elif token in self.added_tokens_encoder:
            tok_string = token.encode("utf-8")
        else:
            # Encode the token as UTF-8 instead of the original bytes([ord(token)]).
            tok_string = token.encode("utf-8")
        bstring += tok_string
    # Ignore invalid byte sequences instead of raising UnicodeDecodeError.
    string = bstring.decode("utf-8", errors="ignore")
    return string

# Bind the override to the tokenizer instance so `self` is passed correctly.
tokenizer.convert_tokens_to_string = types.MethodType(convert_tokens_to_string, tokenizer)
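
A quick sanity check of the override (a sketch; the exact output depends on your model):

inputs = tokenizer("presunción", return_tensors="pt")
output = model.generate(**inputs)
# Should no longer raise: invalid byte sequences are silently dropped instead.
print(tokenizer.decode(output[0], skip_special_tokens=True))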

Keep in mind:

1. This is a dirty hack.
2. It might not be the core of the issue (it could be a mistrained model, or some other error at training time). If it's not the core issue, this fix might just be hiding the true culprit and leading to more errors downstream.
3. You have now effectively broken your tokenizer, since it won't encode the same things it decodes.

But it should do the job for your specific purpose.

If you could also provide a link/script to how you trained it might provide more insights into what went wrong.

If Google did it, then let's do it.

To weigh in on this discussion, I wanted to reiterate the points raised by @piegu:

it is not possible anymore:

  • to use, for example, model.generate() with a ByT5 model (because it will fail)
  • to fine-tune a ByT5 model (because when evaluating metrics it will call tokenizer.decode(), which will fail)

This means it would always be required to overwrite the evaluate function when using Seq2SeqTrainer in combination with predict_with_generate, unless the decode_bytes option is also handled directly in the Trainer/generate implementation (which creates additional overhead).
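
For reference, one way to sidestep this today is to bypass tokenizer.decode() in compute_metrics entirely. This is only a sketch; it assumes ByT5's mapping of ids ≥ 3 to the raw byte id − 3, and lenient_byt5_decode is a made-up helper name:

import numpy as np

def lenient_byt5_decode(ids):
    # Keep only the byte-valued ids (3..258), shift them back to raw bytes,
    # and ignore invalid UTF-8 instead of raising.
    return bytes(int(i) - 3 for i in ids if 3 <= i < 259).decode("utf-8", errors="ignore")

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    labels = np.where(labels != -100, labels, 0)  # un-mask padded label positions
    pred_str = [lenient_byt5_decode(p) for p in preds]
    label_str = [lenient_byt5_decode(l) for l in labels]
    # ... compute BLEU / CER on pred_str and label_str here ...
    return {}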

Since it is required to pass a tokenizer in any case, I would prefer being able to choose directly through the tokenizer whether to ignore errors or not. I agree that it would have to be quite visible, but even in the T5 repository's implementation this behavior is not very obvious (reference issue), yet errors are ignored by default.

As to this point:

So checking two bytes objects was probably the way it was done, as this is always possible. Take the generated output, convert it to bytes, take the expected string, convert it to bytes, and compare them.

I don’t see any indication that the evaluation was done on bytes objects instead of strings, as there seem to be no modifications on top of the vanilla T5 modeling from their own repository.