transformers: ByT5: problem with tokenizer.decode()
Environment info
- transformers version: 4.11.0
- Platform: Google Colab
- Python version: 3.7.12
- Using GPU in script?: NO
- Using distributed or parallel set-up in script?: NO
Who can help
ByT5: @patrickvonplaten Documentation: @sgugger
Information
Model I am using: google/byt5-small (the problem is the same with google/byt5-base).
To reproduce
See this notebook, which shows the problem when using google/byt5-small from the Hugging Face model hub with the tokenizer.decode() method and transformers version 4.11.0. The problem does not appear with transformers version 4.9.2, for example.
from transformers import T5ForConditionalGeneration, ByT5Tokenizer

model_checkpoint = 'google/byt5-small'
model = T5ForConditionalGeneration.from_pretrained(model_checkpoint)
tokenizer = ByT5Tokenizer.from_pretrained(model_checkpoint)

texts = ["Life is like a box of chocolates.", "Today is Monday."]
for text in texts:
    inputs = tokenizer(text, padding="longest", return_tensors="pt")
    output = model.generate(**inputs)
    print(tokenizer.decode(output[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
Error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-8-6f8451a23561> in <module>()
6 output[0],
7 skip_special_tokens=True,
----> 8 clean_up_tokenization_spaces=True
9 )
10 )
/usr/local/lib/python3.7/dist-packages/transformers/models/byt5/tokenization_byt5.py in convert_tokens_to_string(self, tokens)
238 tok_string = bytes([ord(token)])
239 bstring += tok_string
--> 240 string = bstring.decode("utf-8")
241 return string
242
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
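For context, the failure reduces to decoding a byte string that is not valid UTF-8; the same error can be reproduced in plain Python:

# 0xFF can never start a valid UTF-8 sequence, hence "invalid start byte"
bytes([0xFF]).decode("utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte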
Expected behavior
Two decoded strings as the output of ByT5.
About this issue
- State: closed
- Created 3 years ago
- Comments: 18 (14 by maintainers)
Thanks, @Narsil.
It seems to be the case! I re-trained the models and they work perfectly fine now, and with good BLEU and CER scores 😃
@versae Is it possible that you fed the models Unicode codepoints during training rather than UTF-8 encoded bytes? It looks like it, but I can't be sure. Since I think most accented Spanish letters are still below 255, you might not have encountered any issue and been able to train your model just fine.
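To make the codepoints-vs-bytes distinction concrete, here is a minimal illustration with the accented letter ó (non-ASCII characters are where the two schemes diverge):

# A single Unicode codepoint below 256...
ord("ó")                    # 243
# ...but two bytes once UTF-8 encoded
list("ó".encode("utf-8"))   # [195, 179], i.e. 0xC3 0xB3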
Just to make sure, I tested that the byt5 tokenizer would encode presunción with the correct encoding (first sketch below). If that's the case, then the good news is that you don't necessarily need to retrain the model, but you may need to override this function (convert_tokens_to_string, from the traceback above) with your fix. Something along the lines of the second sketch below.
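A minimal check along those lines (assuming ByT5's mapping of each UTF-8 byte b to token id b + 3, with ids 0-2 reserved for the pad/eos/unk special tokens):

from transformers import ByT5Tokenizer

tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")

text = "presunción"
ids = tokenizer(text)["input_ids"]

# Each UTF-8 byte maps to id byte + 3, and </s> is appended at the end.
expected = [b + 3 for b in text.encode("utf-8")] + [tokenizer.eos_token_id]
assert ids == expected

And a sketch of the override itself. The class name is mine; the body mirrors the byte-assembly loop visible in the traceback above, only adding errors="backslashreplace" to the final decode so invalid bytes become visible escapes instead of exceptions (errors="ignore" would drop them silently):

from transformers import ByT5Tokenizer

class LenientByT5Tokenizer(ByT5Tokenizer):
    def convert_tokens_to_string(self, tokens):
        bstring = b""
        for token in tokens:
            if token in self.all_special_tokens:
                # Special tokens such as </s> are real strings.
                bstring += token.encode("utf-8")
            else:
                # Regular ByT5 tokens are single characters whose
                # codepoint is the raw byte value (see the traceback).
                bstring += bytes([ord(token)])
        return bstring.decode("utf-8", errors="backslashreplace")

# Drop-in replacement for the original tokenizer:
tokenizer = LenientByT5Tokenizer.from_pretrained("google/byt5-small")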
Keep in mind:
1. This is a dirty hack.
2. It might not be the core of the issue (it could be a mistrained model, or some other error at training time). If it's not the core issue, this fix might just be hiding the true culprit and leading to more errors downstream.
3. You have now effectively broken your tokenizer, since it won't encode the same things it decodes.
But it should do the job for your specific purpose.
If you could also provide a link/script showing how you trained the model, it might give more insight into what went wrong.
If Google did it, then let's do it.
To weigh in on this discussion, I wanted to reiterate the points raised by @piegu:
This means that it would always be required to overwrite the evaluate function when using Seq2SeqTrainer in combination with predict_with_generate, unless the decode_bytes option is directly addressed in the Trainer/generate implementation as well (creating additional overhead).

Since it is required to pass a Tokenizer in any case, I would prefer the option to choose directly through the tokenizer whether to ignore errors or not. I agree that it would have to be quite visible, but even the T5 repository's implementation of this behavior is not very obvious (reference issue), and it ignores the errors by default.

As to this point:
I don't see any indication of evaluation on bytes objects instead of strings, as there seem to be no modifications on top of the vanilla T5 modeling from their own repository.

Update: this seems to be intended:
https://github.com/huggingface/transformers/commit/5c7789d4167064f7464b8801c7488a9a2878480a
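To make the "choose directly through the tokenizer" suggestion above concrete, here is a sketch of what such an option could look like. This is purely illustrative: decode_errors is a hypothetical attribute, not an existing transformers option.

from transformers import ByT5Tokenizer

class ConfigurableByT5Tokenizer(ByT5Tokenizer):
    # Hypothetical: expose the UTF-8 error-handling mode on the tokenizer
    # itself, so Seq2SeqTrainer / predict_with_generate need no changes.
    def __init__(self, *args, decode_errors="strict", **kwargs):
        super().__init__(*args, **kwargs)
        self.decode_errors = decode_errors  # "strict", "ignore", "backslashreplace", ...

    def convert_tokens_to_string(self, tokens):
        bstring = b""
        for token in tokens:
            if token in self.all_special_tokens:
                bstring += token.encode("utf-8")
            else:
                bstring += bytes([ord(token)])
        return bstring.decode("utf-8", errors=self.decode_errors)

tokenizer = ConfigurableByT5Tokenizer.from_pretrained(
    "google/byt5-small", decode_errors="ignore"
)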
Pinging @Narsil 😃