transformers: Add missing tokenizer test files [šŸ— in progress]

šŸš€ Add missing tokenizer test files

Several tokenizers currently have no associated tests. I think that adding the test file for one of these tokenizers could be a very good way to make a first contribution to transformers.

Tokenizers concerned

not yet claimed

none

claimed

with an ongoing PR

none

with an accepted PR

How to contribute?

  1. Claim a tokenizer

    a. Choose a tokenizer from the list of ā€œnot yet claimedā€ tokenizers

    b. Check that no one in the comments on this issue has already indicated that they are taking this tokenizer

    c. Put a message in the issue that you are handling this tokenizer

  2. Create a local development setup (if you have not already done it)

    I refer you to the ā€œstart-contributing-pull-requestsā€ section of the Contributing guidelines, where everything is explained. Donā€™t be put off by step 5: for this contribution, you will only need to run the tests you add locally.

  3. Follow the instructions in the README inside the templates/adding_a_missing_tokenization_test folder to generate the template for the new test file with cookiecutter. At the end of the template generation, donā€™t forget to move the new test file into the sub-folder of the tests folder named after the model you are adding the test for. Some details about the questionnaire, assuming that the lowercase model name is brand_new_bert:

    • ā€œhas_slow_classā€: Set true there is a tokenization_brand_new_bert.py file in the folder src/transformers/models/brand_new_bert
    • ā€œhas_fast_classā€: Set true there is a tokenization_brand_new_bert_fast.py file the folder src/transformers/models/brand_new_bert.
    • ā€œslow_tokenizer_use_sentencepieceā€: Set true if the tokenizer defined in the tokenization_brand_new_bert.py file uses sentencepiece. If this tokenizer donā€™t have a ``tokenization_brand_new_bert.py` file set False.
  4. Complete the setUp method in the generated test file; you can take inspiration from how it is done for the other tokenizers.
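    As an illustration, a setUp method often follows the shape sketched below. This is a hypothetical example, not code from the repository: the class name and the toy vocabulary are placeholders, and a real test file would subclass the common tokenizer test mixin rather than plain unittest.TestCase.

    ```python
    # Hypothetical sketch of a setUp method for a new tokenizer test file.
    # "BrandNewBertTokenizationTest" and the toy vocab are placeholders.
    import os
    import tempfile
    import unittest


    class BrandNewBertTokenizationTest(unittest.TestCase):
        def setUp(self):
            super().setUp()
            # A tiny toy vocabulary is enough for the common tests; the
            # existing test files for other models build theirs the same way.
            vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "low", "er", "lowest"]
            self.tmpdirname = tempfile.mkdtemp()
            self.vocab_file = os.path.join(self.tmpdirname, "vocab.txt")
            with open(self.vocab_file, "w", encoding="utf-8") as f:
                f.write("\n".join(vocab_tokens))
    ```

    The real test file then instantiates the slow (and, if available, fast) tokenizer from self.tmpdirname inside the common tests.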

  5. Try to run all the added tests. Some tests may not pass, and it is important to understand why: sometimes a common test is not suited to a particular tokenizer, and sometimes the tokenizer itself has a bug. You can also look at what is done in similar tokenizer tests; if you run into big problems or donā€™t know what to do, we can discuss it in the PR (step 7).

  6. (Bonus) Try to get a good understanding of the tokenizer and add custom tests for it
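    For instance, a custom test might pin down splitting behaviour specific to the tokenizer. The snippet below is a self-contained illustration of the shape such a test takes: the toy_tokenize function is a minimal greedy longest-match stand-in (WordPiece-like, without the ā€œ##ā€ continuation markers), not the real tokenizer implementation.

    ```python
    # Illustration only: a minimal greedy longest-match tokenizer standing in
    # for the real slow tokenizer, so the shape of a custom test is clear.
    import unittest


    def toy_tokenize(text, vocab):
        """Greedily split each whitespace-separated word into the longest
        vocabulary matches; emit [UNK] when no prefix of the word matches."""
        tokens = []
        for word in text.split():
            start = 0
            while start < len(word):
                end = len(word)
                while end > start and word[start:end] not in vocab:
                    end -= 1
                if end == start:  # no match at all: emit an unknown token
                    tokens.append("[UNK]")
                    break
                tokens.append(word[start:end])
                start = end
        return tokens


    class CustomTokenizerTest(unittest.TestCase):
        def test_longest_match(self):
            vocab = {"low", "er", "lowest"}
            self.assertEqual(toy_tokenize("lower lowest", vocab), ["low", "er", "lowest"])
    ```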

  7. Open a PR with the new test file added; remember to fill in the PR title and message body (referencing this issue) and request a review from @LysandreJik and @SaulLu.

Tips

Do not hesitate to read the questions / answers in this issue šŸ“°

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 3
  • Comments: 58 (36 by maintainers)

Most upvoted comments

Hi @SaulLu, Iā€™d be happy to work on LED - Thanks!!

Hi @logvinata yes, I am still going to work on it. I was off for a while but will soon open a PR on it.

@logvinata you can take splinter. Iā€™m not working on it anymore.

Hey all! šŸ¤— If you donā€™t find a PR open for any model, feel free to open one. If a PR has been inactive for quite some time, just ping the author to make sure they are alright with you taking over, or to check whether they still want to contribute! (If itā€™s been inactive for more than 2 months, I think itā€™s alright to just work on it.) šŸ‘šŸ»

hey @ArthurZucker, Iā€™m happy to have a look at contributing to a few of these. Iā€™ll start off with gpt_neox šŸ™‚

@ArthurZucker thanks for your reply. I will start working on RemBert tests.

Hey @y3sar thanks for wanting to contribute. I think that the RemBert tests PR was close, you can probably take that over if you want! Other tests that might be missing:

  • ./tests/models/flaubert
  • ./tests/models/convbert
  • ./tests/models/splinter
  • ./tests/models/gpt_neox
  • ./tests/models/rembert

Yeah sure @danhphan Thanks.

Thanks @SaulLu, Iā€™m working on this RemBert šŸ˜ƒ

Hi @SaulLu, I would like to work on RetriBert.

Hi @SaulLu, I am happy to write tests for RemBert. Thanks.

Yes, this helps!

Iā€™d like to work on ConvBert.

Hi, Iā€™m happy to take MobileBert

Hi @tgadeliya ,

Thanks for the update!

But I think I have one doubt, that you can resolve. Are you anticipating from Longformer tests to have different toy tokenizer example than in RoBERTa tests? Or maybe I should write my own tests from scratch?

In your case, I wouldnā€™t be surprised if Longformer uses the same tokenizer as RoBERTa, in which case it seems legitimate to use the same toy tokenizer. One check you can do to confirm this hypothesis is to compare the vocabularies of the ā€œmainā€ checkpoints of both models:

```shell
!wget https://huggingface.co/allenai/longformer-base-4096/raw/main/merges.txt
!wget https://huggingface.co/allenai/longformer-base-4096/raw/main/vocab.json
!wget https://huggingface.co/roberta-base/raw/main/merges.txt
!wget https://huggingface.co/roberta-base/raw/main/vocab.json

!diff merges.txt merges.txt.1
!diff vocab.json vocab.json.1
```

Turns out the result confirms it!

Hi @anmolsjoshi, @tgadeliya, @Rajathbharadwaj and @farahdian,

Just a quick message to see how things are going for you and if you have any problems. If you do, please share them! šŸ¤—

Thanks so much @SaulLu, turns out it was due to not recognizing my installed cookiecutter, so I sorted it out there. šŸ‘

@faiazrahman , thank you so much for working on this! Regarding your issue, if youā€™re in the tests/splinter folder, can you try to run cookiecutter ../../templates/adding_a_missing_tokenization_test/ ?

You should have a newly created folder cookiecutter-template-BrandNewBERT inside tests/splinter. šŸ™‚

If thatā€™s the case, youā€™ll need after to do something like:

```shell
mv cookiecutter-template-BrandNewBERT/test_tokenization_brand_new_bert.py .
rm -r cookiecutter-template-BrandNewBERT/
```

Keep me posted šŸ˜„

Is anyone else encountering this error with the cookiecutter command? My dev environment setup seemed to have gone all fineā€¦ Also, I had run the command inside the tests/splinter directory.

[Screenshot: cookiecutter error output, 2022-04-11]

Hi, first time contributor here-could I add tests for Splinter?

Hey, I would like to contribute for Electra. Pointers, please!

@SaulLu I would like to add tests for Flaubert

Hi, I would like to add tests for Longformer tokenizer