transformers: Add missing tokenizer test files [šŸ— in progress]

šŸš€ Add missing tokenizer test files

Several tokenizers currently have no associated tests. I think that adding the test file for one of these tokenizers could be a very good way to make a first contribution to transformers.

Tokenizers concerned

not yet claimed

none

claimed

with an ongoing PR

none

with an accepted PR

How to contribute?

  1. Claim a tokenizer

    a. Choose a tokenizer from the list of ā€œnot yet claimedā€ tokenizers

    b. Check that no one in the comments on this issue has already indicated that they are taking this tokenizer

    c. Put a message in the issue that you are handling this tokenizer

  2. Create a local development setup (if you have not already done it)

    I refer you to the ā€œstart-contributing-pull-requestsā€ section of the Contributing guidelines, where everything is explained. Donā€™t be put off by step 5: for this contribution, you will only need to run the tests you add locally.

  3. Follow the instructions in the README inside the templates/adding_a_missing_tokenization_test folder to generate the template for the new test file with cookiecutter. At the end of the template generation, donā€™t forget to move the new test file into the sub-folder of the tests folder named after the model you are adding the test for. Some details about the questionnaire, assuming that the lowercase model name is brand_new_bert:

    • ā€œhas_slow_classā€: Set true there is a tokenization_brand_new_bert.py file in the folder src/transformers/models/brand_new_bert
    • ā€œhas_fast_classā€: Set true there is a tokenization_brand_new_bert_fast.py file the folder src/transformers/models/brand_new_bert.
    • ā€œslow_tokenizer_use_sentencepieceā€: Set true if the tokenizer defined in the tokenization_brand_new_bert.py file uses sentencepiece. If this tokenizer donā€™t have a ``tokenization_brand_new_bert.py` file set False.
  4. Complete the setUp method in the generated test file; you can take inspiration from how it is done for the other tokenizers.
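    As an illustration, a setUp method often follows the shape sketched below. This is a hypothetical example, not code from the repository: the class name and the toy vocabulary are placeholders, and a real test file would subclass the common tokenizer test mixin rather than plain unittest.TestCase.

    ```python
    # Hypothetical sketch of a setUp method for a new tokenizer test file.
    # "BrandNewBertTokenizationTest" and the toy vocab are placeholders.
    import os
    import tempfile
    import unittest


    class BrandNewBertTokenizationTest(unittest.TestCase):
        def setUp(self):
            super().setUp()
            # A tiny toy vocabulary is enough for the common tests; the
            # existing test files for other models build theirs the same way.
            vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "low", "er", "lowest"]
            self.tmpdirname = tempfile.mkdtemp()
            self.vocab_file = os.path.join(self.tmpdirname, "vocab.txt")
            with open(self.vocab_file, "w", encoding="utf-8") as f:
                f.write("\n".join(vocab_tokens))
    ```

    The real test file then instantiates the slow (and, if available, fast) tokenizer from self.tmpdirname inside the common tests.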

  5. Try to run all the added tests. Some tests may not pass, and it is important to understand why: sometimes a common test is not suited to a particular tokenizer, and sometimes the tokenizer itself has a bug. You can also look at what is done in similar tokenizer tests; if you run into big problems or donā€™t know what to do, we can discuss it in the PR (step 7).

  6. (Bonus) Try to get a good understanding of the tokenizer and add custom tests for it
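    For instance, a custom test might pin down splitting behaviour specific to the tokenizer. The snippet below is a self-contained illustration of the shape such a test takes: the toy_tokenize function is a minimal greedy longest-match stand-in (WordPiece-like, without the ā€œ##ā€ continuation markers), not the real tokenizer implementation.

    ```python
    # Illustration only: a minimal greedy longest-match tokenizer standing in
    # for the real slow tokenizer, so the shape of a custom test is clear.
    import unittest


    def toy_tokenize(text, vocab):
        """Greedily split each whitespace-separated word into the longest
        vocabulary matches; emit [UNK] when no prefix of the word matches."""
        tokens = []
        for word in text.split():
            start = 0
            while start < len(word):
                end = len(word)
                while end > start and word[start:end] not in vocab:
                    end -= 1
                if end == start:  # no match at all: emit an unknown token
                    tokens.append("[UNK]")
                    break
                tokens.append(word[start:end])
                start = end
        return tokens


    class CustomTokenizerTest(unittest.TestCase):
        def test_longest_match(self):
            vocab = {"low", "er", "lowest"}
            self.assertEqual(toy_tokenize("lower lowest", vocab), ["low", "er", "lowest"])
    ```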

  7. Open a PR with the new test file added; remember to fill in the PR title and message body (referencing this issue) and request a review from @LysandreJik and @SaulLu.

Tips

Do not hesitate to read the questions / answers in this issue šŸ“°

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 3
  • Comments: 58 (36 by maintainers)

Most upvoted comments

Hi @SaulLu, Iā€™d be happy to work on LED - Thanks!!

Hi @logvinata yes, I am still going to work on it. I was off for a while but will soon open a PR on it.

@logvinata you can take splinter. Iā€™m not working on it anymore.

Hey all! šŸ¤— If you donā€™t find a PR open for any model, feel free to open one. If a PR has been inactive for quite some time, just ping the author to make sure they are alright with you taking over, or to check whether they still want to contribute! (If itā€™s been inactive for more than 2 months, I think itā€™s alright to just work on it.) šŸ‘šŸ»

hey @ArthurZucker, Iā€™m happy to have a look at contributing to a few of these. Iā€™ll start off with gpt_neox šŸ™‚

@ArthurZucker thanks for your reply. I will start working on RemBert tests.

Hey @y3sar thanks for wanting to contribute. I think that the RemBert tests PR was close, you can probably take that over if you want! Other tests that might be missing:

  • ./tests/models/flaubert
  • ./tests/models/convbert
  • ./tests/models/splinter
  • ./tests/models/gpt_neox
  • ./tests/models/rembert

Yeah sure @danhphan Thanks.

Thanks @SaulLu, Iā€™m working on this RemBert šŸ˜ƒ

Hi @SaulLu, I would like to work on RetriBert.

Hi @SaulLu, I am happy to write tests for RemBert. Thanks.

Yes, this helps!

Iā€™d like to work on ConvBert.

Hi, Iā€™m happy to take MobileBert

Hi @tgadeliya ,

Thanks for the update!

But I think I have one doubt, that you can resolve. Are you anticipating from Longformer tests to have different toy tokenizer example than in RoBERTa tests? Or maybe I should write my own tests from scratch?

In your case, I wouldnā€™t be surprised if Longformer uses the same tokenizer as RoBERTa, in which case it seems legitimate to use the same toy tokenizer. One check you can do to confirm this hypothesis is to compare the vocabularies of the ā€œmainā€ checkpoints of both models:

```shell
!wget https://huggingface.co/allenai/longformer-base-4096/raw/main/merges.txt
!wget https://huggingface.co/allenai/longformer-base-4096/raw/main/vocab.json
!wget https://huggingface.co/roberta-base/raw/main/merges.txt
!wget https://huggingface.co/roberta-base/raw/main/vocab.json

!diff merges.txt merges.txt.1
!diff vocab.json vocab.json.1
```

Turns out the result confirms it!

Hi @anmolsjoshi, @tgadeliya, @Rajathbharadwaj and @farahdian,

Just a quick message to see how things are going for you and if you have any problems. If you do, please share them! šŸ¤—

Thanks so much @SaulLu, turns out it was due to not recognizing my installed cookiecutter, so I sorted it out there. šŸ‘

@faiazrahman , thank you so much for working on this! Regarding your issue, if youā€™re in the tests/splinter folder, can you try to run cookiecutter ../../templates/adding_a_missing_tokenization_test/ ?

You should have a newly created folder cookiecutter-template-BrandNewBERT inside tests/splinter. šŸ™‚

If thatā€™s the case, youā€™ll need after to do something like:

```shell
mv cookiecutter-template-BrandNewBERT/test_tokenization_brand_new_bert.py .
rm -r cookiecutter-template-BrandNewBERT/
```

Keep me posted šŸ˜„

Is anyone else encountering this error with the cookiecutter command? My dev environment setup seemed to have gone all fineā€¦ Also, I had run the command inside the tests/splinter directory.

[Screenshot: cookiecutter error output, 2022-04-11]

Hi, first time contributor here-could I add tests for Splinter?

Hey, I would like to contribute for Electra. Pointers, please!

@SaulLu I would like to add tests for Flaubert

Hi, I would like to add tests for Longformer tokenizer