transformers: Add missing tokenizer test files [:building_construction: in progress]
Several tokenizers currently have no associated tests. I think that adding the test file for one of these tokenizers could be a very good way to make a first contribution to transformers.
Tokenizers concerned

Not yet claimed:
none

Claimed:
- LED @nnlnr
- Flaubert @anmolsjoshi
- Electra @Rajathbharadwaj
- ConvBert @elusenji
- RemBert @IMvision12
- Splinter @ashwinjohn3

With an ongoing PR:
none

With an accepted PR:
How to contribute?
1. Claim a tokenizer

   a. Choose a tokenizer from the list of "not yet claimed" tokenizers
   b. Check that no one in the messages for this issue has indicated that they care about this tokenizer
   c. Put a message in the issue that you are handling this tokenizer

2. Create a local development setup (if you have not already done it)

   I refer you to the section "start-contributing-pull-requests" of the Contributing guidelines, where everything is explained. Don't be afraid of step 5: for this contribution you will only need to run locally the tests you add.

3. Follow the instructions in the README inside the `templates/adding_a_missing_tokenization_test` folder to generate the template with cookiecutter for the new test file you will be adding. Don't forget, at the end of the template generation, to move the new test file into the sub-folder named after the model for which you are adding the test file in the `tests` folder. Some details about the questionnaire, assuming that the lowercase model name is `brand_new_bert`:

   - "has_slow_class": Set true if there is a `tokenization_brand_new_bert.py` file in the folder `src/transformers/models/brand_new_bert`.
   - "has_fast_class": Set true if there is a `tokenization_brand_new_bert_fast.py` file in the folder `src/transformers/models/brand_new_bert`.
   - "slow_tokenizer_use_sentencepiece": Set true if the tokenizer defined in the `tokenization_brand_new_bert.py` file uses sentencepiece. If this tokenizer doesn't have a `tokenization_brand_new_bert.py` file, set false.

4. Complete the `setUp` method in the generated test file; you can take inspiration from how it is done for the other tokenizers (see the sketch after this list).

5. Try to run all the added tests. It is possible that some tests will not pass, so it will be important to understand why: sometimes the common test is not suited for a tokenizer, and sometimes a tokenizer can have a bug. You can also look at what is done in similar tokenizer tests; if there are big problems or you don't know what to do, we can discuss this in the PR (step 7).

6. (Bonus) Try to get a good understanding of the tokenizer so you can add custom tests for it.

7. Open a PR with the new test file added. Remember to fill in the PR title and message body (referencing this issue) and request a review from @LysandreJik and @SaulLu.
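To give a concrete idea of what steps 4 and 5 ask for, here is a minimal sketch loosely modelled on the existing WordPiece-based tokenizer tests (the BERT one, for example). Everything named `brand_new_bert` / `BrandNewBertTokenizer` is a placeholder, as are the toy vocabulary and the import depth; the tokenizer you claim may need a completely different `setUp` (e.g. a sentencepiece model file instead of a `vocab.txt`), so treat the closest existing test file as the real reference.

```python
# Hypothetical sketch of tests/brand_new_bert/test_tokenization_brand_new_bert.py
# (all names are placeholders; adapt them to the tokenizer you claimed)
import os
import unittest

from transformers import BrandNewBertTokenizer  # placeholder class name

from ..test_tokenization_common import TokenizerTesterMixin  # import depth depends on where the test file lives


class BrandNewBertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    tokenizer_class = BrandNewBertTokenizer
    test_rust_tokenizer = False  # set to True only if a *_fast tokenizer exists ("has_fast_class")

    def setUp(self):
        super().setUp()  # TokenizerTesterMixin creates self.tmpdirname
        # Write a tiny toy vocabulary that the common tests can load (WordPiece-style example).
        vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "low", "##er", "want", "##ed", "runn", "##ing"]
        self.vocab_file = os.path.join(self.tmpdirname, "vocab.txt")
        with open(self.vocab_file, "w", encoding="utf-8") as vocab_writer:
            vocab_writer.write("\n".join(vocab_tokens))

    def get_tokenizer(self, **kwargs):
        # Load the tokenizer from the toy files written in setUp.
        return self.tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
```

With the `setUp` filled in, you can run just this file locally with `python -m pytest tests/brand_new_bert/test_tokenization_brand_new_bert.py -v` (adjust the path to the folder you created) and start investigating any failures as described in step 5.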
Tips
Do not hesitate to read the questions / answers in this issue.
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 3
- Comments: 58 (36 by maintainers)
Commits related to this issue
- [Splinter] Fixes #16627 by implementing the test cases for splinter — committed to nileshkokane01/transformers by deleted user 7 months ago
Hi @SaulLu, I'd be happy to work on LED - Thanks!!

Hi @logvinata yes, I am still going to work on it. I was off for a while but will soon open a PR on it.
@logvinata you can take splinter. I'm not working on it anymore.
Hey all! If you don't find a PR open for any model, feel free to open one. If a PR is inactive for quite some time, just ping the author to make sure they are alright with you taking over or whether they still want to contribute! (If it's inactive for more than 2 months, I think it's alright to just work on it.)
hey @ArthurZucker, I'm happy to have a look at contributing to a few of these. I'll start off with gpt_neox

@ArthurZucker thanks for your reply. I will start working on RemBert tests.
Hey @y3sar thanks for wanting to contribute. I think that the RemBert tests PR was close, you can probably take that over if you want! Other tests that might be missing:
Yeah sure @danhphan Thanks.
Thanks @SaulLu, I'm working on this: RemBert

Hi @SaulLu, I would like to work on RetriBert.

Hi @SaulLu, I am happy to write tests for RemBert. Thanks.

Yes, this helps!
I'd like to work on ConvBert.
Hi, I'm happy to take MobileBert

Hi @tgadeliya,
Thanks for the update!
In your case, I wouldn't be surprised if Longformer uses the same tokenizer as RoBERTa. In this case, it seems legitimate to use the same toy tokenizer. Maybe the only check you can do to confirm this hypothesis is comparing the vocabularies of the "main" checkpoints of both models:
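As a minimal sketch of such a check (assuming the usual `roberta-base` and `allenai/longformer-base-4096` checkpoints are the "main" ones), you could compare the two vocabularies directly:

```python
# Compare the vocabularies of the two "main" checkpoints.
from transformers import AutoTokenizer

roberta = AutoTokenizer.from_pretrained("roberta-base")
longformer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

# Identical token-to-id mappings would confirm that Longformer reuses the RoBERTa tokenizer.
print(roberta.get_vocab() == longformer.get_vocab())
```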
Turns out the result confirms it!
Hi @anmolsjoshi, @tgadeliya, @Rajathbharadwaj and @farahdian,
Just a quick message to see how things are going for you and if you have any problems. If you do, please share them!
Thanks so much @SaulLu, turns out it was due to not recognizing my installed cookiecutter, so I sorted it out there.
@faiazrahman, thank you so much for working on this! Regarding your issue, if you're in the `tests/splinter` folder, can you try to run `cookiecutter ../../templates/adding_a_missing_tokenization_test/`?

You should have a newly created folder `cookiecutter-template-BrandNewBERT` inside `tests/splinter`.

If that's the case, you'll then need to do something like:

Keep me posted
Is anyone else encountering this error with the cookiecutter command? My dev environment setup seemed to have gone all fine... Also, I had run the command inside the `tests/splinter` directory.

Hi, first time contributor here - could I add tests for Splinter?

Hey, I would like to contribute for Electra, pointers please!

@SaulLu I would like to add tests for Flaubert
Hi, I would like to add tests for Longformer tokenizer