transformers: _batch_encode_plus() got an unexpected keyword argument 'is_pretokenized' using BertTokenizerFast

System Info

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["input_ids"]), training_set[0]["labels"]):
  print('{0:10}  {1}'.format(token, label))

The error I am getting is:
Traceback (most recent call last):
  File "C:\Users\1632613\Documents\Anit\NER_Trans\test.py", line 108, in <module>
    for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["input_ids"]), training_set[0]["labels"]):
  File "C:\Users\1632613\Documents\Anit\NER_Trans\test.py", line 66, in __getitem__
    encoding = self.tokenizer(sentence,
  File "C:\Users\1632613\AppData\Local\conda\conda\envs\ner\lib\site-packages\transformers\tokenization_utils_base.py", line 2477, in __call__
    return self.batch_encode_plus(
  File "C:\Users\1632613\AppData\Local\conda\conda\envs\ner\lib\site-packages\transformers\tokenization_utils_base.py", line 2668, in batch_encode_plus
    return self._batch_encode_plus(
TypeError: _batch_encode_plus() got an unexpected keyword argument 'is_pretokenized'
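For context on the error: the `is_pretokenized` keyword was renamed to `is_split_into_words` in transformers v3, and the old name was later removed, so recent versions raise this TypeError. The direct fix is to rename the keyword in the `self.tokenizer(sentence, ...)` call shown in the traceback. A minimal sketch of a version-agnostic workaround; `encode_pretokenized` is a hypothetical helper, not part of transformers:

```python
import inspect

# Old (removed) keyword -> TypeError on recent transformers:
#   tokenizer(sentence, is_pretokenized=True, ...)
# Current keyword:
#   tokenizer(sentence, is_split_into_words=True, ...)
#
# Hypothetical helper: pass the pre-tokenized flag under whichever
# keyword the installed tokenizer's __call__ actually accepts.
def encode_pretokenized(tokenizer, words, **kwargs):
    params = inspect.signature(tokenizer.__call__).parameters
    key = ("is_split_into_words"
           if "is_split_into_words" in params
           else "is_pretokenized")
    return tokenizer(words, **{key: True}, **kwargs)
```

With transformers 4.19.2 (the version reported below), the helper resolves to `is_split_into_words`.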

Who can help?

@SaulLu

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. Download the NER Dataset from the Kaggle link (https://www.kaggle.com/datasets/namanj27/ner-dataset)
  2. Use the script below with the mentioned versions of transformers and tokenizers:

     tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
     for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["input_ids"]), training_set[0]["labels"]):
       print('{0:10}  {1}'.format(token, label))

Expected behavior

I expect the script above to print each token alongside its label.

Python Version: 3.9
tokenizers-0.12.1 
transformers-4.19.2

Can anyone shed some light, please?

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 20 (4 by maintainers)

Most upvoted comments

I am having the same problem

Here is the output of `transformers-cli env`:

- `transformers` version: 4.25.1
- Platform: Linux-5.10.133+-x86_64-with-glibc2.27
- Python version: 3.8.16
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.13.0+cu116 (True)
- Tensorflow version (GPU?): 2.9.2 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

You can also find the Colab notebook here.