transformers: BertTokenizer: ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
🐛 Bug
Information
Tokenizer I am using is BertTokenizer. I've also tried AlbertTokenizer, but it makes no difference, so I think the bug is in the base tokenizer.
Language I am using the model on is English, but I don't believe that's the issue.
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Version:
transformers==2.11.0
- Run this code
from transformers import BertModel, BertTokenizer
text = 'A quick brown fox jumps over' # Just a dummy text
BertTokenizer.encode_plus(
    text.split(' '),
    None,
    add_special_tokens = True,
    max_length = 512)
- This should be the error
Traceback (most recent call last):
  File "classification.py", line 23, in <module>
    max_length = 512)
  File "D:\Programmering\Python\lib\site-packages\transformers\tokenization_utils.py", line 1576, in encode_plus
    first_ids = get_input_ids(text)
  File "D:\Programmering\Python\lib\site-packages\transformers\tokenization_utils.py", line 1556, in get_input_ids
    "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
And yes, I've tried just inputting a string, and I still got the same error.
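One plausible reading of the snippet above (an assumption, not something confirmed in this part of the thread) is that encode_plus is called on the BertTokenizer class rather than on a tokenizer instance. In Python, calling an instance method on the class binds the first positional argument to self, so the word list becomes self and None becomes text. That would also explain why passing a plain string gives the same error. The sketch below uses a hypothetical ToyTokenizer stand-in, not the transformers implementation, so that it runs without downloading a model.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer; NOT the transformers implementation."""

    def encode_plus(self, text, text_pair=None, add_special_tokens=True, max_length=None):
        # Mimic the input validation that produces the error in the traceback.
        if not isinstance(text, (str, list, tuple)):
            raise ValueError(
                "Input is not valid. Should be a string, a list/tuple of strings "
                "or a list/tuple of integers."
            )
        tokens = text.split(" ") if isinstance(text, str) else list(text)
        return {"tokens": tokens[:max_length]}


words = "A quick brown fox jumps over".split(" ")

# Called on the class: `words` is bound to `self`, and `None` becomes `text`.
try:
    ToyTokenizer.encode_plus(words, None, add_special_tokens=True, max_length=512)
    class_call_error = None
except ValueError as exc:
    class_call_error = str(exc)

# Called on an instance: `words` is bound to `text`, as intended.
tokenizer = ToyTokenizer()
encoded = tokenizer.encode_plus(words, None, add_special_tokens=True, max_length=512)

print(class_call_error)
print(encoded["tokens"])
```

With transformers itself, the equivalent change would be to create an instance first, for example with BertTokenizer.from_pretrained('bert-base-uncased'), and call encode_plus on that instance.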
Expected behavior
I want the encode_plus function to return an encoded version of the input sequence.
Environment info
- transformers version: 2.11.0
- Platform: Windows
- Python version: 3.7.4
- PyTorch version (GPU?): 1.5.0+cpu
- Tensorflow version (GPU?): (Not used)
- Using GPU in script?: Nope
- Using distributed or parallel set-up in script?: No
About this issue
- State: closed
- Created 4 years ago
- Comments: 18 (2 by maintainers)
One Possible Reason for "ValueError: Input is not valid" (caused by pandas)
In my case, I got this problem because one sample row in the dataset is the string "None". When I load the dataset, pandas automatically converts the string "None" into the missing value NaN (a float, not a string), which causes the ValueError during tokenization. The same problem can occur when you use the Hugging Face load_dataset or dataset.from_csv functions, because they seem to use pandas for reading files.
Here is the error my code ran into:
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
I recommend checking whether your dataset contains special strings like "None", "NA", etc.
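A quick way to run that check is to scan the text column for anything that is not a string before tokenizing. The helper below is a minimal sketch; find_invalid_rows is a hypothetical name, and it assumes each row is meant to be a single plain string.

```python
def find_invalid_rows(texts):
    """Return (index, type name) for every entry that is not a plain string."""
    invalid = []
    for index, value in enumerate(texts):
        if not isinstance(value, str):
            invalid.append((index, type(value).__name__))
    return invalid


# A missing value read from a file typically shows up as float NaN or None.
rows = ["A quick brown fox", float("nan"), None, "jumps over"]
bad = find_invalid_rows(rows)
print(bad)
```

Any index reported here points at a row that a tokenizer expecting strings would reject.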
My solution
testset = pd.read_csv('./rotate_tomato/test.tsv', sep='\t', keep_default_na=False)
I simply set keep_default_na=False to prevent pandas from detecting NA values and converting them into NaN. After that, everything works.
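The effect of keep_default_na can be seen on a small in-memory file. This is a sketch with made-up data; it uses the string "NA" because the exact set of strings pandas treats as missing by default depends on the pandas version.

```python
import io

import pandas as pd

# Made-up two-row TSV; the second text value is the literal string "NA".
raw = "text\tlabel\nA quick brown fox\t1\nNA\t0\n"

# Default behaviour: "NA" is parsed as a missing value.
default_df = pd.read_csv(io.StringIO(raw), sep="\t")

# With keep_default_na=False the literal string is preserved.
kept_df = pd.read_csv(io.StringIO(raw), sep="\t", keep_default_na=False)

default_value = default_df["text"][1]
kept_value = kept_df["text"][1]

print(pd.isna(default_value))
print(repr(kept_value))
```

Note that keep_default_na=False also keeps empty fields as empty strings, so genuinely missing rows still need to be handled separately.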
I'm still facing the same issue while trying to run run_squad.py: ValueError: Input [] is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers. I'm trying to train and test it with: https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
Hi @mariusjohan, we welcome all models here. The hub is a very easy way to share models. The way you're training it will surely be different from other trainings, so sharing it on the hub with details of how you trained it is always welcome!
@LysandreJik, while I have you: I know this isn't the right place to ask, but
I've seen that you're about to release the Electra modeling for question answering. I've written a small script for training the Electra discriminator for question answering, and I'm about to train the model. Would it be useful for you if I trained the model, or are you already doing that?