transformers: BertTokenizer: ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

šŸ› Bug

Information

Tokenizer I am using is BertTokenizer, and I've also tried AlbertTokenizer, but that has no effect, so I suspect the bug is in the base tokenizer.

Language I am using the model on is English, but I don't believe that's the issue.

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Version: transformers==2.11.0
  2. Run this code
from transformers import BertModel, BertTokenizer
text = 'A quick brown fox jumps over' # Just a dummy text
BertTokenizer.encode_plus(
    text.split(' '),
    None,
    add_special_tokens = True,
    max_length = 512)
  3. This should be the error:
Traceback (most recent call last):
  File "classification.py", line 23, in <module>
    max_length = 512)
  File "D:\Programmering\Python\lib\site-packages\transformers\tokenization_utils.py", line 1576, in encode_plus
    first_ids = get_input_ids(text)
  File "D:\Programmering\Python\lib\site-packages\transformers\tokenization_utils.py", line 1556, in get_input_ids
    "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

And yes, I've tried just inputting a string, and I still got the same error.
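An aside for anyone hitting this: in the snippet above, `encode_plus` is called on the `BertTokenizer` class itself rather than on an instance created with `from_pretrained(...)`. In Python 3 that makes the first positional argument (the word list) bind to `self`, so the `text` parameter stays `None` and fails the string check, producing exactly this ValueError. A minimal sketch with a hypothetical stand-in class (no model download needed) shows the mechanism:

```python
class FakeTokenizer:
    """Hypothetical stand-in mimicking the encode_plus signature."""

    def encode_plus(self, text=None, text_pair=None, **kwargs):
        # Simplified version of the check in tokenization_utils.py
        if not isinstance(text, (str, list, tuple)):
            raise ValueError(
                "Input is not valid. Should be a string, a list/tuple of "
                "strings or a list/tuple of integers."
            )
        return {"input_ids": text}

words = "A quick brown fox jumps over".split(" ")

# Class-level call (as in the repro): `words` binds to `self`,
# so `text` stays None and the validity check fails.
try:
    FakeTokenizer.encode_plus(words, None, add_special_tokens=True)
except ValueError as e:
    print("class call raised:", e)

# Instance call: `text` receives the word list and encoding proceeds.
print("instance call:", FakeTokenizer().encode_plus(words, None))
```

With the real library the fix has the same shape: first `tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')`, then call `tokenizer.encode_plus(...)` on that instance.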

Expected behavior

I expect the encode_plus function to return an encoded version of the input sequence.

Environment info

  • transformers version: 2.11.0
  • Platform: Windows
  • Python version: 3.7.4
  • PyTorch version (GPU?): 1.5.0+cpu
  • Tensorflow version (GPU?): (Not used)
  • Using GPU in script?: Nope
  • Using distributed or parallel set-up in script?: No

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 18 (2 by maintainers)

Most upvoted comments

One possible reason for 'ValueError: Input is not valid' (caused by pandas)

In my case, I got this error because one sample row in the dataset was the string 'None'. When loading the dataset, pandas automatically converts the string 'None' into NaN, which then causes the ValueError during tokenization. The same problem occurs when you use Hugging Face's load_dataset or Dataset.from_csv, because they appear to use pandas under the hood for reading files.

Here is the code where I hit the problem:

from datasets import Dataset
from transformers import DistilBertTokenizer
import pandas as pd

testset = pd.read_csv('./rotate_tomato/test.tsv', sep='\t')
testset = Dataset.from_pandas(testset)
model = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model)

def tokz(x):
    return tokenizer(x['Phrase'], padding=True, truncation=True, return_tensors="pt")

testset_tokz = testset.map(tokz, batched=True)

ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

I recommend checking whether your dataset contains special strings like 'None', 'NA', etc.

My solution

testset = pd.read_csv('./rotate_tomato/test.tsv', sep='\t', keep_default_na=False)

I simply set keep_default_na=False to prevent pandas from detecting NA values and converting them to NaN, and then everything works.
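The pandas behaviour can be seen in isolation; this sketch uses inline data instead of the TSV file. By default, `read_csv` parses the literal string 'None' (which is in pandas' default NA list) as NaN, while `keep_default_na=False` keeps it as text:

```python
import io

import pandas as pd

tsv = "Phrase\nA great movie\nNone\n"

# Default parsing: the literal string 'None' becomes NaN (a float),
# which later fails the tokenizer's string check.
default = pd.read_csv(io.StringIO(tsv), sep="\t")
print(default["Phrase"].tolist())

# keep_default_na=False leaves 'None' as a plain string.
fixed = pd.read_csv(io.StringIO(tsv), sep="\t", keep_default_na=False)
print(fixed["Phrase"].tolist())
```

If the NaN rows are genuinely missing data rather than real text, an alternative is to drop them after loading, e.g. `testset.dropna(subset=['Phrase'])`.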

I'm still facing the same issue (ValueError: Input [] is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.) while trying to run run_squad.py. I'm trying to train and test it with:
https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Hi @mariusjohan, we welcome all models here 😃 The hub is a very easy way to share models. The way you're training it will surely be different from other trainings, so sharing it on the hub with details of how you trained it is always welcome!

@LysandreJik, while I have you. I know this isn't the right place to ask, but:

I've seen that you're about to release the Electra modeling for question answering, and I've written a small script for training the Electra discriminator for question answering and am about to train the model. Would it be useful to you if I trained it, or are you already doing that?