transformers: BertTokenizer: ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

šŸ› Bug

Information

The tokenizer I am using is BertTokenizer. I've also tried AlbertTokenizer, but it makes no difference, so I suspect the bug is in the base tokenizer.

The language I am using the model on is English, but I don't believe that's the issue.

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Version: transformers==2.11.0
  2. Run this code
from transformers import BertModel, BertTokenizer
text = 'A quick brown fox jumps over' # Just a dummy text
BertTokenizer.encode_plus(
    text.split(' '),
    None,
    add_special_tokens = True,
    max_length = 512)
  3. This should be the error:
Traceback (most recent call last):
  File "classification.py", line 23, in <module>
    max_length = 512)
  File "D:\Programmering\Python\lib\site-packages\transformers\tokenization_utils.py", line 1576, in encode_plus
    first_ids = get_input_ids(text)
  File "D:\Programmering\Python\lib\site-packages\transformers\tokenization_utils.py", line 1556, in get_input_ids
    "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

And yes, I've tried inputting just a string, and I still got the same error.

Expected behavior

I expect the encode_plus function to return an encoded version of the input sequence.

Environment info

  • transformers version: 2.11.0
  • Platform: Windows
  • Python version: 3.7.4
  • PyTorch version (GPU?): 1.5.0+cpu
  • Tensorflow version (GPU?): (Not used)
  • Using GPU in script?: Nope
  • Using distributed or parallel set-up in script?: No

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 18 (2 by maintainers)

Most upvoted comments

One possible reason for 'ValueError: Input is not valid' (caused by pandas)

In my case, I got this error because one sample row in the dataset contained the string 'None'. When the dataset is loaded, pandas automatically converts the string 'None' into a float NaN, which then causes the ValueError during tokenization. The same problem occurs when you use the Hugging Face load_dataset or Dataset.from_csv functions, since they appear to use pandas under the hood for reading files.

Here is the code that triggers the problem:

from datasets import Dataset
from transformers import DistilBertTokenizer
import pandas as pd

testset = pd.read_csv('./rotate_tomato/test.tsv', sep='\t')
testset = Dataset.from_pandas(testset)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

def tokz(x):
    return tokenizer(x['Phrase'], padding=True, truncation=True, return_tensors="pt")

testset_tokz = testset.map(tokz, batched=True)

ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

I recommend checking whether your dataset contains special strings like 'None', 'NA', etc.

My solution

testset = pd.read_csv('./rotate_tomato/test.tsv', sep='\t', keep_default_na=False)

I simply set keep_default_na=False to prevent pandas from detecting NA values and converting them into NaN, and then everything works.
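The pandas behavior described above can be reproduced in isolation (the TSV content here is made up for illustration; the rotate_tomato file is not available):

```python
# Sketch: by default, read_csv turns the literal string 'None' (and 'NA',
# 'NaN', ...) into a float NaN, which the tokenizer later rejects.
# keep_default_na=False keeps it as the string 'None'.
import io
import pandas as pd

tsv = "Phrase\ngood movie\nNone\n"

default = pd.read_csv(io.StringIO(tsv), sep='\t')
print(type(default['Phrase'][1]))   # float NaN, not a valid tokenizer input

kept = pd.read_csv(io.StringIO(tsv), sep='\t', keep_default_na=False)
print(type(kept['Phrase'][1]))      # str 'None', tokenizes fine
```

If you cannot control the CSV parsing, an alternative is to drop or fill the NaN rows (e.g. with dropna or fillna('')) before mapping the tokenizer over the dataset.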

I'm still facing the same issue, ValueError: Input [] is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers, while trying to run run_squad.py. I'm trying to train and test it with: https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Hi @mariusjohan, we welcome all models here 😃 The hub is a very easy way to share models. The way you’re training it will surely be different to other trainings, so sharing it on the hub with details of how you trained it is always welcome!

@LysandreJik, while I have you: I know this isn't the right place to ask, but.

I've seen that you're about to release the Electra model for question answering, and I've written a small script for training the Electra discriminator on question answering; I'm about to train the model. Would it be useful to you if I trained it, or are you already doing that?