transformers: IndexError: index out of range in self

šŸ› Bug

Information

The model I am using is BERT (bert-large-uncased), and I am facing two issues related to it.

The language I am using the model on is English.

The problem arises when I try to encode a long sentence (about 500 words); I get this error:

IndexError: index out of range in self

I tried setting the maximum length to 400 words, but I still get the same error.

The data I am using can be downloaded like this:

from sklearn.datasets import fetch_20newsgroups
import re

categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',categories=categories, shuffle=True, random_state=42)

print("\n".join(twenty_train.data[0].split("\n")[:3]))


X_tratado = []

for email in range(0, len(twenty_train.data)):

    # Remove special characters
    texto = re.sub(r'\\r\\n', ' ', str(twenty_train.data[email]))
    texto = re.sub(r'\W', ' ', texto)

    # Remove single-letter characters
    texto = re.sub(r'\s+[a-zA-Z]\s+', ' ', texto)
    texto = re.sub(r'\^[a-zA-Z]\s+', ' ', texto)

    # Replace multiple spaces with a single space
    texto = re.sub(r'\s+', ' ', texto, flags=re.I)

    # Remove the 'b' that appears at the beginning
    texto = re.sub(r'^b\s+', '', texto)

    # Convert to lowercase
    texto = texto.lower()

    X_tratado.append(texto)

dr = {}
dr['text'] = X_tratado
dr['labels'] = twenty_train.target

Now I am using the BERT model to encode the sentences:

from transformers import BertModel, BertConfig, BertTokenizer
import torch
    
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model     = BertModel.from_pretrained('bert-large-uncased')
inputs    = tokenizer(datar[7], return_tensors="pt")
outputs   = model(**inputs)
features  = outputs[0][:,0,:].detach().numpy().squeeze()

This gives the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-41-5dcf440b245f> in <module>
      5 model     = BertModel.from_pretrained('bert-large-uncased')
      6 inputs    = tokenizer(datar[7], return_tensors="pt")
----> 7 outputs   = model(**inputs)
      8 features  = outputs[0][:,0,:].detach().numpy().squeeze()

~/tfproject/tfenv/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/tfproject/tfenv/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states)
    751 
    752         embedding_output = self.embeddings(
--> 753             input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
    754         )
    755         encoder_outputs = self.encoder(

~/tfproject/tfenv/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/tfproject/tfenv/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
    177         if inputs_embeds is None:
    178             inputs_embeds = self.word_embeddings(input_ids)
--> 179         position_embeddings = self.position_embeddings(position_ids)
    180         token_type_embeddings = self.token_type_embeddings(token_type_ids)
    181 

~/tfproject/tfenv/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/tfproject/tfenv/lib/python3.7/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    112         return F.embedding(
    113             input, self.weight, self.padding_idx, self.max_norm,
--> 114             self.norm_type, self.scale_grad_by_freq, self.sparse)
    115 
    116     def extra_repr(self):

~/tfproject/tfenv/lib/python3.7/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1722         # remove once script supports set_grad_enabled
   1723         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1724     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1725 
   1726 

IndexError: index out of range in self

The second issue I am facing: when I use this BERT model to encode many sentences, it seems BERT is not using the GPU:

(Screenshot showing the GPU is not being used while encoding.)

How can I use the GPU to accelerate the BERT model?

Environment info

  • transformers version: 3.0.0
  • Platform: Ubuntu 18.04.4 LTS
  • Python version: python3.7
  • PyTorch version (GPU?):
  • Tensorflow version (GPU?): 2.2.0
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 16 (2 by maintainers)

Most upvoted comments

Most likely there is a mismatch between the vocabulary size of the tokenizer and the BERT model (in the BERT config). Try setting the vocab size of your tokenizer in the BERT config while initializing your model.
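
For reference, a sketch of what that alignment can look like when the mismatch comes from a custom or extended tokenizer (the resize call below is an assumption about that setup, not something from the original report):

from transformers import BertConfig, BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')

# Training from scratch: initialize the model with the tokenizer's vocab size.
config = BertConfig(vocab_size=len(tokenizer))
model  = BertModel(config)

# Or, for a pretrained checkpoint with added tokens: resize the embedding table
# so every token id produced by the tokenizer has a row in it.
model = BertModel.from_pretrained('bert-large-uncased')
model.resize_token_embeddings(len(tokenizer))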

Hi @monk1337, the error here is because you’ve called the model with a sequence that is longer than 512 tokens. BERT-like models have a fixed limit on sequence length, which is often 512 or 1024.
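
One way to avoid the error is to let the tokenizer truncate long inputs to the model's limit, for example (a sketch reusing the snippet above; datar[7] is the same input as before):

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model     = BertModel.from_pretrained('bert-large-uncased')

# Truncate anything longer than BERT's 512-token limit instead of passing
# out-of-range position ids to the model.
inputs  = tokenizer(datar[7], return_tensors="pt", truncation=True, max_length=512)
outputs = model(**inputs)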

For your second question: indeed, your model is not on your GPU. With PyTorch, you have to cast your model to the device you want it to run on, so you would have to do something like:

from transformers import BertModel, BertConfig, BertTokenizer
import torch
    
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model     = BertModel.from_pretrained('bert-large-uncased')
inputs    = tokenizer(datar[7], return_tensors="pt")

model.cuda()
inputs = {k: v.cuda() for k, v in inputs.items()}

outputs   = model(**inputs)
features  = outputs[0][:,0,:].detach().cpu().numpy().squeeze()

Please note I’ve also cast the input tensors to the GPU, as the model inputs need to be on the same device as the model.
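
A device-agnostic variant of the same idea (a sketch; it falls back to the CPU when no GPU is available, and reuses model and inputs from the snippet above):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

outputs  = model(**inputs)
# .cpu() is needed before .numpy() whenever the output tensors live on the GPU.
features = outputs[0][:, 0, :].detach().cpu().numpy().squeeze()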

I recommend looking at the CUDA part of the 60-minute blitz tutorial on the PyTorch website to get an understanding of the CUDA semantics.

Closing this for now, let me know if you have other issues.

Hi @monk1337, the error here is because you’ve called the model with a sequence that is longer than 512 tokens. BERT-like models have a fixed limit on sequence length, which is often 512 or 1024.

@LysandreJik Is there any way we can change the limit? I am trying to process a large document. I am using facebook/bart-large-cnn.

Thanks.

Try using the Longformer transformer. The pre-trained ones on Hugging Face can process up to 16k tokens. I used it for my dissertation, where I was processing large documents.
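
For reference, a sketch of how that can look with the allenai/longformer-base-4096 checkpoint (this particular checkpoint handles up to 4,096 tokens; long_document is a placeholder for your text):

from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model     = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# The long attention window lets much longer documents through without hitting
# BERT's 512-token position-embedding limit.
inputs  = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=4096)
outputs = model(**inputs)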

Thanks, that’s it actually. I also realised it too late… So much time lost 😄

Most likely there is a mismatch between the vocabulary size of the tokenizer and the BERT model (in the BERT config). Try setting the vocab size of your tokenizer in the BERT config while initializing your model.

Thanks very much. It works for me after making vocab_size larger in the BERT config.