txtai: embeddings.index Truncation RuntimeError: The size of tensor a (889) must match the size of tensor b (512) at non-singleton dimension 1
Hello, when I try to run the indexing step, I get the error below, preceded by this warning:
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-33-6e863ca8aecc> in <module>
----> 1 embeddings.index(to_index)
~\Anaconda3\envs\bert2\lib\site-packages\txtai\embeddings.py in index(self, documents)
80
81 # Transform documents to embeddings vectors
---> 82 ids, dimensions, stream = self.model.index(documents)
83
84 # Load streamed embeddings back to memory
~\Anaconda3\envs\bert2\lib\site-packages\txtai\vectors.py in index(self, documents)
245 if len(batch) == 500:
246 # Convert batch to embeddings
--> 247 uids, dimensions = self.batch(batch, output)
248 ids.extend(uids)
249
~\Anaconda3\envs\bert2\lib\site-packages\txtai\vectors.py in batch(self, documents, output)
279
280 # Build embeddings
--> 281 embeddings = self.model.encode(documents, show_progress_bar=False)
282 for embedding in embeddings:
283 if not dimensions:
~\Anaconda3\envs\bert2\lib\site-packages\sentence_transformers\SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, device, normalize_embeddings)
192
193 with torch.no_grad():
--> 194 out_features = self.forward(features)
195
196 if output_value == 'token_embeddings':
~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\container.py in forward(self, input)
117 def forward(self, input):
118 for module in self:
--> 119 input = module(input)
120 return input
121
~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),
~\Anaconda3\envs\bert2\lib\site-packages\sentence_transformers\models\Transformer.py in forward(self, features)
36 trans_features['token_type_ids'] = features['token_type_ids']
37
---> 38 output_states = self.auto_model(**trans_features, return_dict=False)
39 output_tokens = output_states[0]
40
~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),
~\Anaconda3\envs\bert2\lib\site-packages\transformers\models\bert\modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
962 head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
963
--> 964 embedding_output = self.embeddings(
965 input_ids=input_ids,
966 position_ids=position_ids,
~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),
~\Anaconda3\envs\bert2\lib\site-packages\transformers\models\bert\modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
205 if self.position_embedding_type == "absolute":
206 position_embeddings = self.position_embeddings(position_ids)
--> 207 embeddings += position_embeddings
208 embeddings = self.LayerNorm(embeddings)
209 embeddings = self.dropout(embeddings)
RuntimeError: The size of tensor a (889) must match the size of tensor b (512) at non-singleton dimension 1
Where to_index =
[('0015023cc06b5362d332b3baf348d11567ca2fbb',
'The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3\nword count: 194 22 Text word count: 5168 23 24 25 author/funder. All rights reserved. No reuse allowed without permission. Abstract 27 The positive stranded RNA genomes of picornaviruses comprise a single large open reading 28 frame flanked by 5′ and 3′ untranslated regions (UTRs). Foot-and-mouth disease virus (FMDV) 29 has an unusually large 5′ UTR (1.3 kb) containing five structural domains. These include the 30 internal ribosome entry site (IRES), which facilitates initiation of translation, and the cis-acting 31 replication element (cre). Less well characterised structures are a 5′ terminal 360 nucleotide 32 stem-loop, a variable length poly-C-tract of approximately 100-200 nucleotides and a series of 33 two to four tandemly repeated pseudoknots (PKs). We investigated the structures of the PKs 34 by selective 2′ hydroxyl acetylation analysed by primer extension (SHAPE) analysis and 35 determined their contribution to genome replication by mutation and deletion experiments. 36 SHAPE and mutation experiments confirmed the importance of the previously predicted PK 37 structures for their function. Deletion experiments showed that although PKs are not essential 38',
None),
('00340eea543336d54adda18236424de6a5e91c9d',
'Analysis Title: Regaining perspective on SARS-CoV-2 molecular tracing and its implications\nDuring the past three months, a new coronavirus (SARS-CoV-2) epidemic has been growing exponentially, affecting over 100 thousand people worldwide, and causing enormous distress to economies and societies of affected countries. A plethora of analyses based on viral sequences has already been published, in scientific journals as well as through non-peer reviewed channels, to investigate SARS-CoV-2 genetic heterogeneity and spatiotemporal dissemination. We examined all full genome sequences currently available to assess the presence of sufficient information for reliable phylogenetic and phylogeographic studies. Our analysis clearly shows severe limitations in the present data, in light of which any finding should be considered, at the very best, preliminary and hypothesis-generating. Hence the need for avoiding stigmatization based on partial information, and for continuing concerted efforts to increase number and quality of the sequences required for robust tracing of the epidemic.',
None),
('004f0f8bb66cf446678dc13cf2701feec4f36d76',
'Healthcare-resource-adjusted vulnerabilities towards the 2019-nCoV epidemic across China\n',
None), ...]
How do I fix this? I don't see anything about this in the documentation. I assume the warning:
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length.
Default to no truncation.
is related, and that I need to set a max_length of 512 so that any document longer than 512 tokens gets truncated, but I don't see anywhere how to do that…
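For anyone hitting this before a fix lands, one possible workaround is to pre-truncate each document with the model's own tokenizer before calling embeddings.index. This is only a sketch: it assumes a BERT-style checkpoint with a 512-slot position-embedding table, and bert-base-uncased below is a placeholder for whatever model your embeddings instance actually loads.
# Workaround sketch: truncate documents to 512 tokens before indexing,
# so BERT's 512-slot position-embedding table is never exceeded.
# "bert-base-uncased" is a placeholder checkpoint name.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def truncate(text, max_length=512):
    # Encode with truncation, then decode back to a plain string so the
    # shortened document can flow through txtai unchanged. Re-tokenizing
    # the decoded text may differ by a token or two, so leave a little
    # headroom below 512 if needed.
    ids = tokenizer.encode(text, max_length=max_length, truncation=True)
    return tokenizer.decode(ids, skip_special_tokens=True)

to_index = [(uid, truncate(text), tags) for uid, text, tags in to_index]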
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (9 by maintainers)
Commits related to this issue
- Synchronize truncation/max length logic between embeddings and pipelines. Addresses #74 and #79 — committed to neuml/txtai by davidmezzetti 3 years ago
I can confirm that as of the latest commit, my code using a pretrained BlueBERT model works without running into the issue! Thanks for the fix!
Just committed a fix that should address both embeddings and pipelines. The maxlength parameter is no longer needed; txtai will fall back to the max_position_embeddings config parameter when a maximum length is not detected in the tokenizer.
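In outline, the fallback described above looks roughly like the following. This is an illustrative sketch, not txtai's actual code; resolve_max_length is a hypothetical helper name.
# Sketch of the fallback logic: prefer the tokenizer's declared limit,
# otherwise use the model config's max_position_embeddings.
def resolve_max_length(tokenizer, config):
    length = getattr(tokenizer, "model_max_length", None)
    # Tokenizers with no predefined limit report a huge sentinel value.
    if length is None or length > 10**6:
        # Fall back to the model config, e.g. 512 for BERT.
        length = getattr(config, "max_position_embeddings", None)
    return length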
Actually, pip install git+https://github.com/neuml/txtai still gives me the error, whether I use pip install sentence-transformers==0.4.1 or the most recent version of sentence-transformers. I think there may be an issue with the fix you committed? The only combination that has worked for me was the PyPI version of txtai with sentence-transformers version 0.4.1.
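For reference, a quick way to confirm which builds are active when testing these combinations (assuming standard packaging metadata):
# Print installed versions to verify the combination under test
# (PyPI txtai with sentence-transformers 0.4.1 was reported to work).
import pkg_resources
import sentence_transformers

print(pkg_resources.get_distribution("txtai").version)
print(sentence_transformers.__version__)  # expect 0.4.1 for the working combo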