txtai: embeddings.index Truncation RuntimeError: The size of tensor a (889) must match the size of tensor b (512) at non-singleton dimension 1
Hello, when I try to run the indexing step, I get the error below, preceded by this warning:
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-33-6e863ca8aecc> in <module>
----> 1 embeddings.index(to_index)
~\Anaconda3\envs\bert2\lib\site-packages\txtai\embeddings.py in index(self, documents)
80
81 # Transform documents to embeddings vectors
---> 82 ids, dimensions, stream = self.model.index(documents)
83
84 # Load streamed embeddings back to memory
~\Anaconda3\envs\bert2\lib\site-packages\txtai\vectors.py in index(self, documents)
245 if len(batch) == 500:
246 # Convert batch to embeddings
--> 247 uids, dimensions = self.batch(batch, output)
248 ids.extend(uids)
249
~\Anaconda3\envs\bert2\lib\site-packages\txtai\vectors.py in batch(self, documents, output)
279
280 # Build embeddings
--> 281 embeddings = self.model.encode(documents, show_progress_bar=False)
282 for embedding in embeddings:
283 if not dimensions:
~\Anaconda3\envs\bert2\lib\site-packages\sentence_transformers\SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, device, normalize_embeddings)
192
193 with torch.no_grad():
--> 194 out_features = self.forward(features)
195
196 if output_value == 'token_embeddings':
~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\container.py in forward(self, input)
117 def forward(self, input):
118 for module in self:
--> 119 input = module(input)
120 return input
121
~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),
~\Anaconda3\envs\bert2\lib\site-packages\sentence_transformers\models\Transformer.py in forward(self, features)
36 trans_features['token_type_ids'] = features['token_type_ids']
37
---> 38 output_states = self.auto_model(**trans_features, return_dict=False)
39 output_tokens = output_states[0]
40
~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),
~\Anaconda3\envs\bert2\lib\site-packages\transformers\models\bert\modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
962 head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
963
--> 964 embedding_output = self.embeddings(
965 input_ids=input_ids,
966 position_ids=position_ids,
~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),
~\Anaconda3\envs\bert2\lib\site-packages\transformers\models\bert\modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
205 if self.position_embedding_type == "absolute":
206 position_embeddings = self.position_embeddings(position_ids)
--> 207 embeddings += position_embeddings
208 embeddings = self.LayerNorm(embeddings)
209 embeddings = self.dropout(embeddings)
RuntimeError: The size of tensor a (889) must match the size of tensor b (512) at non-singleton dimension 1
Where to_index =
[('0015023cc06b5362d332b3baf348d11567ca2fbb',
'The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3\nword count: 194 22 Text word count: 5168 23 24 25 author/funder. All rights reserved. No reuse allowed without permission. Abstract 27 The positive stranded RNA genomes of picornaviruses comprise a single large open reading 28 frame flanked by 5′ and 3′ untranslated regions (UTRs). Foot-and-mouth disease virus (FMDV) 29 has an unusually large 5′ UTR (1.3 kb) containing five structural domains. These include the 30 internal ribosome entry site (IRES), which facilitates initiation of translation, and the cis-acting 31 replication element (cre). Less well characterised structures are a 5′ terminal 360 nucleotide 32 stem-loop, a variable length poly-C-tract of approximately 100-200 nucleotides and a series of 33 two to four tandemly repeated pseudoknots (PKs). We investigated the structures of the PKs 34 by selective 2′ hydroxyl acetylation analysed by primer extension (SHAPE) analysis and 35 determined their contribution to genome replication by mutation and deletion experiments. 36 SHAPE and mutation experiments confirmed the importance of the previously predicted PK 37 structures for their function. Deletion experiments showed that although PKs are not essential 38',
None),
('00340eea543336d54adda18236424de6a5e91c9d',
'Analysis Title: Regaining perspective on SARS-CoV-2 molecular tracing and its implications\nDuring the past three months, a new coronavirus (SARS-CoV-2) epidemic has been growing exponentially, affecting over 100 thousand people worldwide, and causing enormous distress to economies and societies of affected countries. A plethora of analyses based on viral sequences has already been published, in scientific journals as well as through non-peer reviewed channels, to investigate SARS-CoV-2 genetic heterogeneity and spatiotemporal dissemination. We examined all full genome sequences currently available to assess the presence of sufficient information for reliable phylogenetic and phylogeographic studies. Our analysis clearly shows severe limitations in the present data, in light of which any finding should be considered, at the very best, preliminary and hypothesis-generating. Hence the need for avoiding stigmatization based on partial information, and for continuing concerted efforts to increase number and quality of the sequences required for robust tracing of the epidemic.',
None),
('004f0f8bb66cf446678dc13cf2701feec4f36d76',
'Healthcare-resource-adjusted vulnerabilities towards the 2019-nCoV epidemic across China\n',
None), ...]
How do I fix this? I don't see anything about this in the documentation. I assume the warning:
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length.
Default to no truncation.
is related, and that I need to set a max_length of 512 so that any document longer than 512 tokens gets truncated, but I don't see anywhere how to do that…
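For anyone hitting this before a fix lands, one possible workaround is to pre-truncate each document with the model's own tokenizer before calling embeddings.index. This is only a sketch: it assumes a BERT-style checkpoint with a 512-slot position-embedding table, and bert-base-uncased below is a placeholder for whatever model your embeddings instance actually loads.
# Workaround sketch: truncate documents to 512 tokens before indexing,
# so BERT's 512-slot position-embedding table is never exceeded.
# "bert-base-uncased" is a placeholder checkpoint name.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def truncate(text, max_length=512):
    # Encode with truncation, then decode back to a plain string so the
    # shortened document can flow through txtai unchanged. Re-tokenizing
    # the decoded text may differ by a token or two, so leave a little
    # headroom below 512 if needed.
    ids = tokenizer.encode(text, max_length=max_length, truncation=True)
    return tokenizer.decode(ids, skip_special_tokens=True)

to_index = [(uid, truncate(text), tags) for uid, text, tags in to_index]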
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (9 by maintainers)
Commits related to this issue
- Synchronize truncation/max length logic between embeddings and pipelines. Addresses #74 and #79 — committed to neuml/txtai by davidmezzetti 3 years ago
I can confirm that as of the latest commit, my code using a pretrained BlueBERT model works without running into the issue! Thanks for the fix!
Just committed a fix that should address both embeddings and pipelines. The maxlength parameter is no longer needed; txtai will fall back to the max_position_embeddings config parameter when a maximum length is not detected in the tokenizer.
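In outline, the fallback described above looks roughly like the following. This is an illustrative sketch, not txtai's actual code; resolve_max_length is a hypothetical helper name.
# Sketch of the fallback logic: prefer the tokenizer's declared limit,
# otherwise use the model config's max_position_embeddings.
def resolve_max_length(tokenizer, config):
    length = getattr(tokenizer, "model_max_length", None)
    # Tokenizers with no predefined limit report a huge sentinel value.
    if length is None or length > 10**6:
        # Fall back to the model config, e.g. 512 for BERT.
        length = getattr(config, "max_position_embeddings", None)
    return length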
Actually, pip install git+https://github.com/neuml/txtai still gives me the error, whether I use pip install sentence-transformers==0.4.1 or the most recent version of sentence-transformers. I think there may be an issue with the fix you committed? The only combination that has worked for me was the PyPI version of txtai with sentence-transformers version 0.4.1.
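For reference, a quick way to confirm which builds are active when testing these combinations (assuming standard packaging metadata):
# Print installed versions to verify the combination under test
# (PyPI txtai with sentence-transformers 0.4.1 was reported to work).
import pkg_resources
import sentence_transformers

print(pkg_resources.get_distribution("txtai").version)
print(sentence_transformers.__version__)  # expect 0.4.1 for the working combo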