spaCy: SpanGroup doesn't allow access to original spans
How to reproduce the behaviour
Use case: I use a custom entity ruler to populate doc.spans, with one SpanGroup per entity label, to allow overlapping spans between different entity types. After the NER step is completed, I check for certain conditions and change kb_id for some spans. It looks like the SpanGroup class creates copies of the Spans when iterating through them, so the assigned kb_id gets lost. The only workaround I have found is to iterate through every Span in a SpanGroup and create a new group each time kb_id needs to be re-assigned. There is no way to delete or replace a particular span in a SpanGroup.
Here is a code example:
Set up
import spacy
from spacy.tokens import Span, SpanGroup
nlp = spacy.load("en_core_web_sm")
text = "Span 1, Span 2"
doc = nlp(text)
nlp('kb_id_2') # add to the vocab
span_1 = Span(doc, 0, 2, label = "TEST", kb_id = 'kb_id_1')
Iterating through SpanGroup and re-assigning kb_id
doc.spans['TEST'] = [span_1]
for span in doc.spans['TEST']:
    print("Before: %s" % span.kb_id_)
    span.kb_id = nlp.vocab.strings['kb_id_2']
    print("After: %s" % span.kb_id_)
Output:
Before: kb_id_1
After: kb_id_2
Now, iterate through the SpanGroup again:
for span in doc.spans['TEST']:
    print(span, "=>", span.kb_id_)
Output (unchanged kb_id):
Span 1 => kb_id_1
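For reference, the rebuild-the-group workaround mentioned above could look roughly like the sketch below. It assumes that assigning a list of Span objects to doc.spans replaces the whole group, as in the set-up code. The helper name reassign_kb_id and its should_update parameter are made up for illustration and are not part of spaCy. A blank English pipeline stands in for en_core_web_sm so that no model download is needed.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, used here instead of en_core_web_sm.
nlp = spacy.blank("en")
doc = nlp("Span 1, Span 2")
doc.spans["TEST"] = [Span(doc, 0, 2, label="TEST", kb_id="kb_id_1")]


def reassign_kb_id(doc, key, new_kb_id, should_update=lambda span: True):
    """Hypothetical helper: rebuild doc.spans[key] with updated kb_ids.

    Spans for which should_update(span) is false keep their current kb_id.
    """
    new_hash = doc.vocab.strings.add(new_kb_id)  # intern the string, get its hash
    rebuilt = []
    for span in doc.spans[key]:
        kb_id = new_hash if should_update(span) else span.kb_id
        rebuilt.append(Span(doc, span.start, span.end, label=span.label, kb_id=kb_id))
    # Replacing the whole group is what makes the change stick.
    doc.spans[key] = rebuilt


reassign_kb_id(doc, "TEST", "kb_id_2")
kb_ids = [span.kb_id_ for span in doc.spans["TEST"]]
print(kb_ids)
```

This works, but it rebuilds every span in the group even when only one kb_id changes, which is why it does not feel like an elegant solution.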
Any help will be greatly appreciated.
Your Environment
- spaCy version: 3.0.6
- Platform: Darwin-20.3.0-x86_64-i386-64bit
- Python version: 3.7.9
- Pipelines: en_core_web_lg (3.0.0), en_core_web_md (3.0.0), en_core_web_sm (3.0.0)
About this issue
- State: closed
- Created 3 years ago
- Comments: 15 (11 by maintainers)
Commits related to this issue
- Span/SpanGroup: wrap SpanC in shared_ptr When a Span that was retrieved from a SpanGroup was modified, these changes were not reflected in the SpanGroup because the underlying SpanC struct was copied... — committed to danieldk/spaCy by danieldk 3 years ago
- Span/SpanGroup: wrap SpanC in shared_ptr (#9869) * Span/SpanGroup: wrap SpanC in shared_ptr When a Span that was retrieved from a SpanGroup was modified, these changes were not reflected in the S... — committed to explosion/spaCy by danieldk 2 years ago
- Span/SpanGroup: wrap SpanC in shared_ptr (#9869) * Span/SpanGroup: wrap SpanC in shared_ptr When a Span that was retrieved from a SpanGroup was modified, these changes were not reflected in the S... — committed to jordankanter/spaCy by danieldk 2 years ago
Memory management and dealing with the doc references in the Span objects (vs. SpanC structs) is indeed the problem. We have discussed this internally without coming up with a good general solution thus far. We used to make all spans read-only, but that also has a lot of drawbacks. We can have a look at the setter idea, but it’s still going to be confusing for people who just try to modify the span directly.
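The difference between copying the underlying struct and sharing it, which the linked commits describe, can be illustrated with a toy model in plain Python. This is only a sketch of the general idea, not spaCy's implementation: SpanData, CopyingGroup, SharingGroup and relabel are hypothetical names, and ordinary Python object references stand in for the shared_ptr used in the actual fix.

```python
from dataclasses import dataclass, replace


@dataclass
class SpanData:
    """Hypothetical stand-in for the SpanC struct."""
    start: int
    end: int
    kb_id: str


class CopyingGroup:
    """Hands out a fresh copy of each item on every iteration."""

    def __init__(self, items):
        self._items = [replace(item) for item in items]

    def __iter__(self):
        # replace() with no changes returns a new, equal object.
        return (replace(item) for item in self._items)


class SharingGroup:
    """Hands out the stored objects themselves."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)


def relabel(group, kb_id):
    """Set kb_id on every item, then read the values back in a second pass."""
    for item in group:
        item.kb_id = kb_id
    return [item.kb_id for item in group]


copying_result = relabel(CopyingGroup([SpanData(0, 2, "kb_id_1")]), "kb_id_2")
sharing_result = relabel(SharingGroup([SpanData(0, 2, "kb_id_1")]), "kb_id_2")
print(copying_result)  # ['kb_id_1']: the edit landed on a throwaway copy
print(sharing_result)  # ['kb_id_2']: the edit is visible on the next pass
```

In the copying variant the assignment succeeds but modifies an object the group never sees again, which matches the "Before/After" output in the report followed by an unchanged kb_id on the second loop.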