spaCy: SpanGroup doesn't allow access to original spans

How to reproduce the behaviour

Use case: I use a custom entity ruler to populate doc.spans with one SpanGroup per entity label, which allows spans of different entity types to overlap. After the NER step completes, I check certain conditions and change kb_id for some spans. It looks like SpanGroup creates copies of its Spans when it is iterated, so the assigned kb_id is lost. The only workaround I have found is to iterate over every Span in the SpanGroup and build a new group each time a kb_id needs to be reassigned. There is no way to delete or replace a particular span in a SpanGroup.

Here is a code example:

Set up

import spacy
from spacy.tokens import Span, SpanGroup

nlp = spacy.load("en_core_web_sm")

text = "Span 1, Span 2"
doc = nlp(text)

nlp.vocab.strings.add('kb_id_2')  # add the new ID to the string store
span_1 = Span(doc, 0, 2, label="TEST", kb_id='kb_id_1')

Iterating through SpanGroup and re-assigning kb_id

doc.spans['TEST'] = [span_1]
for span in doc.spans['TEST']:
    print("Before: %s" % span.kb_id_)
    # Modify the span object yielded by the iteration
    span.kb_id = nlp.vocab.strings['kb_id_2']
    print("After: %s" % span.kb_id_)

Output:

Before: kb_id_1
After: kb_id_2

Now, iterate through the SpanGroup again:

for span in doc.spans['TEST']:
    print(span, "=>", span.kb_id_)

Output (unchanged kb_id):

Span 1 => kb_id_1
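
For reference, the workaround described at the top (building a new group whenever a kb_id changes) can be sketched roughly as follows. This is only a minimal sketch: it assumes a blank English pipeline instead of en_core_web_sm, since only the tokenizer is needed, and the variable names `new_kb_id`, `updated` and `kb_ids` are mine.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank English pipeline is enough here, since only the
# tokenizer is needed; the report above uses en_core_web_sm.
nlp = spacy.blank("en")
doc = nlp("Span 1, Span 2")
doc.spans["TEST"] = [Span(doc, 0, 2, label="TEST", kb_id="kb_id_1")]

# Intern the new ID so that its hash can be resolved back to a string.
new_kb_id = nlp.vocab.strings.add("kb_id_2")

# Build replacement spans instead of mutating the ones yielded by iteration.
updated = [
    Span(doc, span.start, span.end, label=span.label, kb_id=new_kb_id)
    for span in doc.spans["TEST"]
]

# Reassigning the key replaces the whole group.
doc.spans["TEST"] = updated

kb_ids = [span.kb_id_ for span in doc.spans["TEST"]]
print(kb_ids)
```

This works, but it rebuilds the entire group even when a single span changes, which is why I am asking for a way to replace or update one span in place.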

Any help will be greatly appreciated.

Your Environment

  • spaCy version: 3.0.6
  • Platform: Darwin-20.3.0-x86_64-i386-64bit
  • Python version: 3.7.9
  • Pipelines: en_core_web_lg (3.0.0), en_core_web_md (3.0.0), en_core_web_sm (3.0.0)

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (11 by maintainers)

Most upvoted comments

Memory management and dealing with the doc references in the Span objects (vs. SpanC structs) is indeed the problem. We have discussed this internally without coming up with a good general solution thus far. We used to make all spans read-only, but that also has a lot of drawbacks. We can have a look at the setter idea, but it’s still going to be confusing for people who just try to modify the span directly.
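
To make the distinction concrete, here is a toy pure-Python model of the behaviour. It is not spaCy's implementation: `SpanRecord`, `SpanView`, `ToySpanGroup` and `set_kb_id` are hypothetical names standing in for the stored struct, the Python-level span object, the group, and the setter idea respectively.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class SpanRecord:
    """Hypothetical stand-in for the plain struct stored by the group."""
    start: int
    end: int
    kb_id: str


class SpanView:
    """Hypothetical stand-in for the Python-level span object."""

    def __init__(self, record: SpanRecord) -> None:
        # Copy the fields so the view does not share state with the group.
        self.start = record.start
        self.end = record.end
        self.kb_id = record.kb_id


class ToySpanGroup:
    """Toy container that stores records and yields fresh views."""

    def __init__(self, records: List[SpanRecord]) -> None:
        self._records = list(records)

    def __iter__(self) -> Iterator[SpanView]:
        # A new view is built on every iteration.
        for record in self._records:
            yield SpanView(record)

    def set_kb_id(self, index: int, kb_id: str) -> None:
        # A setter on the group writes to the stored record itself.
        self._records[index].kb_id = kb_id


group = ToySpanGroup([SpanRecord(0, 2, "kb_id_1")])

for view in group:
    view.kb_id = "kb_id_2"  # changes only the temporary view

after_view_edit = [view.kb_id for view in group]

group.set_kb_id(0, "kb_id_2")  # changes the stored record
after_setter = [view.kb_id for view in group]
print(after_view_edit, after_setter)
```

In this model the edit made through the view is lost, while the edit made through the group's setter persists. It also shows the usability concern: nothing stops a user from assigning to the view and expecting the change to stick.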