spaCy: Adding patterns to EntityRuler and deserializing EntityRuler very slow

Adding a large number of patterns to the EntityRuler, or loading a saved EntityRuler that has a large number of patterns, is extremely slow.

This is easily reproduced with:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm')
entityruler = EntityRuler(nlp)
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
entityruler.add_patterns(patterns)

and

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
patterns = [nlp.make_doc(str(i)) for i in range(1000000)]
phrasematcher = PhraseMatcher(nlp.vocab)
phrasematcher.add("TEST", None, *patterns)

The EntityRuler code takes around 10 minutes to execute on an m5.4xlarge AWS SageMaker Notebook instance, while the PhraseMatcher code takes 20 seconds. Changing nlp.make_doc(str(i)) to nlp(str(i)) slows the PhraseMatcher down to the speed of the EntityRuler.
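
For reference, the gap is easy to measure with a small harness like the following (my own sketch, not from the original report; absolute timings will vary by machine):

import time
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')

# Compare tokenizer-only pattern creation against the full pipeline.
for name, make in [('make_doc', nlp.make_doc), ('full nlp', nlp)]:
    start = time.time()
    docs = [make(str(i)) for i in range(100000)]
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add('TEST', None, *docs)
    print(name, '%.1fs' % (time.time() - start))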

Looking through the EntityRuler code, it uses the PhraseMatcher internally and should therefore be similar in speed, but it processes each pattern with nlp(pattern), running the full pipeline (tagger, parser, NER), instead of using nlp.make_doc(pattern) as recommended for the PhraseMatcher in https://spacy.io/usage/rule-based-matching#phrasematcher
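
The difference is easy to demonstrate: nlp.make_doc() only runs the tokenizer, while nlp() runs every component in the pipeline (a quick illustration using the spaCy 2.x Doc flags):

import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner']

doc = nlp.make_doc('New York')       # tokenizer only
print(doc.is_tagged, doc.is_parsed)  # False False

doc = nlp('New York')                # full pipeline
print(doc.is_tagged, doc.is_parsed)  # True True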

Since the EntityRuler calls add_patterns() when deserializing, this also slows down from_bytes() and from_disk() by a considerable amount.
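
To illustrate, a round trip like this (a hypothetical sketch; 'patterns.jsonl' is a made-up filename) re-processes every phrase pattern on load and therefore inherits the same slowdown:

from spacy.pipeline import EntityRuler

# Saving is cheap: the patterns are written out as JSONL.
entityruler.to_disk('patterns.jsonl')

# Loading is slow: from_disk() calls add_patterns(), which runs
# self.nlp(pattern) over every phrase pattern again.
new_ruler = EntityRuler(nlp).from_disk('patterns.jsonl')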

Fortunately this is an easy fix: change the offending line (currently line 187 of https://github.com/explosion/spaCy/blob/master/spacy/pipeline/entityruler.py) from

self.phrase_patterns[label].append(self.nlp(pattern))

to

self.phrase_patterns[label].append(self.nlp.make_doc(pattern))

This brings pattern-adding and loading times in line with the PhraseMatcher, as expected, and doesn't appear to break anything.
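
Until a fix lands, one workaround (my own sketch, using the spaCy 2.x disable_pipes API) is to temporarily disable the pipeline components while adding patterns, so the EntityRuler's internal self.nlp(pattern) calls effectively reduce to tokenization:

# Temporarily remove tagger, parser, and ner so nlp(pattern) only tokenizes.
with nlp.disable_pipes('tagger', 'parser', 'ner'):
    entityruler.add_patterns(patterns)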

Your Environment

  • spaCy version: 2.1.8
  • Platform: Amazon Linux AMI 2018.03
  • Python version: 3.6.5

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 27 (18 by maintainers)

Most upvoted comments

This will break the EntityRuler if the phrase_matcher_attr is set to a value that requires more processing than tokenization. LEMMA, POS, TAG, and DEP are supported by the PhraseMatcher. (I can’t imagine a sensible entity DEP pattern, but I’m sure there are some creative patterns out there.)

Could it make sense to use nlp.make_doc() except when phrase_matcher_attr is in (LEMMA, POS, TAG, DEP), or am I missing something that this might break?
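
As a sketch of that proposal (hypothetical code, not the actual patch), the pattern-adding branch could look something like:

# Attributes like LEMMA, POS, TAG, and DEP require pipeline annotations,
# so only those cases need the full pipeline; everything else can use
# the tokenizer-only nlp.make_doc().
if self.phrase_matcher_attr in ('LEMMA', 'POS', 'TAG', 'DEP'):
    pattern_doc = self.nlp(pattern)
else:
    pattern_doc = self.nlp.make_doc(pattern)
self.phrase_patterns[label].append(pattern_doc)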