spaCy: Adding patterns to EntityRuler and deserializing EntityRuler very slow
Adding a large number of patterns to the EntityRuler, or loading a saved EntityRuler that has a large number of patterns, is extremely slow.
This is easily reproduced with:
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm')
entityruler = EntityRuler(nlp)
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
entityruler.add_patterns(patterns)
and
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
patterns = [nlp.make_doc(str(i)) for i in range(1000000)]
phrasematcher = PhraseMatcher(nlp.vocab)
phrasematcher.add("TEST", None, *patterns)
The EntityRuler snippet (100,000 patterns) takes around 10 minutes to execute on an m5.4xlarge AWS SageMaker Notebook instance, while the PhraseMatcher snippet (1,000,000 patterns) takes about 20 seconds. Changing nlp.make_doc(str(i)) to nlp(str(i)) slows the PhraseMatcher down to the speed of the EntityRuler.
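A minimal timing sketch to reproduce the gap between the two factories (assuming en_core_web_sm is installed; the helper name time_patterns is just illustrative, and absolute numbers will vary by machine):

import time
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')

def time_patterns(make, n=10000, label=""):
    # Build n single-token pattern docs with the given factory and report the elapsed time
    start = time.perf_counter()
    docs = [make(str(i)) for i in range(n)]
    print("%s: %.2fs for %d docs" % (label, time.perf_counter() - start, n))
    return docs

fast = time_patterns(nlp.make_doc, label="nlp.make_doc")  # tokenizer only
slow = time_patterns(nlp, label="nlp")                    # full pipeline (tagger, parser, ner)

phrasematcher = PhraseMatcher(nlp.vocab)
phrasematcher.add("TEST", None, *fast)  # spaCy 2.x add() signature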
Looking through the EntityRuler code, it uses a PhraseMatcher internally and should therefore be similar in speed, but it builds its pattern docs with nlp(pattern), running the full pipeline (tagger, parser, ner), instead of nlp.make_doc(pattern) as recommended for the PhraseMatcher in https://spacy.io/usage/rule-based-matching#phrasematcher
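For context, nlp.make_doc() runs only the tokenizer, while calling nlp() additionally runs every component in the pipeline. A quick check (is_tagged and is_parsed are spaCy 2.x Doc attributes, matching the environment below):

import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner']

doc_fast = nlp.make_doc("Apple is buying a startup")  # tokenizer only
doc_slow = nlp("Apple is buying a startup")           # tokenizer + tagger + parser + ner

print(doc_fast.is_tagged, doc_fast.is_parsed)  # False False
print(doc_slow.is_tagged, doc_slow.is_parsed)  # True True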
Since EntityRuler calls add_patterns() when deserializing, this also slows down from_bytes() and from_disk() by a considerable amount.
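Until the line is fixed, a possible workaround is to disable the expensive components while adding (or loading) patterns; a sketch, assuming the default ORTH-based phrase_matcher_attr, using spaCy 2.x's nlp.disable_pipes() context manager:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm')
entityruler = EntityRuler(nlp)
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]

# With tagger/parser/ner disabled, the nlp(pattern) call inside add_patterns()
# reduces to tokenization, so pattern adding runs at PhraseMatcher speed.
with nlp.disable_pipes('tagger', 'parser', 'ner'):
    entityruler.add_patterns(patterns)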
Fortunately, this is an easy fix: just change the offending line.
Current Line 187 in https://github.com/explosion/spaCy/blob/master/spacy/pipeline/entityruler.py:
self.phrase_patterns[label].append(self.nlp(pattern))
Updated Line 187:
self.phrase_patterns[label].append(self.nlp.make_doc(pattern))
This puts pattern adding and loading times in line with the PhraseMatcher as expected and doesn’t appear to break anything.
Your Environment
- spaCy version: 2.1.8
- Platform: Amazon Linux AMI 2018.03
- Python version: 3.6.5
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 27 (18 by maintainers)
This will break the EntityRuler if phrase_matcher_attr is set to a value that requires more processing than tokenization. LEMMA, POS, TAG, and DEP are supported by the PhraseMatcher. (I can’t imagine a sensible entity DEP pattern, but I’m sure there are some creative patterns out there.)

Could it make sense to use nlp.make_doc() except for phrase_matcher_attr in (LEMMA, POS, TAG, DEP), or am I missing something this might break?
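A sketch of what that conditional might look like at the same spot in add_patterns() (hypothetical, with attribute names shown as strings for readability; not an actual patch):

# Hypothetical: fall back to the full pipeline only when the matcher
# attribute needs annotations beyond tokenization.
if self.phrase_matcher_attr in ("LEMMA", "POS", "TAG", "DEP"):
    pattern_doc = self.nlp(pattern)           # needs tagger/parser output
else:
    pattern_doc = self.nlp.make_doc(pattern)  # tokenizer only
self.phrase_patterns[label].append(pattern_doc)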