spaCy: 5-6x slower performance between v2 and v3

Regression

With Spacy 2 we’ve experienced 5-6 times faster processing of documents, with same model “size” (not transformer).

With `nlp(text)`

In Spacy 2: For text of length 789417 (~780KB) it takes ~56 seconds for full parse With pipeline = [‘tagger’, ‘parser’, ‘ner’]
In Spacy 3: For text of length 789417 (~780KB) it takes ~303 seconds for full parse With pipeline = [‘tok2vec’, ‘tagger’, ‘parser’, ‘attribute_ruler’, ‘lemmatizer’, ‘ner’]

With `nlp.pipe([chunks)]` and 16 chunks

In Spacy 2: For text of length 789417 (~780KB) it takes ~38 seconds for full parse
In Spacy 3: For text of length 789417 (~780KB) it takes ~203 seconds for full parse

How to reproduce the behaviour

Created a Colab with dropdown to choose Spacy ❤️ or >3 (Please choose “restart runtime and run” after switching between spacy versions) https://colab.research.google.com/drive/10XUpSQ6ir5-DttWJsUN5SjjVQfCDmglN?usp=sharing

I’ve plot the analyze time as relative to text length…

(ratio is attached to right axis, so it goes 2->3->4->5+)

#@title Spacy 3 vs 2: benchmark large text

#@markdown **1. Download a rather large document**

import urllib
import urllib.request
large_text_file_url = "https://www.gutenberg.org/files/1342/1342-0.txt"

urllib.request.urlretrieve(large_text_file_url, "large.txt")  # download
large_txt = urllib.request.urlopen(large_text_file_url).read().decode('utf-8')

print(f"Length of large_txt={len(large_txt)}")
text_to_parse_dropdown = "large" #@param ["large", "small"]
text_to_parse = large_txt if text_to_parse_dropdown == "large" else "Hello"
print(f"Length of text_to_parse={len(text_to_parse)}")

#@markdown **2. Install spacy, download model, import & load model**
spacy_version = "spacy\u003C3" #@param ["spacy\u003C3", "spacy>3"]
import os; os.environ["spacy_version"] = spacy_version
!pip install "$spacy_version"
!python -m spacy download en_core_web_md
import en_core_web_md
nlp = en_core_web_md.load()

#@markdown **3. Run full parse, benchmark**
import timeit
results = timeit.timeit(lambda: nlp(text_to_parse), number=3)
print(results)
print(f'For text of length {len(text_to_parse)} it took {results}')
results

Note: Disabling modules from Spacy 3 model doesn’t seem to help

(using disable = ['tok2vec', 'attribute_ruler', 'lemmatizer'] in colab)

Your Environment

Operating System: Ubuntu / WSL@Windows / Colab - all reproduce
Python Version Used: 2 / 3
spaCy Version Used: 2.3 / 3.2
Environment Information: Cloudy

About this issue

Original URL
State: closed
Created 3 years ago
Reactions: 2
Comments: 18 (5 by maintainers)

Commits related to this issue

Speed up the StateC::L feature function This function gets the n-th most-recent left-arc with a particular head. Before this change, StateC::L would construct a vector of all left-arcs with the given... — committed to danieldk/spaCy by danieldk 2 years ago
Speed up the StateC::L feature function This function gets the n-th most-recent left-arc with a particular head. Before this change, StateC::L would construct a vector of all left-arcs with the given... — committed to danieldk/spaCy by danieldk 2 years ago
Speed up the StateC::L feature function (#10019) * Speed up the StateC::L feature function This function gets the n-th most-recent left-arc with a particular head. Before this change, StateC::L w... — committed to explosion/spaCy by danieldk 2 years ago
Speed up the StateC::L feature function (#10019) * Speed up the StateC::L feature function This function gets the n-th most-recent left-arc with a particular head. Before this change, StateC::L w... — committed to danieldk/spaCy by danieldk 2 years ago
Use constant-time head lookups in StateC::{L,R} This change changes the type of left/right-arc collections from vector[ArcC] to unordered_map[int, vector[Arc]], so that the arcs are keyed by the head... — committed to danieldk/spaCy by danieldk 2 years ago
Use constant-time head lookups in StateC::{L,R} This change changes the type of left/right-arc collections from vector[ArcC] to unordered_map[int, vector[Arc]], so that the arcs are keyed by the head... — committed to danieldk/spaCy by danieldk 2 years ago
Speed up the StateC::L feature function (#10019) * Speed up the StateC::L feature function This function gets the n-th most-recent left-arc with a particular head. Before this change, StateC::L w... — committed to polm/spaCy by danieldk 2 years ago

Most upvoted comments

Fixed by #10048.

danieldk on Jan 20, 2022

Sorry for the delay. We are looking into this, but the holidays have delayed this a bit, but it’s definitely on our radar.

danieldk on Jan 10, 2022