spaCy: v2 standard pipeline running 10x slower

Your Environment

Info about spaCy

  • Python version: 2.7.13
  • Platform: Linux-4.10.0-38-generic-x86_64-with-debian-stretch-sid
  • spaCy version: 2.0.0
  • Models: en

I just updated to v2.0. I'm not sure what changed, but the exact same pipeline of documents, processed in the standard nlp = spacy.load('en'); nlp(u"string") way, is now 10x slower.
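
For reference, a minimal timing sketch of the calling pattern described above (the texts and counts are illustrative, not taken from the original report):

from timeit import default_timer as timer
import spacy

nlp = spacy.load('en')
texts = [u"This is one of many short documents in the pipeline."] * 1000

start = timer()
for text in texts:
    doc = nlp(text)
elapsed = timer() - start
# Compare this rate between v1.x and v2.0 to reproduce the slow-down.
print("%.1f docs/sec" % (len(texts) / elapsed))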

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 22
  • Comments: 56 (26 by maintainers)

Most upvoted comments

I think we can finally close this šŸŽ‰

spacy-nightly now ships with matrix multiplications provided by our Blis package, in single-thread mode. I’ve also performed extensive hyper-parameter search to make the models smaller, and made sure that Pickle is working again.

You should be seeing between 7000 and 9000 words per second per thread. With the pipeline properly single-threaded, you should now find it easy to use multi-processing to speed up your execution.

The best way to tell which Blas your numpy is linked to is to do:

ldd `python -c "import numpy.core; print(numpy.core.multiarray.__file__)"`

Example output:

ldd `python -c "import numpy.core; print(numpy.core.multiarray.__file__)"`
    linux-vdso.so.1 =>  (0x00007ffdb0d21000)
    libmkl_rt.so => /home/matt/.pyenv/versions/miniconda-latest/lib/python2.7/site-packages/numpy/core/../../../../libmkl_rt.so (0x00007fbd0ad10000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbd0aad3000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbd0a7ca000)
    libpython2.7.so.1.0 => /home/matt/.pyenv/versions/miniconda-latest/lib/python2.7/site-packages/numpy/core/../../../../libpython2.7.so.1.0 (0x00007fbd0a3ed000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbd0a026000)
    /lib64/ld-linux-x86-64.so.2 (0x000055cfc12ea000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbd09e20000)
    libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fbd09c1d000)

That installation has numpy linked against MKL. When numpy is linked against OpenBLAS, it looks like this:

ldd `python -c "import numpy.core; print(numpy.core.multiarray.__file__)"`
    linux-vdso.so.1 =>  (0x00007fff395b8000)
    libopenblasp-r0-39a31c03.2.18.so => /home/matt/repos/spacy-v2/.env/lib/python2.7/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so (0x00007f339ec1e000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f339e8f6000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f339e6d8000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f339e311000)
    /lib64/ld-linux-x86-64.so.2 (0x000056361421d000)
    libgfortran-ed201abd.so.3.0.0 => /home/matt/repos/spacy-v2/.env/lib/python2.7/site-packages/numpy/core/../.libs/libgfortran-ed201abd.so.3.0.0 (0x00007f339e017000)

Of note here is that the OpenBLAS version is 2.18, which is the version all my machines get by default…But this version has a significant bug that causes quite poor performance 😦.
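
A rough cross-check from inside Python (this reports the BLAS numpy was built against, so the ldd output above remains the more reliable runtime check):

import numpy
# Prints sections like blas_opt_info / lapack_opt_info; look for 'mkl',
# 'openblas' or 'blis' in the library names.
numpy.show_config()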

To check whether your numpy’s Blas is basically decent, you can benchmark it against a wrapper I made for the Blis linear algebra library:

pip install blis
python -m blis.benchmark

When I’m linked against the bad OpenBLAS, I get this:

Setting up data nO=384 nI=384 batch_size=2000
Blis...
11032014.6484
7.62 seconds
Numpy...
11032016.6016
18.82 seconds

When I’m linked against MKL, I get this:

Setting up data nO=384 nI=384 batch_size=2000
Blis...
11032014.6484
7.17 seconds
Numpy...
11032011.7188
4.01 seconds

Correctly compiled versions of OpenBLAS give similar performance, but compiling OpenBLAS correctly is a total pain in the ass. Unfortunately, running conda is often hugely inconvenient too! I've still not managed to get my servers automatically deploying into conda environments.

You might be able to tell I’ve spent a lot of time on this…I wish I could say I’ve come to a better conclusion. The good news is that it won’t be long before the bug in OpenBLAS 2.18 rolls past, and we’re all getting better CPU neural network performance by default. I believe that this unfortunate problem has contributed to a lot of NLP researchers greatly over-estimating the performance advantage of GPUs for their use-case.

For now, my best advice is as follows:

  1. Run the Blis benchmark, and see if your numpy performance looks okay.
  2. If not, try to get numpy through conda, to get it linked to MKL. Then install spacy via pip in the same environment.
  3. Try to use multi-processing to work your cores, instead of multi-threading (a minimal sketch follows this list). I want to switch nlp.pipe() to multi-processing by default — it should give much better performance. Multi-threading is currently only used during the matrix multiplications, which only helps if the matrices are large.
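
As a minimal sketch of point 3, assuming each worker process loads its own copy of the model and only plain Python values cross process boundaries (the helper functions and chunking here are illustrative, not spaCy API):

from multiprocessing import Pool

import spacy

def process_chunk(texts):
    # Each worker process loads its own copy of the pipeline.
    nlp = spacy.load('en')
    return sum(len(doc) for doc in nlp.pipe(texts))

def parallel_word_count(texts, n_procs=2):
    # One roughly equal chunk of texts per process; only strings and ints
    # are pickled between processes, never Doc objects.
    chunk_size = len(texts) // n_procs + 1
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    pool = Pool(n_procs)
    counts = pool.map(process_chunk, chunks)
    pool.close()
    pool.join()
    return sum(counts)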

Just to note, I'm experiencing the exact same issue. Large numbers of very small documents passed to nlp.pipe are much (more than 10x) slower than the same documents passed to nlp individually.

I confirm the slow-down. Python 3.5 here, 64-bit Linux, numpy==1.13.3, libopenblasp-r0-39a31c03.2.18.so

$ python benchmark.py  tagger,parser
components= ['tagger', 'parser'] disabled= ['ner'] use_pipe= 0 wps= 1171.5999451701907 words= 56068 seconds= 47.855925763000414

$ python benchmark.py -p tagger,parser 
components= ['tagger', 'parser'] disabled= ['ner'] use_pipe= 1 wps= 5499.778332879291 words= 56068 seconds= 10.194592692001606

This is a practical issue for many use cases, including implementing web services for relatively short documents (e.g. chat messages).

I’ve attached a benchmark script for running the pipeline on some IMDB text using multi-processing with joblib. With en_core_web_sm using the single-threaded Thinc wheel I posted, I’m seeing:

j=1 6,500 wps
j=2 10,103 wps
j=3  11,133 wps

My machine is allegedly 4 core, but as you can see performance is pretty terrible with more than two active processes. I wonder whether I have power management throttling me or something. I’ve confirmed the same behaviour applies if I simply run the command twice…

import plac
from timeit import default_timer as timer
import spacy
import joblib
from toolz import partition_all
# Assumption: imdb() below is thinc's bundled IMDB loader; the original
# snippet doesn't show where it comes from.
from thinc.extra.datasets import imdb

def count_words(model_loc, texts):
    # Each worker process loads its own copy of the pipeline.
    nlp = spacy.load(model_loc)
    n_words = 0
    for doc in nlp.pipe(texts):
        n_words += len(doc)
    return n_words

@plac.annotations(
    model=("spaCy model to load (path or name)", "positional", None, str),
    n_jobs=("Number of processes", "option", "j", int)
)
def main(model, n_jobs=2):
    train, test = imdb()
    texts = [text for (text, label) in train[:10000]]
    # One chunk of texts per process.
    chunk_size = int(len(texts) / n_jobs)
    begin = timer()
    partitions = partition_all(chunk_size, texts)
    # joblib runs count_words in separate worker processes.
    executor = joblib.Parallel(n_jobs=n_jobs)
    do = joblib.delayed(count_words)
    tasks = (do(model, batch) for batch in partitions)
    word_counts = executor(tasks)
    end = timer()
    n_words = sum(word_counts)
    print(n_words, end - begin, n_words / (end - begin))

if __name__ == '__main__':
    plac.call(main)
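
The script takes the model name or path as a positional argument and the number of processes via -j, e.g. python benchmark_pipe.py en_core_web_sm -j 2 (the filename here is made up).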

Okay new plan:

  • Bring our own BLAS (or specifically, gemm)
  • Our gemm is single-threaded.
  • Implement for low latency per process (i.e. don’t assume large batches)
  • Rely on multi-processing for parallelisation

The models fit well in memory anyway, so one copy of the model per processor isn't so bad. And now that wheels and conda are more standard, it's easier to bring our own gemm. If necessary we can just statically link OpenBLAS; it's pretty small when compiled single-threaded, without Lapack or Fortran.

I think this approach has the best chance of always doing sensible things. It means the same story applies to all the units in the library, rather than just the ones we happen to multi-thread.

Multi-threading is also only really attractive when the threads would otherwise sit idle. In production settings that’s really not the case. Even if your machine is large, you’d rather just run more containers.

I did some benchmarking. I don't know if this provides you with any clues or just tells you what you already know. Here's the set-up and the results.

One observation: no matter how it's configured, the python3 process spends about half its time in the kernel! That must be a clue to the underlying problem.

More detail. I was wondering why it was spending so much time in the kernel, so I ran strace on the python3 process. What I see is it reading my text files, calling futex() eleven times (FUTEX_WAKE_OP_PRIVATE and FUTEX_WAKE_PRIVATE), and then just hundreds of calls to sched_yield(). This could lead to excessive context switches. Using /usr/bin/time, I saw an average of 8800 context switches per second for this process. Seems high? I’m attaching some sample output from strace. spacy-trace.txt

I also tried OpenBLAS 0.2.20, but it didn’t change anything.

System:

CPU: 2013 Intel Core i7-4770R @ 3.20 GHz x 4 ("Haswell-H" with 6 MB L3 cache and 128 MB L4 cache)
RAM: 16 GB
OS: Linux Mint 18.3, kernel 4.13.0 64-bit
Python: 3.5.2
Numpy BLAS: OpenBLAS 2.18
python3 -m blis.benchmark
	Setting up data nO=384 nI=384 batch_size=2000
	Blis...
	11032014.6484375
	6.90 seconds
	Numpy...
	11032015.625
	3.95 seconds
SpaCy: 2.0.5
Model: en_core_web_sm version 2.0.0
Text: Reuters financial news from 1987 in English as plain text files, one file per article.

The code:

import nltk
from nltk.corpus import reuters
import spacy

lang_model = spacy.load('en')

The code, serial version:

for fileid in nltk.corpus.reuters.fileids():
    doc = lang_model(reuters.raw(fileid))
    print("\n============================================")
    for entity_mention in doc.ents:
        print(entity_mention.text)

The code, multi-threaded pipe version:

text_iterator = (reuters.raw(fileid) for fileid in nltk.corpus.reuters.fileids())
for doc in lang_model.pipe(text_iterator, batch_size=50, n_threads=-1):
    print("\n============================================")
    for entity_mention in doc.ents:
        print(entity_mention.text)

Some measurements:

Articles: 10,788
Characters: 8,264,923 = 7.88 MB

Tokens found: 1,553,843
Sentences found: 70,376
Entities found: 213,626

Some calculations:

Characters/article: 766
Tokens/article: 144
Sentences/article: 6.5
Characters/Token: 5.3
Tokens/sentence: 22.1
Characters/sentence: 117

Entities/article: 19.8
Chars/entity: 39
Tokens/entity: 7.3
Entities/sentence: 3.0

Performance, tokens per second:

config  real   user   sys  %sys    TPS
serial   426   1627  1751    52   3646
t=1      281   1004   955    49   5522
t=2      279   1003   938    48   5576
t=4      283   1012   960    49   5486
t=-1     275    985   930    49   5640
t=8      270    970   919    49   5763

+1 on @gholmberg's https://github.com/explosion/spaCy/issues/1508#issuecomment-357395899 and @pengyu's https://github.com/explosion/spaCy/issues/1839#issuecomment-359329109 remarks regarding kernel time. Here's the htop output of a very simple nlp.pipe() command on an Amazon EC2 m5.2xlarge instance:

[htop screenshot]

You can see the main thread is 100% in user-land, but the remaining threads are >50% kernel-land. I’m seeing a 5% performance improvement going from 2x cores to 8x cores.

Still fast though 😃

The issue seems to be that the neural network model is much slower than the previous linear model. For about 15k documents with an average of 120 words, v1.9 is about 5x faster than the best combination of n_threads/batch_size for v2.0.5.

Is it possible to ship the older models with spaCy 2.0? I don't want to move back to 1.9 and lose out on the other improvements that have been made.

Blis benchmarks vs OpenBlas:

Setting up data nO=384 nI=384 batch_size=2000
Blis...
11032014.6484
11.69 seconds
Numpy...
11032015.625
11.46 seconds

FWIW, we have been running spaCy v1 on a server behind uWSGI/Flask, and on our (Linux/Mac) systems, copy-on-write seems to avoid multiplying memory usage when uWSGI forks the app (including the spaCy models). I.e. we run 2 or 4 forked processes but only incur the memory cost of one.

We haven’t done any detailed profiling, but I would expect that memory use eventually does increase somewhat, because of the growing vocabulary (from what I understand, this means affected memory pages will be duplicated across processes).

But it would be nice if this was a "feature" also of spaCy v2 (we haven't tested it yet, but this thread seems to suggest it may not be that easy).
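
A hedged sketch of the pattern described above (the route and app wiring are illustrative; it assumes the app module is imported in the uWSGI master before the workers are forked, i.e. lazy-apps is off):

import spacy
from flask import Flask, jsonify, request

# Loaded once at import time, before uWSGI forks the workers, so the model's
# memory pages are shared copy-on-write across processes.
nlp = spacy.load('en')
app = Flask(__name__)

@app.route('/ents', methods=['POST'])
def ents():
    doc = nlp(request.get_json()['text'])
    return jsonify(entities=[ent.text for ent in doc.ents])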

Btw…For some reassurance here: If we can’t get this sorted out, I could update the linear model code and just train new linear models for v2. It would just be a different parser class. So even if the problem proves difficult, the worst-case solution isn’t so bad.

I’d much rather figure out why the neural networks are getting uneven performance though. Too much choice is bad. If we add more options, people will have to run experiments to figure out what to use. It’s also really hard to make good trade-offs if everyone’s using different stuff.

Probably the best solution will involve writing more of the forward-pass in Cython, and training models with narrower hidden layers. At the moment the hidden layer sizes have been configured with the current overheads from making a bunch of Python calls in mind. If we eliminate those overheads, performing less computation in the matrix multiplications starts to become better value.

I just ran an experiment with nlp.pipe(), with batches of 10k documents, and 8/16 threads (I’m running a 32Gb/8 core setup). I’m seeing 700 documents per second, and each document has an average of 4.5 words. This translates to roughly 3k words per second.

I am now really missing the previous version… Is it due to different data structures, or changes in parallelization? It would be great to have the option of the faster, more memory-intensive processing.

For reference, I dropped back to 1.9 and tried the same nlp.pipe() approach. I am seeing 50% more memory usage in v1.9, but a definite 10x slow-down in v2.

I was curious as to whether this might have something to do with short documents. So, I tried passing in documents that were 10x longer (50 words). I see a big improvement in the v2 processing (twice as fast), but still 7-8x slower than v1.9. Memory ratios seem approximately equivalent under these scenarios.

If you switch to nlp.pipe(), you should get between 8,000 and 10,000 words per second. In comparison, the v1 models would be getting 15k-30k, but with much higher memory usage.
Overall the v2 models are cheaper to run, because you can use them on smaller instances. However, it's true that the speed is much worse if you call it document-by-document. To make using nlp.pipe() easier, there's now an as_tuples keyword argument that lets you pass in an iterator of (text, context) pairs, so you can get back an iterator of (doc, context) tuples.
Overall the v2 models are cheaper to run, because you can use them on smaller instances. However it’s true that the speed is much worse if you call it document-by-document. To make using nlp.pipe() easier, there’s now a as_tuples keyword argument, that lets you pass in an iterator of (text, context) pairs, so you can get back an iterator of (doc, context) tuples.