spaCy: v2 standard pipeline running 10x slower
Your Environment
Info about spaCy
- Python version: 2.7.13
- Platform: Linux-4.10.0-38-generic-x86_64-with-debian-stretch-sid
- spaCy version: 2.0.0
- Models: en
I just updated to v2.0. Not sure what changed, but the exact same pipeline of documents called in the standard nlp = spacy.load('en'); nlp(u"string") way is now 10x slower.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 22
- Comments: 56 (26 by maintainers)
Commits related to this issue
- Add note on stream processing to migration guide (see #1508) — committed to explosion/spaCy by ines 7 years ago
- Tuned for CentOS cluster — committed to usc-isi-i2/dig-text-similarity-search by Ljferrer 6 years ago
I think we can finally close this!
spacy-nightly now ships with matrix multiplications provided by our Blis package, in single-thread mode. I've also performed an extensive hyper-parameter search to make the models smaller, and made sure that Pickle is working again. You should be seeing between 7000 and 9000 words per second per thread. With the pipeline properly single-threaded, you should now find it easy to use multi-processing to speed up your execution.
The best way to tell which Blas your numpy is linked to is to do:
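(The original command isn't shown here; a minimal sketch, assuming a numpy recent enough to expose numpy.show_config(), is below.)

```python
# Print numpy's build configuration, including which BLAS/LAPACK libraries
# it was linked against (look for "mkl" or "openblas" in the output).
import numpy
numpy.show_config()
```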
Example output:
That installation has numpy linked against MKL. When numpy is linked against OpenBLAS, it looks like this:
Of note here is that the OpenBLAS version is 2.18, which is the version all my machines get by default… But this version has a significant bug that causes quite poor performance.
To check whether your numpy's Blas is basically decent, you can benchmark it against a wrapper I made for the Blis linear algebra library:
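The benchmark script itself isn't reproduced in this thread; the following is a rough sketch of that kind of comparison, assuming the blis package (pip install blis) and its blis.py.gemm wrapper, using the matrix sizes quoted later in the thread (nO=384, nI=384, batch_size=2000):

```python
# Compare numpy's BLAS-backed matrix multiplication against the Blis wrapper.
# Sizes and iteration count are illustrative; adjust to taste.
import time

import numpy

try:
    from blis.py import gemm  # assumption: the cython-blis Python wrapper is installed
except ImportError:
    gemm = None

nO, nI, batch_size = 384, 384, 2000
A = numpy.random.uniform(-1, 1, (batch_size, nI)).astype('float32')
B = numpy.random.uniform(-1, 1, (nI, nO)).astype('float32')


def bench(name, matmul, iters=1000):
    # Accumulate a value from each product so the work can't be skipped.
    total = 0.0
    start = time.time()
    for _ in range(iters):
        total += matmul(A, B)[0, 0]
    print('%s... %s %.2f seconds' % (name, total, time.time() - start))


bench('Numpy', numpy.dot)
if gemm is not None:
    bench('Blis', gemm)
```

If numpy comes out far slower than Blis on the same sizes, its BLAS is probably one of the problematic builds discussed above.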
When I'm linked against the bad OpenBLAS, I get this:
When I'm linked against MKL, I get this:
Correctly compiled versions of OpenBLAS give similar performance, but getting OpenBLAS compiled correctly is unfortunately a total pain in the ass. Running conda is often hugely inconvenient too! I've still not managed to get my servers automatically deploying into conda environments.
You might be able to tell I've spent a lot of time on this… I wish I could say I've come to a better conclusion. The good news is that it won't be long before the bug in OpenBLAS 2.18 rolls past, and we're all getting better CPU neural network performance by default. I believe that this unfortunate problem has contributed to a lot of NLP researchers greatly over-estimating the performance advantage of GPUs for their use-case.
For now, my best advice is as follows:
- Switch from nlp.pipe() to multi-processing by default; it should give much better performance. Multi-threading is currently only used during the matrix multiplications, which only works if the matrices are large.

Just to note, I'm experiencing the exact same issue. Large numbers of very small documents passed to nlp.pipe() are much (more than 10x) slower than those same documents passed to nlp individually.
I confirm the slow-down. Python 3.5 here, 64-bit Linux, numpy==1.13.3, libopenblasp-r0-39a31c03.2.18.so
This is a practical issue for many use cases, including implementing web services for relatively short documents (e.g. chat messages).
I've attached a benchmark script for running the pipeline on some IMDB text using multi-processing with joblib. With en_core_web_sm using the single-threaded Thinc wheel I posted, I'm seeing:

My machine is allegedly 4-core, but as you can see, performance is pretty terrible with more than two active processes. I wonder whether I have power management throttling me or something. I've confirmed the same behaviour applies if I simply run the command twice…
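The attached script isn't reproduced here; a minimal sketch of that kind of joblib multi-processing benchmark, with placeholder texts and chunk sizes, might look like this:

```python
# Partition the texts, load one model copy per worker process, and run
# nlp.pipe() inside each process. Data and model name are illustrative.
from __future__ import print_function, unicode_literals

import spacy
from joblib import Parallel, delayed


def process_chunk(model_name, texts):
    nlp = spacy.load(model_name)  # one model copy per worker
    return sum(len(doc) for doc in nlp.pipe(texts, batch_size=1000))


def partition(items, n):
    return [items[i::n] for i in range(n)]


if __name__ == '__main__':
    texts = [u"This is a stand-in for an IMDB review."] * 20000  # placeholder data
    n_jobs = 4
    token_counts = Parallel(n_jobs=n_jobs)(
        delayed(process_chunk)('en_core_web_sm', chunk)
        for chunk in partition(texts, n_jobs))
    print('%d tokens processed across %d workers' % (sum(token_counts), n_jobs))
```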
Okay new plan:
The models fit well in memory anyway, so one copy of the model per processor isn't so bad. And now that wheels and conda are more standard, it's easier to bring our own gemm. If necessary we can just statically link OpenBLAS. It's pretty small when compiled without Lapack or Fortran, for single-threaded use.
I think this approach has the best chance of always doing sensible things. It means the same story applies to all the units in the library, rather than just the ones we happen to multi-thread.
Multi-threading is also only really attractive when the threads would otherwise sit idle. In production settings that's really not the case. Even if your machine is large, you'd rather just run more containers.
I did some benchmarking. I don't know if this provides you with any clues or just tells you what you already know. Here's the set-up and the results.
One observation: no matter how it is configured, the python3 process spends about half its time in the kernel! That must be a clue to the underlying problem.
More detail. I was wondering why it was spending so much time in the kernel, so I ran strace on the python3 process. What I see is it reading my text files, calling futex() eleven times (FUTEX_WAKE_OP_PRIVATE and FUTEX_WAKE_PRIVATE), and then just hundreds of calls to sched_yield(). This could lead to excessive context switches. Using /usr/bin/time, I saw an average of 8800 context switches per second for this process. Seems high? I'm attaching some sample output from strace: spacy-trace.txt
I also tried OpenBLAS 0.2.20, but it didnāt change anything.
System:
The code:
The code, serial version:
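(The original script isn't reproduced here; a hypothetical reconstruction of the serial benchmark, with a placeholder input file, is below.)

```python
# Serial benchmark: call nlp() on one document at a time and report throughput.
# 'texts.txt' stands in for whatever corpus was actually used.
import io
import time

import spacy

nlp = spacy.load('en_core_web_sm')

with io.open('texts.txt', encoding='utf8') as f:
    texts = [line.strip() for line in f if line.strip()]

start = time.time()
n_tokens = 0
for text in texts:
    doc = nlp(text)
    n_tokens += len(doc)
elapsed = time.time() - start
print('%d tokens in %.2f s (%.0f tokens/s)' % (n_tokens, elapsed, n_tokens / elapsed))
```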
The code, multi-threaded pipe version:
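(Again a hypothetical reconstruction, this time streaming the same texts through nlp.pipe() with the v2.0-era n_threads/batch_size arguments.)

```python
# Pipe benchmark: stream the texts through nlp.pipe() and report throughput.
import io
import time

import spacy

nlp = spacy.load('en_core_web_sm')

with io.open('texts.txt', encoding='utf8') as f:
    texts = [line.strip() for line in f if line.strip()]

start = time.time()
n_tokens = 0
for doc in nlp.pipe(texts, n_threads=8, batch_size=1000):
    n_tokens += len(doc)
elapsed = time.time() - start
print('%d tokens in %.2f s (%.0f tokens/s)' % (n_tokens, elapsed, n_tokens / elapsed))
```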
Some measurements:
Some calculations:
Performance, tokens per second:
+1 on @gholmberg's https://github.com/explosion/spaCy/issues/1508#issuecomment-357395899 and @pengyu's https://github.com/explosion/spaCy/issues/1839#issuecomment-359329109 remarks regarding kernel time. Here's the htop output of a very simple nlp.pipe() command on an Amazon EC2 m5.2xlarge instance:

You can see the main thread is 100% in user-land, but the remaining threads are >50% kernel-land. I'm seeing a 5% performance improvement going from 2x cores to 8x cores.
Still fast though!
The issue seems to be that the neural network model is much slower than the previous linear model. For about 15k documents with an average of 120 words, v1.9 is about 5x faster than the best combination of n_threads/batch_size for v2.0.5.
Is it possible to ship the older models with spaCy 2.0? I don't want to move back to 1.9 and lose out on the other improvements that have been made.
Blis benchmarks vs OpenBLAS
Setting up data: nO=384 nI=384 batch_size=2000
Blis…  11032014.6484  11.69 seconds
Numpy… 11032015.625   11.46 seconds
FWIW, we have been running spaCy v1 on a server behind uWSGI/Flask, and on our (Linux/Mac) systems, copy-on-write seems to avoid multiplying the memory when uWSGI forks the app (including the spaCy models). I.e. we run 2 or 4 forked processes but only incur the memory cost of one.
We havenāt done any detailed profiling, but I would expect that memory use eventually does increase somewhat, because of the growing vocabulary (from what I understand, this means affected memory pages will be duplicated across processes).
But it would be nice if this was a "feature" also of spaCy v2 (we haven't tested it yet, but this thread seems to suggest it may not be that easy).
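A minimal sketch of that kind of preforking setup, assuming Flask behind uWSGI with the default fork-after-import behaviour (route, model name, and response shape are illustrative, not from the original post):

```python
# Load the model at import time, so a preforking server (e.g. uWSGI without
# lazy-apps) forks after loading and the workers share the model's memory
# pages copy-on-write.
import spacy
from flask import Flask, jsonify, request

nlp = spacy.load('en_core_web_sm')  # loaded once, before the workers are forked

app = Flask(__name__)


@app.route('/parse', methods=['POST'])
def parse():
    text = request.get_json(force=True).get('text', u'')
    doc = nlp(text)
    return jsonify({'tokens': [{'text': t.text, 'pos': t.pos_, 'dep': t.dep_}
                               for t in doc]})
```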
Btw… For some reassurance here: if we can't get this sorted out, I could update the linear model code and just train new linear models for v2. It would just be a different parser class. So even if the problem proves difficult, the worst-case solution isn't so bad.
I'd much rather figure out why the neural networks are getting uneven performance, though. Too much choice is bad. If we add more options, people will have to run experiments to figure out what to use. It's also really hard to make good trade-offs if everyone's using different stuff.
Probably the best solution will involve writing more of the forward-pass in Cython, and training models with narrower hidden layers. At the moment the hidden layer sizes have been configured with the current overheads from making a bunch of Python calls in mind. If we eliminate those overheads, performing less computation in the matrix multiplications starts to become better value.
I just ran an experiment with nlp.pipe(), with batches of 10k documents, and 8/16 threads (I'm running a 32 GB / 8 core setup). I'm seeing 700 documents per second, and each document has an average of 4.5 words. This translates to roughly 3k words per second.

I am now really missing the previous version… Is it different data structures, or improvements in parallelization? It would be great to have the option of the faster, more memory-intensive processing.
For reference, I dropped back to 1.9 and tried the same nlp.pipe() approach. I am seeing 50% more memory usage in v1.9, but a definitive 10x slow-down in v2.

I was curious as to whether this might have something to do with short documents. So, I tried passing in documents that were 10x longer (50 words). I see a big improvement in the v2 processing (twice as fast), but it is still 7-8x slower than v1.9. Memory ratios seem approximately equivalent under these scenarios.
If you switch to nlp.pipe(), you should get between 8,000 and 10,000 words per second. In comparison, the v1 models would be getting 15k-30k, but with much higher memory usage.

Overall the v2 models are cheaper to run, because you can use them on smaller instances. However, it's true that the speed is much worse if you call it document-by-document. To make using nlp.pipe() easier, there's now an as_tuples keyword argument that lets you pass in an iterator of (text, context) pairs, so you get back an iterator of (doc, context) tuples.
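A minimal illustration of the as_tuples usage described above, with made-up message IDs as context:

```python
# Pass (text, context) pairs to nlp.pipe() and get back (doc, context) pairs.
import spacy

nlp = spacy.load('en_core_web_sm')

data = [
    (u"This is the first chat message.", {'message_id': 1}),
    (u"And here is another one.", {'message_id': 2}),
]

for doc, context in nlp.pipe(data, as_tuples=True, batch_size=1000):
    print('%s %d %s' % (context['message_id'], len(doc), doc[0].pos_))
```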