scrapy: Scrapy under Python 3 is slower than under Python 2

The bookworm benchmark from https://github.com/scrapy/scrapy-bench/ (see also https://medium.com/@vermaparth/parth-gsoc-f5556ffa4025) shows about a 15% slowdown, while the more synthetic scrapy bench command shows a 2x slowdown: https://github.com/scrapy/scrapy/pull/3050#issuecomment-353863711

About this issue

  • State: open
  • Created 6 years ago
  • Comments: 22 (22 by maintainers)

Most upvoted comments

I think I would start by comparing profiles (that is, running under a profiler) under Py2.7 and Py3.6 for several benchmarks and trying to spot where most of the difference comes from. For benchmarks, I think it makes sense to check several of them, because some might show more difference and some will be easier to analyze. For profilers, if you already have a preference then go with it. If not, I would suggest using the built-in cProfile with some visualization backend (e.g. snakeviz), and vmprof + vmprof.com for visualization - it’s good to have several different profilers because it lets you cross-check profiling results.
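
For reference, a minimal sketch of such a cProfile run; run_bookworm() is a hypothetical stand-in for the actual benchmark entry point:

```python
import cProfile
import pstats

def run_bookworm():
    pass  # hypothetical stand-in for the scrapy-bench/bookworm entry point

# Profile one benchmark run and dump the raw stats to a file.
cProfile.run("run_bookworm()", "bench.prof")

# Print the 20 most expensive entries by cumulative time; the same
# file can be explored visually with snakeviz ("snakeviz bench.prof").
pstats.Stats("bench.prof").sort_stats("cumulative").print_stats(20)
```

Running the same script under Python 2.7 and 3.6 gives two stats files whose hot paths can then be compared side by side.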

I meant that the difference in pages/min is 2x (4320 vs 2040), while the difference in total pages crawled is about 30% (653 vs 421).

I tried running scrapy bench again and again, and the results are really unpredictable. However, the speed difference between Python 2 and Python 3 is still really high (about 30%-40%). But then again, the result I copied from the last line of the log is pretty weird. So I added a reporting line which divides the pages crawled by the elapsed time, then I ran the benchmarker again and got this result:

Python2:

2018-03-13 02:11:58 [scrapy.extensions.logstats] INFO: Crawled 605 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
('the result is', 621, 'pages and', 3542.474324191734, 'pages/min over', datetime.timedelta(0, 10, 518072))

Python3:

2018-03-13 02:13:38 [scrapy.extensions.logstats] INFO: Crawled 405 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
the result is 420 pages and 2325.8044029876987 pages/min over 0:00:10.834961
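
For context, the reporting line described above was presumably something like this minimal sketch (the numbers line up: 621 pages over ~10.52 s is 3542.47 pages/min, and under Python 2 the multi-argument print renders as a tuple, which explains the parenthesized output above):

```python
import datetime

def report(pages_crawled, start_time):
    # Overall rate: pages crawled divided by elapsed time, instead of
    # trusting the last per-interval logstats line.
    elapsed = datetime.datetime.now() - start_time
    pages_per_min = pages_crawled / elapsed.total_seconds() * 60
    print("the result is", pages_crawled, "pages and",
          pages_per_min, "pages/min over", elapsed)
```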

As you can see, the result is basically the same. But I noticed that the spider gets much slower over time. I’ll take the Python 3 log as an example:

2018-03-13 02:13:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:29 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 3600 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:30 [scrapy.extensions.logstats] INFO: Crawled 101 pages (at 2460 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:31 [scrapy.extensions.logstats] INFO: Crawled 156 pages (at 3300 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:32 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:33 [scrapy.extensions.logstats] INFO: Crawled 236 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:34 [scrapy.extensions.logstats] INFO: Crawled 269 pages (at 1980 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:35 [scrapy.extensions.logstats] INFO: Crawled 309 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:36 [scrapy.extensions.logstats] INFO: Crawled 340 pages (at 1860 pages/min), scraped 0 items (at 0 items/min)
2018-03-13 02:13:37 [scrapy.extensions.logstats] INFO: Crawled 373 pages (at 1980 pages/min), scraped 0 items (at 0 items/min)

This also happens on Python 2, but not to the same extent. I guess that’s where the difference comes from.
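
To quantify that drift, one could pull the per-interval rate out of the logstats lines; a minimal sketch, assuming the log format shown above:

```python
import re

# Matches scrapy.extensions.logstats output lines like the ones above.
LOGSTATS_RE = re.compile(r"Crawled (\d+) pages \(at (\d+) pages/min\)")

def interval_rates(log_lines):
    """Yield (total pages crawled, pages/min in the last interval)."""
    for line in log_lines:
        m = LOGSTATS_RE.search(line)
        if m:
            yield int(m.group(1)), int(m.group(2))

# After the initial zero line, the Python 3 log above yields rates
# 3600, 2460, 3300, 2400, 2400, 1980, 2400, 1860, 1980 - a visible
# downward drift over the run.
```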

@nctl144 sure, the main difference of vmprof is that it is a statistical profiler: https://en.wikipedia.org/wiki/Profiling_(computer_programming)#Statistical_profilers - it does not tell you how many times a function was called, so it’s useful to check cProfile results as well. But as you may have noticed, it has lower overhead, especially when there are many calls to small, fast functions.
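
For completeness, a minimal sketch of driving vmprof from code, assuming vmprof’s enable/disable file-descriptor API; run_bookworm() is the same hypothetical entry point as before:

```python
import vmprof

def run_bookworm():
    pass  # hypothetical benchmark entry point

# Statistically sample the call stack while the benchmark runs and
# write the profile to a file; the result can then be uploaded to
# vmprof.com for the flamegraph view discussed below.
with open("bench.vmprof", "w+b") as f:
    vmprof.enable(f.fileno())
    run_bookworm()
    vmprof.disable()
```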

The default vmprof display mode is called a “flamegraph”, but there are two more modes, reachable via the buttons on the bottom left shown in the screenshot. [screenshot: screen shot 2018-03-02 at 14 38 29]

There you can see the percentage of time spent in each function (both “self” and “cumulative”).

@lopuhin Aight, I got that! What I am doing now is concentrating on profiling the performance of bookworm and scrapy bench on Python 2 and Python 3. Profiling with snakeviz is fine, since it has docs which describe the profiling result in detail. However, since this is the first time I’ve used vmprof, I don’t really know what the information on the result page represents. I tried to look for answers but found none. So could you briefly explain the main parts I need to pay attention to when looking at the vmprof result? I hope you don’t mind 😃 Thank you so much! 😄

Do you have any other suggestions for further developing the project idea?

Not really, I think the first step is to measure and analyze the differences and where they come from - hypothetically, something like “urljoin (taking 20% of time overall) takes 100% more time under Python 3, likely because of this and that”. Also I think it makes sense to profile scrapy bench too - I remember the difference there was also quite large.
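
As a sketch of such a measurement, a throwaway micro-benchmark for the hypothetical urljoin example, runnable under both interpreters (the URLs are arbitrary):

```python
import timeit

# urljoin lives in different modules on Python 2 and Python 3.
setup = """
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2
"""

t = timeit.timeit("urljoin('http://example.com/a/b', '../c')",
                  setup=setup, number=100000)
print("100k urljoin calls: %.3fs" % t)
```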

Interesting results @nctl144! Btw, this issue could also be relevant here: https://github.com/scrapy/scrapy/issues/1306