scrapy: Scrapy is definitely slow when working from cache

Description

I’m using Scrapy with the cache enabled to first crawl the pages I need overnight and then polish the extraction while working from the cache. Surprisingly, the processing speed of cached pages is quite slow, below 2k pages per minute. My data processing is trivial, my data storage is MongoDB (I’ve tried disabling it to rule out an extra factor, but that didn’t affect the speed), and my CPU/IO isn’t even sweating. I’ve tried bumping CONCURRENT_ITEMS to a higher value, but got no result. I’m using the Twisted reactor. More than that, on a decent internet connection my crawling speed on an empty cache is roughly the same (1800 items per minute).
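For reference, the kind of concurrency bump mentioned above looks like this in settings.py (the values are illustrative, not the exact ones I tried):

CONCURRENT_ITEMS = 1000          # default is 100; items processed in parallel per response
CONCURRENT_REQUESTS = 64         # default is 16
REACTOR_THREADPOOL_MAXSIZE = 20  # default is 10; thread pool used for blocking calls

Bumping CONCURRENT_ITEMS this way had no effect on throughput.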

Below are my cache settings.

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = "httpcache"
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

Steps to Reproduce

  1. Crawl the website and cache all the pages
  2. Rerun the spider on cache
  3. Watch slow processing speed
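A minimal spider for step 1 might look like the following (a sketch; the start URL and selectors are placeholders, not from my actual project). Running it once fills the cache; rerunning it replays from the cache:

import scrapy


class CacheSpeedSpider(scrapy.Spider):
    name = "cache_speed"  # placeholder name
    start_urls = ["https://example.com/"]  # placeholder URL

    def parse(self, response):
        # Trivial extraction, mirroring the "trivial data processing" above
        yield {"url": response.url, "title": response.css("title::text").get()}
        # Follow every link so the cache fills with enough pages to measure
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)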

Expected behavior: Significantly higher parsing speed when working from the cache

Actual behavior: Parsing speed is quite low, and basically the same as with an empty cache on a decent connection.

Reproduces how often: All of my spiders suffer from this behavior.

Versions

Scrapy       : 2.5.0
lxml         : 4.6.3.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.2.0
Python       : 3.9.1 (default, Apr 12 2021, 01:27:54) - [Clang 10.0.0 (clang-1000.10.44.4)]
pyOpenSSL    : 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021)
cryptography : 3.4.7
Platform     : macOS-10.13.6-x86_64-i386-64bit

Additional context

My full settings.py is below:

import os
from urllib.parse import quote_plus


def get_env_str(k, default):
    return os.environ.get(k, default)


def get_env_int(k, default):
    return int(get_env_str(k, default))


BOT_NAME = "corpora"

SPIDER_MODULES = ["corpora.spiders"]
NEWSPIDER_MODULE = "corpora.spiders"


ROBOTSTXT_OBEY = False


AUTOTHROTTLE_ENABLED = False
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = "httpcache"
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"


ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
    "corpora.pipelines.MongoDBPipeline": 9000,
}

FILES_STORE = "nanu_pdfs"

DOWNLOAD_WARNSIZE = 3355443200
DOWNLOAD_TIMEOUT = 1800
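# Note: the assignment below overrides the empty HTTPCACHE_IGNORE_HTTP_CODES set earlier in this file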
HTTPCACHE_IGNORE_HTTP_CODES = [500, 501, 502, 503, 401, 403]
RETRY_ENABLED = True

MONGODB_HOST = quote_plus(get_env_str("MONGODB_HOST", "localhost"))
MONGODB_PORT = get_env_int("MONGODB_PORT", 27017)
MONGODB_USERNAME = quote_plus(get_env_str("MONGODB_USERNAME", ""))
MONGODB_PASSWORD = quote_plus(get_env_str("MONGODB_PASSWORD", ""))
MONGODB_AUTH_DB = get_env_str("MONGODB_AUTH_DB", "admin")
MONGODB_DB = get_env_str("MONGODB_DB", "ubertext")
MONGODB_CONNECTION_POOL_KWARGS = {}
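For context, here is a hypothetical sketch of how a pipeline like corpora.pipelines.MongoDBPipeline might consume the settings above (the real pipeline is not shown in this report; the quote_plus calls suggest the values end up in a connection URI):

import pymongo


class MongoDBPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        pipeline.settings = crawler.settings
        return pipeline

    def open_spider(self, spider):
        s = self.settings
        creds = ""
        if s["MONGODB_USERNAME"]:
            creds = f"{s['MONGODB_USERNAME']}:{s['MONGODB_PASSWORD']}@"
        uri = (
            f"mongodb://{creds}{s['MONGODB_HOST']}:{s['MONGODB_PORT']}"
            f"/?authSource={s['MONGODB_AUTH_DB']}"
        )
        self.client = pymongo.MongoClient(uri, **s.get("MONGODB_CONNECTION_POOL_KWARGS", {}))
        self.db = self.client[s["MONGODB_DB"]]

    def process_item(self, item, spider):
        self.db["items"].insert_one(dict(item))  # collection name is a placeholder
        return item

    def close_spider(self, spider):
        self.client.close()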

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (9 by maintainers)

Most upvoted comments

@dchaplinsky, sorry, but I can’t help with any guidance here; I’ve been disconnected from Scrapy for a very long time now. Good luck!

@dchaplinsky

that’s expected behavior?

It is not expected that a website allows you to continuously send ~1500+ requests per minute from a single IP without IP bans or other anti-bot restrictions on the server side (this is a very rare case). Usually it is not… polite to send that many requests at that rate. From my point of view, a rate of ~1500+ requests per minute from a single IP is not slow; it is already too much.

In my case the gain from working solely from the cache is less than 2x (which makes me really sad)

In nearly all cases, websites hand out IP bans (temporary or permanent) at that request rate. The first thing we usually do in these cases is limit requests with the DOWNLOAD_DELAY setting: DOWNLOAD_DELAY = 1 gives ~60 requests per minute, and DOWNLOAD_DELAY = 0.5 gives ~120 requests per minute, as in the sketch below.
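In settings.py, that throttled configuration is just (a sketch of the setup described above):

DOWNLOAD_DELAY = 1               # ~60 requests per minute
# DOWNLOAD_DELAY = 0.5           # ~120 requests per minute
RANDOMIZE_DOWNLOAD_DELAY = True  # Scrapy default; waits 0.5x to 1.5x of DOWNLOAD_DELAY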

A configuration with HttpCache enabled will work 10x faster or more, as @Gallaecio said, compared to this more… ban-safe Scrapy configuration limited by the DOWNLOAD_DELAY setting.

Is there an option to optimize/utilize more CPU (CPU cores)?

I think that in this case (~1500+ requests per minute) performance is limited by a disk I/O bottleneck (not CPU, and not internet connection quality). My conclusion is based on the performance difference between FilesystemCacheStorage and DbmCacheStorage: DbmCacheStorage is less I/O intensive than FilesystemCacheStorage (at least with a relatively low amount of cached data).
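Switching the backend is a small settings change (a sketch; HTTPCACHE_DBM_MODULE is shown with its default value):

HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.DbmCacheStorage"
HTTPCACHE_DBM_MODULE = "dbm"  # default; any module with a dbm-compatible interface works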

Or is this the maximum that the reactor can provide?

As far as I know (not 100% sure), reactor features are mostly aimed at optimizing network and CPU performance, not disk I/O; reading from and writing to the cache is not related to the reactor.
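One way to check where the time actually goes is to profile a cached run with cProfile and inspect the top cumulative entries (a sketch; the spider name is a placeholder):

# Shell: python -m cProfile -o cache_run.prof -m scrapy crawl cache_speed
import pstats

stats = pstats.Stats("cache_run.prof")
# High cumulative time in file open/read calls from scrapy/extensions/httpcache.py
# would support the disk I/O bottleneck theory above.
stats.sort_stats("cumulative").print_stats(20)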