scrapy: Duplicates filtering and RAM usage
Summary
I am running a broad crawl with an input of ~4 million starting URLs. I followed the suggestions for broad crawls from here and am using the JOBDIR option to persist request queues to disk. I have been running this crawl for ~1.5 months. Over that time, I have observed the crawler's RAM usage grow from ~2 GB (1.5 months ago) to ~4.5 GB currently. I have already read about causes and workarounds here.
Based on my debugging, the main cause of this increased RAM usage is the set of request fingerprints that is stored in memory and queried during duplicates filtering, as described here.
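For context, here is a minimal sketch (my own simplification, not Scrapy's actual source) of the behaviour I am describing: the stock filter keeps every fingerprint in a Python set, and even with JOBDIR the persisted requests.seen file is loaded back into that set on resume, so memory grows with the number of unique requests.

# Simplified illustration of the stock in-memory filtering behaviour (not Scrapy's code)
class InMemoryDupeFilterSketch:

    def __init__(self) -> None:
        # one entry per unique request, kept for the whole crawl
        self.fingerprints = set()

    def request_seen(self, fp: str) -> bool:
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)  # ~40-char hex digest stays in RAM until the crawl ends
        return False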
Motivation
My non-beefy system only has 8 GB of RAM, so to prevent OOM issues I decided to write a duplicates filter that writes fingerprints to, and queries them from, a SQLite database. Below is the source code for the modified duplicates filter:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from scrapy.dupefilters import RFPDupeFilter
from scrapy.http.request import Request
from contextlib import closing
import sqlite3
import logging
import os


class RFPSQLiteDupeFilter(RFPDupeFilter):

    def __init__(self, path: str, debug: bool = False) -> None:
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        self.schema = """
            CREATE TABLE requests_seen (fingerprint TEXT PRIMARY KEY);"""
        self.db = os.path.join(path, "requests_seen.sqlite")
        db_exists = os.path.exists(self.db)
        self.conn = sqlite3.connect(self.db)
        # create the table only on first run; reuse the existing database otherwise
        if not db_exists:
            with closing(self.conn.cursor()) as cursor:
                cursor.execute(self.schema)
                self.conn.commit()
            self.logger.info("Created database: %s" % self.db)
        else:
            self.logger.info(
                "Skipping database creation since it already exists: %s" %
                self.db)

    def request_seen(self, request: Request) -> bool:
        # compute the request fingerprint
        fp = self.request_fingerprint(request)
        # try to insert it; a PRIMARY KEY violation means it was seen before
        try:
            with closing(self.conn.cursor()) as cursor:
                cursor.execute(
                    """
                    INSERT INTO requests_seen VALUES (?);
                    """, (fp, ))
                self.conn.commit()
        except sqlite3.IntegrityError:
            return True
        else:
            return False

    def close(self, reason: str) -> None:
        self.conn.close()
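To use it, I point DUPEFILTER_CLASS at the class above; the path argument is filled in from JOBDIR by the from_settings method inherited from RFPDupeFilter, so JOBDIR must be set. The module path and directory below are just examples from my setup:

# settings.py (example values; adjust the module path to your project)
DUPEFILTER_CLASS = "myproject.dupefilters.RFPSQLiteDupeFilter"
JOBDIR = "crawls/broad-crawl-1"  # also provides the directory for requests_seen.sqlite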
Observations
This brought my RAM usage back down to the levels I observed 1.5 months ago. Additionally, I did not observe a significant negative impact on crawling speed.
Feature request
Does it make sense to add this RFPSQLiteDupeFilter class to dupefilters.py in scrapy?
I can imagine this being a nice feature for broad crawls on machines with limited RAM. I would be glad to submit a PR if this is of interest.
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 16 (7 by maintainers)
Another possibility is to use a key-value database, with the key being the fingerprint and the value some metadata, such as the time the request was enqueued.
scrapy-deltafetch uses this approach with the native dbm Python library, which reduces the need for external dependencies: https://github.com/scrapy-plugins/scrapy-deltafetch/blob/master/scrapy_deltafetch/middleware.py
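For illustration, a rough sketch of what a dbm-backed filter along those lines could look like; the class name, file name, and stored metadata are placeholders of mine (not taken from scrapy-deltafetch), and it reuses the same request_fingerprint pattern as the class above:

from scrapy.dupefilters import RFPDupeFilter
import dbm
import logging
import os
import time


class DBMDupeFilterSketch(RFPDupeFilter):
    """Sketch: key = request fingerprint, value = enqueue timestamp."""

    def __init__(self, path: str, debug: bool = False) -> None:
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        # "c" opens the database for reading and writing, creating it if needed
        self.db = dbm.open(os.path.join(path, "requests_seen.db"), "c")

    def request_seen(self, request) -> bool:
        fp = self.request_fingerprint(request)
        if fp in self.db:
            return True
        # store the enqueue time as simple per-fingerprint metadata
        self.db[fp] = str(time.time())
        return False

    def close(self, reason: str) -> None:
        self.db.close()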