scrapy: Duplicates filtering and RAM usage

Summary

I am running a broad crawl with an input of ~4 million start URLs. I followed the suggestions for broad crawls from here and am using the JOBDIR option to persist request queues to disk. The crawl has been running for ~1.5 months, and over that time the crawler's RAM usage has grown from ~2 GB to ~4.5 GB. I have already read about causes and workarounds here.

Based on my debugging, the main cause of this increased RAM usage is the set of request fingerprints that is kept in memory and queried during duplicates filtering, as per here.
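
For context, the stock RFPDupeFilter keeps every fingerprint it has seen in an in-memory set for the lifetime of the crawl, roughly along these lines (a simplified sketch, not the exact Scrapy source):

class RFPDupeFilter:
    def __init__(self, path=None, debug=False):
        # every unique fingerprint seen so far is held in RAM
        self.fingerprints = set()

    def request_seen(self, request):
        # request_fingerprint() is provided by the real class
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        # the set only grows, which is what drives the RAM increase over time
        self.fingerprints.add(fp)
        return False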

Motivation

My non-beefy system only has 8 GB of RAM, so to prevent OOM issues I decided to write a duplicates filter that writes fingerprints to, and queries them from, a SQLite database instead of keeping them in memory. Below is the source code for the modified duplicates filter:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from scrapy.dupefilters import RFPDupeFilter
from scrapy.http.request import Request
from contextlib import closing
import sqlite3
import logging
import os


class RFPSQLiteDupeFilter(RFPDupeFilter):
    def __init__(self, path: str, debug: bool = False) -> None:
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        self.schema = """
        CREATE TABLE requests_seen (fingerprint TEXT PRIMARY KEY);"""
        self.db = os.path.join(path, "requests_seen.sqlite")
        db_exists = os.path.exists(self.db)
        self.conn = sqlite3.connect(self.db)

        # create the table only if the database file did not already exist
        if not db_exists:
            with closing(self.conn.cursor()) as cursor:
                cursor.execute(self.schema)
                self.conn.commit()
            self.logger.info("Created database: %s", self.db)
        else:
            self.logger.info(
                "Skipping database creation since it already exists: %s",
                self.db)

    def request_seen(self, request: Request) -> bool:
        # compute the request fingerprint
        fp = self.request_fingerprint(request)

        # try to insert the fingerprint; a primary-key conflict means the request was seen before
        try:
            with closing(self.conn.cursor()) as cursor:
                cursor.execute(
                    """
                    INSERT INTO requests_seen VALUES (?);
                    """, (fp, ))
                self.conn.commit()
        except sqlite3.IntegrityError:
            return True
        else:
            return False

    def close(self, reason: str) -> None:
        self.conn.close()
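
For completeness, this is how I enable the filter in my project settings (a sketch; "myproject.dupefilters" is a placeholder for wherever the class lives). The inherited from_settings passes the JOBDIR path as the path argument, so JOBDIR must be set:

# settings.py (sketch; module path and JOBDIR value are placeholders)
JOBDIR = "crawls/broad-crawl-1"
DUPEFILTER_CLASS = "myproject.dupefilters.RFPSQLiteDupeFilter"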

Observations

This brought my RAM usage back down to the levels I observed 1.5 months ago. Additionally, I did not observe a significant negative impact on crawling speed.

Feature request

Does it make sense to add this RFPSQLiteDupeFilter class to dupefilters.py in scrapy?

I can imagine this being a nice feature for broad crawls on machines with limited RAM. I would be glad to submit a PR if this is of interest.

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 16 (7 by maintainers)

Most upvoted comments

Another possibility is to use a key-value database, with the fingerprint as the key and some metadata, such as the time the request was enqueued, as the value.

scrapy-deltafetch takes this approach with Python's built-in dbm module, which avoids the need for external dependencies:

https://github.com/scrapy-plugins/scrapy-deltafetch/blob/master/scrapy_deltafetch/middleware.py
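
For illustration, a dbm-based filter could look roughly like this (a sketch under the same assumptions as the SQLite version above, not the actual scrapy-deltafetch code):

import dbm
import logging
import os
import time

from scrapy.dupefilters import RFPDupeFilter
from scrapy.http.request import Request


class DBMDupeFilter(RFPDupeFilter):
    def __init__(self, path: str, debug: bool = False) -> None:
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        # "c" opens the database for reading and writing, creating it if needed
        self.db = dbm.open(os.path.join(path, "requests_seen"), "c")

    def request_seen(self, request: Request) -> bool:
        # dbm keys are bytes, so encode the fingerprint explicitly
        key = self.request_fingerprint(request).encode("utf-8")
        if key in self.db:
            return True
        # store the enqueue time as the value, mirroring the metadata idea above
        self.db[key] = str(time.time()).encode("utf-8")
        return False

    def close(self, reason: str) -> None:
        self.db.close()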