scrapy: scrapy.item.Field memory leak.

Description

  • I create a CrawlerRunner as the official docs suggest (https://docs.scrapy.org/en/latest/topics/practices.html?highlight=in%20script#run-scrapy-from-a-script). On top of that, I schedule crawl jobs by reading a scheduled.yml file, using the code below.
  • I usually use yield dict_to_item("ItemClass", {**data}) (implemented in the code below) to convert a data dict into an item. In the pipeline, the data is inserted into different Mongo collections depending on item.__class__.__name__.
  • When I run python schedule.py, it grows to 2 GB of memory after one day, and then Docker kills it.
  • I used muppy to check what was hogging the memory and got the results below. It looks like only scrapy.item.Field objects are not being released, so I ruled out a memory leak in the Spider class itself. (A diff-based variant of the debug helper is sketched after the results.)
  • I checked the documentation and Stack Overflow and didn't find a similar question, so I'm asking here. Please let me know if my description is confusing or if I've made some low-level mistake in my code. At the moment I suspect either the way I implement the scheduled jobs or dict_to_item(), and I can't rule out a Python garbage-collection problem or a Scrapy bug.
    # memory-debug helper (added to schedule.py); muppy and summary come from pympler
    from pympler import muppy, summary
    from twisted.internet.task import LoopingCall

    def debug_memory_leak():
        logging.warning("=" * 50 + "MEMORY LEAK DEBUG START" + "=" * 50)
        all_objects = muppy.get_objects()
        suml = summary.summarize(all_objects)
        summary.print_(suml)
        logging.warning("=" * 50 + "MEMORY LEAK DEBUG END" + "=" * 50)

    schedule_from_yml()
    LoopingCall(debug_memory_leak).start(60 * 30)  # print a summary every 30 minutes
    
    # RESULTS (after about half a day of running):
                      types |   # objects |   total size
          scrapy.item.Field |      824855 |    195.09 MB
                       dict |      352841 |    126.81 MB
       scrapy.item.ItemMeta |      117544 |    119.27 MB
                  frozenset |      118478 |     25.48 MB
                      tuple |      320417 |     19.94 MB
                        set |       59906 |     12.59 MB
                    weakref |      182900 |     12.56 MB
                        str |       77587 |      9.03 MB
          function (remove) |       59033 |      7.66 MB
                  _abc_data |      117928 |      5.40 MB
                       code |       22258 |      3.98 MB
                        int |      126113 |      3.84 MB
                       type |        3954 |      3.66 MB
                       list |       61629 |      3.63 MB
    weakref.WeakKeyDictionary |       58999 |      2.70 MB
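    A diff-based variant is often easier to read than absolute counts, since it reports only what grew since the previous check. A minimal sketch using pympler's SummaryTracker (the 30-minute interval just mirrors the setup above):

    # sketch: report only the objects created/destroyed since the previous call
    from pympler import tracker
    from twisted.internet.task import LoopingCall

    _tracker = tracker.SummaryTracker()

    def debug_memory_growth():
        _tracker.print_diff()  # prints a summary diff since the last call

    LoopingCall(debug_memory_growth).start(60 * 30)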
    
    
    
# schedule.py
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor, defer
from scrapy.utils.log import configure_logging
from croniter import croniter
import datetime as dt
import logging, yaml, sys, os

configure_logging(get_project_settings())
runner = CrawlerRunner(get_project_settings())


def schedule_next_crawl_cron(null, spider, expression):
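    # the first argument is the result passed along by Deferred.addCallback (unused here)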
    now = dt.datetime.now()
    _next = croniter(expression, now).get_next(dt.datetime)
    sleep_time = int(_next.timestamp() - now.timestamp())
    logging.warning('<{}> cron -{}- next run at {} (after {} seconds).'.format(spider, expression, _next, sleep_time))
    reactor.callLater(sleep_time, crawl_cron, spider, expression, True)


def crawl_job(spider, run_at_scheduled, kwargs):
    if run_at_scheduled:
        return runner.crawl(spider, **kwargs)
    else:
        return defer.succeed([])

def crawl_cron(spider, expression, run_at_scheduled=False, kwargs=None):
    if kwargs is None:
        kwargs = {}
    d = crawl_job(spider, run_at_scheduled, kwargs)
    d.addCallback(schedule_next_crawl_cron, spider, expression)

def schedule_from_yml(file='scheduled.yml'):
    file = os.path.join(sys.path[0], file)
    with open(file) as f:
        scheduled = yaml.safe_load(f).get('scheduled', [])
    _allowed_method = ['interval', 'at_time', 'cron']
    for s in scheduled:
        logging.warning('add schedule {} from {}'.format(s, file))
        if s['method'] in _allowed_method:
            globals()['crawl_' + s.pop('method')](**s)
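# note: crawl_interval and crawl_at_time (the other entries in _allowed_method) are assumed to be defined elsewhere; only crawl_cron is shown here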


schedule_from_yml()
reactor.run()
# item.py
def dict_to_item(class_name: str, dictionary: dict):
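    # note: each call defines a brand-new scrapy.Item subclass (and new scrapy.Field objects)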
    item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
    return item_cls(**dictionary)
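
For reference, schedule_from_yml() above only needs yaml.safe_load() to return a mapping shaped like the following (the spider name and cron expression are placeholders; interval/at_time entries would follow the same pattern with their own keyword arguments):

# parsed equivalent of a hypothetical scheduled.yml
{
    "scheduled": [
        {"method": "cron", "spider": "my_spider", "expression": "*/30 * * * *"},
    ]
}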

Additional

  • I've tried setting JOB_DIR and running schedule.py, but nothing changed.
  • At first I thought the leak was caused by the repeated definition of scrapy.Field(), so I changed dict_to_item() to cache each defined item_cls in a settings dict (which is close to explicitly defining an item), but it didn't help.
    def dict_to_item(class_name: str, dictionary: dict):
        cache = get_project_settings().get("DICT_TO_ITEM_CACHE", None)
        if cache:
            if cache.get(class_name, None):
                item_cls = cache[class_name]
            else:
                item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
                cache[class_name] = item_cls
        else:
            # no (non-empty) cache configured: define a new class on every call
            item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
        # Item.update() returns None, so build the item first and return it explicitly
        item = item_cls()
        item.update(dictionary)
        return item
    
  • I know I should probably define items explicitly, but I can't be sure of every field name in some of my spiders, and in those cases I use dict_to_item(). I have only one pipeline that processes all crawler data; it inserts data into different Mongo collections based on item.__class__.__name__ (for example, an ApplePieItem goes into the apple_pies collection). A minimal sketch of that routing is shown below.
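
For context, the routing pipeline looks roughly like this (a minimal sketch; the class name, the MONGO_URI / MONGO_DATABASE settings, and the name-to-collection mapping are placeholders rather than my exact code):

# pipelines.py (sketch)
import pymongo

class MongoRouterPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        pipeline.mongo_uri = crawler.settings.get("MONGO_URI")
        pipeline.mongo_db = crawler.settings.get("MONGO_DATABASE")
        return pipeline

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # route by the item's class name; the real mapping
        # (e.g. ApplePieItem -> apple_pies) is omitted here
        collection = item.__class__.__name__
        self.db[collection].insert_one(dict(item))
        return item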

Versions

Python 3.8; Scrapy 2.7.0 / 2.5.1 (tried both)

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

@BillYu811 These lines from the traceback clearly indicate that the leak is caused by your custom code (not Scrapy code; I don't see any bugs or unexpected behaviour on the Scrapy side):

  >   File "/app/crawlers/crawlers/spiders/my_spider.py", line 45
  >     yield dict_to_item("MyItem", {
  >   File "/app/crawlers/crawlers/items.py", line 11
  >     item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})

In the general case we expect one or several scrapy.Item classes with a predefined (fixed) set of scrapy.Field attributes to exist in a Scrapy project.

I don't remember anything like dynamic items (in the sense of multiple scrapy.Item classes generated at runtime with dynamically defined scrapy.Field names) being mentioned in the documentation.

  > I usually use yield dict_to_item("ItemClass", {**data}) (implemented in the code below) to convert a data dict into an item. In the pipeline, the data is inserted into different Mongo collections depending on item.__class__.__name__.

I don't think you need to return items as scrapy.Item objects. Did you try returning plain dict objects as items (without conversion to scrapy.Item)?
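
For example, something along these lines (a sketch; the "_type" marker key and the pipeline lookup are just one possible convention, not Scrapy API):

def dict_to_item(class_name: str, dictionary: dict) -> dict:
    # plain dicts are valid Scrapy items; keep the class name as a marker key
    return {"_type": class_name, **dictionary}

# and in the pipeline, route on the marker instead of item.__class__.__name__:
#     collection = item.pop("_type")
#     self.db[collection].insert_one(item)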

def dict_to_item(class_name: str, dictionary: dict):
    cache = get_project_settings().get("DICT_TO_ITEM_CACHE", None)  # <- may be shared between multiple spider runs
    if cache:
        if cache.get(class_name, None):
            item_cls = cache[class_name]
        else:
            item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
            cache[class_name] = item_cls
    else:
        item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
    item = item_cls()
    item.update(dictionary)
    return item

Inside dict_to_item, your application dynamically grows the size of the custom DICT_TO_ITEM_CACHE setting.

And running multiple spiders in a single process with CrawlerRunner means that each newly scheduled spider uses the same instance of the DICT_TO_ITEM_CACHE setting (it gets bigger with every new scheduled spider, since it may still hold cached values from previous runs). I'd recommend periodically logging/tracking the size of the DICT_TO_ITEM_CACHE setting during runtime.
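
Something as simple as this would do (a sketch reusing the LoopingCall pattern from schedule.py; the 30-minute interval is arbitrary):

from twisted.internet.task import LoopingCall
from scrapy.utils.project import get_project_settings
import logging

def log_cache_size():
    cache = get_project_settings().get("DICT_TO_ITEM_CACHE", {})
    logging.warning("DICT_TO_ITEM_CACHE currently holds %d item classes", len(cache))

LoopingCall(log_cache_size).start(60 * 30)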