scrapy: scrapy.item.Field memory leak.
Description
- I create a CrawlerRunner the way the official docs suggest (https://docs.scrapy.org/en/latest/topics/practices.html?highlight=in%20script#run-scrapy-from-a-script). On top of that, I schedule tasks by reading a scheduled.yml file, with the code below.
- I usually use `yield dict_to_item("ItemClass", {**data})` (implemented in the code below) to convert a data dict into an item. In the pipeline, data is inserted into different mongo collections according to `item.__class__.__name__`.
- When I run `python schedule.py`, the process grows to 2 GB of memory after one day, and then my Docker setup kills it.
- I used muppy (from Pympler) to check what was hogging the memory and got the results below. It looks like only `scrapy.item.Field` objects are never released, so I ruled out a memory leak in the spider classes.
- I checked the documentation and Stack Overflow and didn't find a similar question, so I'm asking here. Please let me know if my description is confusing or if I've made some low-level mistake in my code. At the moment I suspect either the way I implement the scheduled job or dict_to_item(), and I can't rule out a Python garbage-collection problem or a Scrapy bug.
```python
# memory-debug version of the startup lines at the bottom of schedule.py
from pympler import muppy, summary
from twisted.internet.task import LoopingCall


def debug_memory_leak():
    logging.warning("=" * 50 + "MEMORY LEAK DEBUG START" + "=" * 50)
    all_objects = muppy.get_objects()
    suml = summary.summarize(all_objects)
    summary.print_(suml)
    logging.warning("=" * 50 + "MEMORY LEAK DEBUG END" + "=" * 50)


schedule_from_yml()
LoopingCall(debug_memory_leak).start(60 * 30)  # dump a summary every 30 minutes
```

RESULTS (after about half a day of running):

```
types                      | # objects | total size
scrapy.item.Field          |    824855 |  195.09 MB
dict                       |    352841 |  126.81 MB
scrapy.item.ItemMeta       |    117544 |  119.27 MB
frozenset                  |    118478 |   25.48 MB
tuple                      |    320417 |   19.94 MB
set                        |     59906 |   12.59 MB
weakref                    |    182900 |   12.56 MB
str                        |     77587 |    9.03 MB
function (remove)          |     59033 |    7.66 MB
_abc_data                  |    117928 |    5.40 MB
code                       |     22258 |    3.98 MB
int                        |    126113 |    3.84 MB
type                       |      3954 |    3.66 MB
list                       |     61629 |    3.63 MB
weakref.WeakKeyDictionary  |     58999 |    2.70 MB
```
```python
# schedule.py
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor, defer
from scrapy.utils.log import configure_logging
from croniter import croniter
import datetime as dt
import logging, yaml, sys, os

configure_logging(get_project_settings())
runner = CrawlerRunner(get_project_settings())


def schedule_next_crawl_cron(null, spider, expression):
    now = dt.datetime.now()
    _next = croniter(expression, now).get_next(dt.datetime)
    sleep_time = int(_next.timestamp() - now.timestamp())
    logging.warning('<{}> cron -{}- next run at {} (after {} seconds).'.format(
        spider, expression, _next, sleep_time))
    reactor.callLater(sleep_time, crawl_cron, spider, expression, True)


def crawl_job(spider, run_at_scheduled, kwargs):
    if run_at_scheduled:
        return runner.crawl(spider, **kwargs)
    else:
        return defer.succeed([])


def crawl_cron(spider, expression, run_at_scheduled=False, kwargs=None):
    if kwargs is None:
        kwargs = {}
    d = crawl_job(spider, run_at_scheduled, kwargs)
    d.addCallback(schedule_next_crawl_cron, spider, expression)


def schedule_from_yml(file='scheduled.yml'):
    file = sys.path[0] + '/' + file
    with open(file) as f:
        scheduled = yaml.safe_load(f).get('scheduled', [])
    _allowed_method = ['interval', 'at_time', 'cron']
    for s in scheduled:
        logging.warning('add schedule {} from {}'.format(s, file))
        if s['method'] in _allowed_method:
            globals()['crawl_' + s.pop('method')](**s)


schedule_from_yml()
reactor.run()
```
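For reference, a hypothetical scheduled.yml in the shape that schedule_from_yml() expects (the spider name and cron expression are made up; only the `cron` method is exercised here):

```yaml
scheduled:
  - method: cron              # one of: interval, at_time, cron
    spider: apple_pie         # hypothetical spider name
    expression: "0 */6 * * *" # croniter expression: every 6 hours
```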
```python
# item.py
import scrapy


def dict_to_item(class_name: str, dictionary: dict):
    # NB: this creates a brand-new Item subclass on every call
    item_cls = type(class_name, (scrapy.Item,),
                    {k: scrapy.Field() for k in dictionary.keys()})
    return item_cls(**dictionary)
```
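To sanity-check dict_to_item() outside of Scrapy's runtime, I can run a standalone diagnostic along these lines (my own sketch, not part of the project): it counts how many `Item` subclasses are still alive after a forced collection; if classes accumulate the way the muppy summary suggests, the count grows with the number of calls.

```python
# leak_check.py -- standalone diagnostic sketch
import gc

import scrapy


def dict_to_item(class_name: str, dictionary: dict):
    # same dynamic-class trick as in item.py
    item_cls = type(class_name, (scrapy.Item,),
                    {k: scrapy.Field() for k in dictionary})
    return item_cls(**dictionary)


for _ in range(1000):
    dict_to_item("ApplePieItem", {"name": "apple pie", "price": 3})

gc.collect()  # force a full collection before counting
live = [o for o in gc.get_objects()
        if isinstance(o, type) and issubclass(o, scrapy.Item)]
print("live scrapy.Item subclasses:", len(live))
```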
Additional
- I've tried setting JOB_DIR and running schedule.py again, but nothing changed.
- At first I thought the leak was caused by repeatedly defining scrapy.Field(), so I changed dict_to_item() to cache each generated item_cls in a settings dict (which is almost like explicitly defining an item), but that didn't help either:
```python
def dict_to_item(class_name: str, dictionary: dict):
    cache = get_project_settings().get("DICT_TO_ITEM_CACHE", None)
    if cache is not None:
        if cache.get(class_name) is not None:
            item_cls = cache[class_name]
        else:
            item_cls = type(class_name, (scrapy.Item,),
                            {k: scrapy.Field() for k in dictionary.keys()})
            cache[class_name] = item_cls
    else:
        item_cls = type(class_name, (scrapy.Item,),
                        {k: scrapy.Field() for k in dictionary.keys()})
    item = item_cls()
    item.update(dictionary)  # update() returns None, so build the item first
    return item
```
- I know I should probably define items explicitly, but in some of my spiders I can't know every field name in advance, and in those cases I use dict_to_item(). I have a single pipeline that processes all crawler data and inserts it into different mongo collections based on `item.__class__.__name__` (for example, ApplePieItem goes into the apple_pies collection); see the sketch below.
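For context, a stripped-down sketch of what that pipeline does (the class name, Mongo URI, database name, and the name-mapping rule here are simplified placeholders, not my exact code):

```python
import re

import pymongo
from itemadapter import ItemAdapter


def class_name_to_collection(name: str) -> str:
    # e.g. "ApplePieItem" -> "apple_pies" (simplified mapping)
    base = re.sub(r"Item$", "", name)
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", base).lower()
    return snake + "s"


class MongoPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder URI
        self.db = self.client["scrapy_data"]  # placeholder database name

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # route by the item's class name, whatever spider produced it
        collection = class_name_to_collection(item.__class__.__name__)
        self.db[collection].insert_one(ItemAdapter(item).asdict())
        return item
```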
Versions
- Python 3.8
- Scrapy 2.7.0 / 2.5.1 (tried both)
About this issue
- State: open
- Created a year ago
- Comments: 17 (9 by maintainers)
@BillYu811 These lines from the traceback clearly indicate that the leak is caused by your custom code, not Scrapy code (I don't see any bugs or unexpected behaviour on the Scrapy side):
In the general case we expect that one or several `scrapy.Item` classes with a predefined (fixed) set of `scrapy.Field`s exist in a Scrapy project. I don't remember anything like dynamic items (in the sense of multiple `scrapy.Item` classes whose `scrapy.Field` names are defined dynamically, during Scrapy runtime) being mentioned in the documentation.

I don't think that you need to return items as `scrapy.Item` class objects. Did you try to return… just `dict` objects as items (without conversion to `scrapy.Item`)?
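For example (just a sketch; the `_collection` key is one possible routing convention I'm making up here, not anything Scrapy requires):

```python
import scrapy


class ApplePieSpider(scrapy.Spider):
    name = "apple_pie"  # hypothetical spider

    def parse(self, response):
        # a plain dict instead of a dynamically created Item subclass;
        # the reserved "_collection" key tells the pipeline where it goes
        yield {"_collection": "apple_pies", "name": "apple pie", "price": 3}


class MongoDictPipeline:
    # open_spider/close_spider (client and db setup) as in the existing pipeline
    def process_item(self, item, spider):
        collection = item.pop("_collection")
        self.db[collection].insert_one(item)
        return item
```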
Inside `dict_to_item` your application dynamically increases the size of the custom `DICT_TO_ITEM_CACHE` setting. And running multiple spiders in a single process using `CrawlerRunner` means that each new scheduled spider will use… the same instance of the `DICT_TO_ITEM_CACHE` setting (it will be bigger for each newly scheduled spider, since it may contain cached values from previous runs). I'd recommend somehow periodically logging/tracking… the size of the `DICT_TO_ITEM_CACHE` setting during runtime.
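For example, alongside the existing `debug_memory_leak` loop (a sketch):

```python
from twisted.internet.task import LoopingCall
from scrapy.utils.project import get_project_settings
import logging


def log_cache_size():
    cache = get_project_settings().get("DICT_TO_ITEM_CACHE", {})
    logging.warning("DICT_TO_ITEM_CACHE holds %d item classes", len(cache))


LoopingCall(log_cache_size).start(60 * 10)  # log every 10 minutes
```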