cheerio: out of memory when using cheerio in crawler

Hi, I’m using cheerio to parse HTML pages in a simple crawler, shown below. The process quickly runs out of memory after handling only tens of pages, even though my computer has more than 4 GB of free memory. I notice that cheerio has a load operation — do I need to unload the page explicitly, or is there some other way to make cheerio release the memory once processing finishes?

var cheerio = require('cheerio');
var request = require('request');

// Fetch a single listing page and print the fields we care about.
function parseSpecificRoom(url) {
    request({ uri: url }, function (err, resp, body) {
        if (err || !body) return; // skip failed requests instead of crashing on an undefined body
        var $ = cheerio.load(body);
        var price = $('.house-price').text();
        var pay = $('.pay-method').text();
        var type = $('.house-type').text().replace(/\s/g, '');
        var location = $('.xiaoqu').text().replace(/\s/g, '');
        var phone = $('.tel-num').text().replace(/\s/g, '');
        console.log(price + ', ' + pay + ', ' + type + ', ' + location + ', ' + phone);
    });
}

// Fetch one index page and follow every listing link it contains.
function parsePage(index) {
    request({ uri: 'http://sz.58.com/chuzu/pn' + index }, function (err, resp, body) {
        if (err || !body) return;
        var $ = cheerio.load(body);
        var zufang = $('#infolist').children('table').eq(1).children('tr');
        zufang.each(function (i, elem) {
            var url = $(this).children().eq(1).children().eq(0).attr('href');
            if (url) parseSpecificRoom(url);
        });
    });
}

// Kick off all 99 index pages at once; each spawns one request per listing.
for (var i = 1; i < 100; i++) {
    parsePage(i);
}
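Note that the loop starts all 99 index requests immediately, and each callback then fires one more request per listing row, so hundreds of response bodies and parsed documents can be alive at the same time. A minimal sketch of walking the index pages one at a time instead (crawlSequentially is a hypothetical helper, not part of the original code):

function crawlSequentially(index) {
    if (index >= 100) return; // same bound as the loop above
    request({ uri: 'http://sz.58.com/chuzu/pn' + index }, function (err, resp, body) {
        if (!err && body) {
            var $ = cheerio.load(body);
            $('#infolist').children('table').eq(1).children('tr').each(function (i, elem) {
                var url = $(this).children().eq(1).children().eq(0).attr('href');
                if (url) parseSpecificRoom(url);
            });
        }
        crawlSequentially(index + 1); // move to the next index page only after this one finishes
    });
}

crawlSequentially(1);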

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 19 (6 by maintainers)

Most upvoted comments

Actually, I ran into the “out of memory” error recently and tried replacing cheerio with whacko, which solved my problem. Thanks a lot — but will cheerio fix this problem in the future?

@mike442144 I’m working on this

@pps83 Thanks for mentioning whacko. I replaced the import (since they have the same API) and it works. Memory usage is stable.
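The change really is a one-line swap of the require, assuming (as noted above) that whacko exposes the same load() API as cheerio:

var $ = require('whacko').load(body); // drop-in replacement for cheerio.load(body)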

@luanmuniz Any update?

cheerio is a giant memory leak. “Implementation of core jQuery designed specifically for the server” — something like that absolutely cannot be used on a server. Simply download any YouTube page, then try to load it and run a couple of selectors in a loop; RAM will balloon to gigabytes. @fb55, I tried whacko, and it doesn’t have memory issues in my case.
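A minimal repro sketch along those lines (youtube.html here is a hypothetical saved copy of a large page):

var cheerio = require('cheerio');
var fs = require('fs');

var html = fs.readFileSync('youtube.html', 'utf8'); // any large downloaded page
for (var i = 0; i < 1000; i++) {
    var $ = cheerio.load(html); // fresh parse on every iteration
    $('a').length;              // run a couple of selectors
    $('div').length;
    if (i % 100 === 0) {
        console.log('rss: ' + Math.round(process.memoryUsage().rss / 1048576) + ' MB');
    }
}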

Well, in my case I did not keep any copies of the strings after a request completed. It seems the problem was really the GC not running, since V8 doesn’t account for the system memory limit (my system limit is 500 MB, but V8’s default old-space limit is about 1.4 GB, IIRC), so a simple --max_old_space_size parameter to node seems to have “resolved” it.
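For reference, the flag (in megabytes) goes on the node command line; a sketch assuming the crawler above is saved as crawler.js:

node --max_old_space_size=400 crawler.js

Picking a value below the system limit lets the GC run before that limit is hit.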