crawlee: Crawler not finished / not resolving

I’ve been testing Apify extensively recently and I’ve noticed a strange behavior: the crawler sometimes doesn’t stop/end properly when maxRequestsPerCrawl is reached while there are still requests being processed. It just never resolves.

I added a log statement to the _maybeFinish method and the output looks like this:

    INFO: BasicCrawler: Crawler reached the maxRequestsPerCrawl limit of 200 requests and will shut down soon. Requests that are in progress will be allowed to finish.
    _maybeFinish: _currentConcurrency = 2
    _maybeFinish: _currentConcurrency = 1

And then silence. After the maxRequestsPerCrawl INFO message, the _maybeFinish polling stops at 1 and nothing else happens. Execution never gets past await crawler.run();.
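For context, the setup is roughly the following minimal sketch (the crawler class, URL and handler are placeholders; the real handlePageFunction does more work):

    const Apify = require('apify');

    Apify.main(async () => {
        const requestQueue = await Apify.openRequestQueue();
        await requestQueue.addRequest({ url: 'https://example.com' });

        const crawler = new Apify.CheerioCrawler({
            requestQueue,
            maxRequestsPerCrawl: 200,
            handlePageFunction: async ({ request, $ }) => {
                // enqueue more links and push data here
            },
        });

        await crawler.run(); // when the bug occurs, this promise never resolves
        console.log('Crawler finished.'); // never reached
    });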

Any idea what I could be doing wrong? 🤔

Thanks!

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 58 (22 by maintainers)

Most upvoted comments

@surfshore

0.20.3-dev.2 now has graceful closing of database connections, which may help in multi-process scenarios.

Hey guys, it’s been some time, but we finally have a completely new RequestQueue ready for testing. It is based on an SQLite database and we would like to invite you to test it and see if it solves your problem. To get it, simply use

"apify": "dev" or "apify": "0.20.3-dev.0" in your package.json dependencies.

Thanks.

@surfshore Thanks for the additional info. I know that this issue’s been hanging here for a while, but I think we’ll be able to start working on it in the next week or so.

Hi, I have the same problem. In my case there is a 50%+ chance that this scenario will happen. I tried debugging a little.

I think the cause is pendingCount in the RequestQueueLocal class. The finish condition is “this.pendingCount === this.inProgressCount” in requestQueue.isEmpty(); when the crawl did not finish, pendingCount had ended up with a negative value.

If I try to crawl 6000 requests, it looks like this:

    start:          pendingCount: 6000, inProgressCount: 0
    (maybe) finish: pendingCount: -246, inProgressCount: 0, _handledCount: 6246, progressQueueCount: 0

In other words, 6246 requests were handled while pendingCount was only incremented 6000 times, and 6000 - 6246 = -246, which matches the final pendingCount.

The only code that increases pendingCount is in addRequest(), so I think the problem is that this code is sometimes not reached. Could it be that “requestCopy.id” is not included in “this.requestIdToQueueOrderNo[]”?
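To illustrate the idea, here is a simplified, purely hypothetical sketch of the suspected bookkeeping (not the actual RequestQueueLocal code): if the counter is decremented more often than it was incremented, it drifts below zero and the finish condition can never hold.

    // Hypothetical sketch of the suspected counter bookkeeping,
    // not the real RequestQueueLocal implementation.
    class QueueCounterSketch {
        constructor() {
            this.pendingCount = 0;    // incremented only in addRequest()
            this.inProgressCount = 0;
        }

        addRequest() {
            this.pendingCount++;
        }

        markHandled() {
            // If this runs more often than addRequest() incremented the
            // counter (e.g. 6246 handled vs. 6000 added), pendingCount
            // drifts below zero.
            this.pendingCount--;
        }

        isEmpty() {
            // With pendingCount at -246 and inProgressCount at 0, this
            // condition never becomes true, so the crawler never finishes.
            return this.pendingCount === this.inProgressCount;
        }
    }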

I hope I can be of any help to you.

In my case:

  • Ubuntu server 18.04 LTS
  • CheerioCrawler (apify@0.19.1)
  • requestQueue 6000
  • empty pending folder
  • when the requestQueue is finished, currentConcurrency/desiredConcurrency increases to the value of maxConcurrency.
  • maxRequestsPerCrawl has no effect.

Just a quick follow-up: I tested the dev branch with SQLite yesterday and saw no issues; I got the same result as with my previous belt-and-braces filesystem-based crawl of ~46,000 pages. It was not extensive testing yet, though. Looking forward to this finding its way into master.

@surfshore I created a new issue to track this problem with stealth. Closing this one, as it seems that the RequestQueue-related issues were solved, barring the multi-process usage.

@cspeer 0.20.3-dev.1 is out. Please let me know if it fixes your problem.

Ok, got it now. Thanks. It’s most likely a bug in the caching of request queue instances. I will fix it today and release a new version.

@surfshore Regarding the “database is locked” problem: it seems that this is caused by multiple running instances of the crawler accessing the same SQLite database file. I successfully worked around that by having my code create a separate directory for each instance of the crawler and then setting process.env.APIFY_LOCAL_STORAGE_DIR so that every crawler has its own database.
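A rough sketch of that workaround (the directory layout and naming are my own choices; the important part is setting APIFY_LOCAL_STORAGE_DIR per process before Apify’s storage is first used):

    const path = require('path');

    // Give each crawler process its own local storage directory so that the
    // SQLite files are never shared between processes.
    process.env.APIFY_LOCAL_STORAGE_DIR = path.join(
        __dirname,
        'apify_storage',
        `instance_${process.pid}`,
    );

    // Load Apify only after the env var is set.
    const Apify = require('apify');

    Apify.main(async () => {
        const requestQueue = await Apify.openRequestQueue();
        // ... configure and run the crawler as usual
    });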