crawlee: Crawler not finished / not resolving

I’ve been testing Apify extensively recently and I’ve noticed a strange behavior: the crawler sometimes doesn’t stop/end properly when maxRequestsPerCrawl is reached while there are still requests being processed. It just never resolves.

I added a log statement to the _maybeFinish method and the output looks like this:

    INFO: BasicCrawler: Crawler reached the maxRequestsPerCrawl limit of 200 requests and will shut down soon. Requests that are in progress will be allowed to finish.
    _maybeFinish: _currentConcurrency = 2
    _maybeFinish: _currentConcurrency = 1

And then silence. After the maxRequestsPerCrawl INFO message, the _maybeFinish polling stops at 1 and nothing else happens. Execution never gets past await crawler.run();.
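For context, the setup is roughly the following minimal sketch (the crawler class, URL and handler are placeholders; the real handlePageFunction does more work):

    const Apify = require('apify');

    Apify.main(async () => {
        const requestQueue = await Apify.openRequestQueue();
        await requestQueue.addRequest({ url: 'https://example.com' });

        const crawler = new Apify.CheerioCrawler({
            requestQueue,
            maxRequestsPerCrawl: 200,
            handlePageFunction: async ({ request, $ }) => {
                // enqueue more links and push data here
            },
        });

        await crawler.run(); // when the bug occurs, this promise never resolves
        console.log('Crawler finished.'); // never reached
    });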

Any idea what I could be doing wrong? 🤔

Thanks!

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 58 (22 by maintainers)

Most upvoted comments

@surfshore

0.20.3-dev.2 now has graceful closing of database connections, which may help in multi-process scenarios.

Hey guys, it’s been some time, but we finally have a completely new RequestQueue ready for testing. It is based on an SQLite database and we would like to invite you to test it and see if it solves your problem. To get it, simply use

"apify": "dev" or "apify": "0.20.3-dev.0" in your package.json dependencies.

Thanks.

@surfshore Thanks for the additional info. I know that this issue’s been hanging here for a while, but I think we’ll be able to start working on it in the next week or so.

Hi, I have the same problem. In my case there is a 50%+ chance that this scenario will happen. I tried debugging a little.

I think the cause is pendingCount in the RequestQueueLocal class. The finish condition is “this.pendingCount === this.inProgressCount” in requestQueue.isEmpty(); when the crawl did not finish, pendingCount had ended up with a negative value.

If I try to crawl 6000 requests, it looks like this:

    start:          pendingCount: 6000, inProgressCount: 0
    (maybe) finish: pendingCount: -246, inProgressCount: 0, _handledCount: 6246, progressQueueCount: 0

In other words, 6246 requests were handled while pendingCount was only incremented 6000 times, and 6000 - 6246 = -246, which matches the final pendingCount.

The only code that increases pendingCount is in addRequest(), so I think the problem is that this code is sometimes not reached. Could it be that “requestCopy.id” is not included in “this.requestIdToQueueOrderNo[]”?
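To illustrate the idea, here is a simplified, purely hypothetical sketch of the suspected bookkeeping (not the actual RequestQueueLocal code): if the counter is decremented more often than it was incremented, it drifts below zero and the finish condition can never hold.

    // Hypothetical sketch of the suspected counter bookkeeping,
    // not the real RequestQueueLocal implementation.
    class QueueCounterSketch {
        constructor() {
            this.pendingCount = 0;    // incremented only in addRequest()
            this.inProgressCount = 0;
        }

        addRequest() {
            this.pendingCount++;
        }

        markHandled() {
            // If this runs more often than addRequest() incremented the
            // counter (e.g. 6246 handled vs. 6000 added), pendingCount
            // drifts below zero.
            this.pendingCount--;
        }

        isEmpty() {
            // With pendingCount at -246 and inProgressCount at 0, this
            // condition never becomes true, so the crawler never finishes.
            return this.pendingCount === this.inProgressCount;
        }
    }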

I hope I can be of any help to you.

In my case:

  • Ubuntu server 18.04 LTS
  • CheerioCrawler (apify@0.19.1)
  • requestQueue 6000
  • empty pending folder
  • when the requestQueue is finished, currentConcurrency/desiredConcurrency increases to the value of maxConcurrency.
  • maxRequestsPerCrawl has no effect.

Just a quick follow-up: I tested the dev branch with SQLite yesterday and saw no issues; I got the same result as with my previous belt-and-braces filesystem-based crawl of ~46,000 pages. It was not extensive testing yet, though. Looking forward to this finding its way into master.

@surfshore I created a new issue to track this problem with stealth. Closing this one, as it seems that the RequestQueue-related issues were solved, barring the multi-process usage.

@cspeer 0.20.3-dev.1 is out. Please let me know if it fixes your problem.

Ok, got it now. Thanks. It’s most likely a bug in the caching of request queue instances. I will fix it today and release a new version.

@surfshore Regarding the “database is locked” problem: it seems that this is caused by multiple running instances of the crawler accessing the same SQLite database file. I successfully worked around that by having my code create a separate directory for each instance of the crawler and then setting process.env.APIFY_LOCAL_STORAGE_DIR so that every crawler has its own database.
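A rough sketch of that workaround (the directory layout and naming are my own choices; the important part is setting APIFY_LOCAL_STORAGE_DIR per process before Apify’s storage is first used):

    const path = require('path');

    // Give each crawler process its own local storage directory so that the
    // SQLite files are never shared between processes.
    process.env.APIFY_LOCAL_STORAGE_DIR = path.join(
        __dirname,
        'apify_storage',
        `instance_${process.pid}`,
    );

    // Load Apify only after the env var is set.
    const Apify = require('apify');

    Apify.main(async () => {
        const requestQueue = await Apify.openRequestQueue();
        // ... configure and run the crawler as usual
    });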