scrapyscript: Issue with "return self.results.get()" in "Processor().run()" causing processing to hang forever
Thank you for writing scrapyscript; it’s been very helpful!
However, I have a script that looks something like the one below, written in Python 3.5. I noticed that when I call `Processor(settings=config).run(job)` as shown, the spider runs to completion, but the call hangs after the scrape is done (which causes my Celery 4 jobs to hang and never finish).
I verified that in the scrapyscript code, the run() method makes it through p.start(), p.join(), and p.terminate(), but hangs on the return statement, which reads the results from a Queue. If I comment out the return statement (I personally don’t care about the returned results), processing finishes.
```python
from scrapy.utils.project import get_project_settings
from scrapyscript import Job, Processor
from myproject.spiders.myspider import MySpider

scraper_args = dict(arg1="1", arg2="2")
config = get_project_settings()
spider = MySpider(**scraper_args)

# The scrapyscript API requires us to pass in the scraper_args a second
# time: the spider constructor is called once above and once again in
# the Job.
job = Job(spider, payload=scraper_args)

Processor(settings=config).run(job)
```
Impacted area: https://github.com/jschnurr/scrapyscript/blob/master/scrapyscript.py#L118
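For reference, the control flow in that area is roughly the following. This is a paraphrase of what I see in the source, not the exact code; the ordering of the calls is the part that matters:

```python
from multiprocessing import Process, Queue

class Processor:
    # Paraphrased control flow only; the real scrapyscript code differs in detail.
    def __init__(self, settings=None):
        self.results = Queue()

    def _crawl(self, jobs):
        ...  # runs the spiders and puts the scraped items on self.results

    def run(self, jobs):
        p = Process(target=self._crawl, args=[jobs])
        p.start()
        p.join()        # reached
        p.terminate()   # reached
        return self.results.get()  # execution blocks here and never returns
```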
(As an aside, I also overrode the __init__() method in my spider to accept arguments, but noticed that I have to pass the custom arguments both to MySpider and as the payload; maybe I should open a separate ticket for that?)
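For context, the spider’s constructor looks roughly like this (names are illustrative):

```python
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, arg1=None, arg2=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The same values also travel in the Job payload, since the spider
        # ends up being constructed a second time inside the Job.
        self.arg1 = arg1
        self.arg2 = arg2
```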
About this issue
- State: closed
- Created 7 years ago
- Comments: 15 (8 by maintainers)
@christosmito I thought I had provided a detailed explanation already 😃. There are a few ways to fix this according to the Python documentation, so I didn’t provide a pull request, as @vidakDK may want to fix it in a different way.
The main issue is that all the items need to be removed from the queue before the process gets joined, otherwise the code deadlocks. So the `run` method in `__init__.py` should actually read like the sketch below instead.

Take a copy of the code, try this change, and see if it works in your scenario as well. It worked for me with more complex Scrapy Items (and even with simpler dicts) and a lot of data.
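A minimal sketch of the change, assuming `run()` spawns the crawl in a child process and collects items through a `multiprocessing`-style queue in `self.results`, as in the snippet from the original report (the exact names in scrapyscript may differ):

```python
from multiprocessing import Process

# Inside the Processor class:
def run(self, jobs):
    p = Process(target=self._crawl, args=[jobs])
    p.start()
    # Drain the queue *before* joining: per the multiprocessing docs, a child
    # process that has put items on a queue may not exit cleanly (and join()
    # or a later get() can block forever) until those items are consumed.
    results = self.results.get()
    p.join()
    p.terminate()
    return results
```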
@jschnurr Well, Scrapyscript combines the results of multiple calls into a single return object, which I then wanted to store in the database, in order to split the scraping and database actions in the code.

However, Scrapyscript was unable to end jobs when its response was tied to outside objects, so I added database-storing logic to each call and removed all data from the “return” option of Scrapyscript.
I would either try to edit the logic of those Twisted calls so they make deep copies of the data, which would enable it to kill the process normally, or just remove the `return` functionality.

Update: I don’t have the time to test this now, but the problem could be related to using Scrapy pipelines. If we test the example from the project readme file, but instead add some basic code in `pipelines.py` that returns the data instead of the spider itself, do we get the hanging process problem or does it work normally?

+1
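Regarding the `pipelines.py` experiment mentioned above, a minimal version could look like this (hypothetical class name, purely for testing the scenario):

```python
# myproject/pipelines.py -- collect items in a pipeline instead of relying
# on scrapyscript's return value (illustrative only).
#
# Enable in settings.py with something like:
#   ITEM_PIPELINES = {"myproject.pipelines.CollectItemsPipeline": 300}
class CollectItemsPipeline:
    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Hand the data off here (e.g. write it to a database) instead of
        # returning it through scrapyscript.
        pass
```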