scrapyscript: Issue with "return self.results.get()" in "Processor().run()" causing processing to hang forever
Thank you for writing scrapyscript; it’s been very helpful!
However, I have a script that looks something like the one below, written in Python 3.5. I noticed that when I call `Processor(settings=config).run(job)` as shown, the spider runs to completion, but the call hangs after the scrape is done (which causes my Celery 4 jobs to hang and never finish).
I verified that in the scrapyscript code, the run() method makes it through p.start(), p.join(), and p.terminate(), but hangs on the return statement, which reads the results from a Queue. If I comment out the return statement (I personally don’t care about the returned results), processing finishes.
```python
from scrapy.utils.project import get_project_settings
from scrapyscript import Job, Processor
from myproject.spiders.myspider import MySpider

scraper_args = dict(arg1="1", arg2="2")
config = get_project_settings()
spider = MySpider(**scraper_args)

# The scrapyscript API requires us to pass in the scraper_args a second
# time: the spider constructor is called once above and once again in
# the Job.
job = Job(spider, payload=scraper_args)

Processor(settings=config).run(job)
```
Impacted area: https://github.com/jschnurr/scrapyscript/blob/master/scrapyscript.py#L118
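For reference, the control flow in that area is roughly the following. This is a paraphrase of what I see in the source, not the exact code; the ordering of the calls is the part that matters:

```python
from multiprocessing import Process, Queue

class Processor:
    # Paraphrased control flow only; the real scrapyscript code differs in detail.
    def __init__(self, settings=None):
        self.results = Queue()

    def _crawl(self, jobs):
        ...  # runs the spiders and puts the scraped items on self.results

    def run(self, jobs):
        p = Process(target=self._crawl, args=[jobs])
        p.start()
        p.join()        # reached
        p.terminate()   # reached
        return self.results.get()  # execution blocks here and never returns
```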
(As an aside, I also overrode the __init__() method in my spider to accept arguments, but noticed that I have to pass the custom arguments both to MySpider and as the payload; maybe I should open a separate ticket for that?)
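For context, the spider’s constructor looks roughly like this (names are illustrative):

```python
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, arg1=None, arg2=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The same values also travel in the Job payload, since the spider
        # ends up being constructed a second time inside the Job.
        self.arg1 = arg1
        self.arg2 = arg2
```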
About this issue
- State: closed
- Created 7 years ago
- Comments: 15 (8 by maintainers)
@christosmito I thought I had provided a detailed explanation already 😃. There are a few ways to fix this according to the Python documentation, so I didn’t provide a pull request, as @vidakDK may want to fix it in a different way.
The main issue is that all the items need to be removed from the queue before the process gets joined, otherwise the code deadlocks. So the `run` method in `__init__.py` should actually read like the sketch below instead.

Take a copy of the code, try this change, and see if it works in your scenario as well. It worked for me with more complex Scrapy Items (and even with simpler dicts) and a lot of data.
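A minimal sketch of the change, assuming `run()` spawns the crawl in a child process and collects items through a `multiprocessing`-style queue in `self.results`, as in the snippet from the original report (the exact names in scrapyscript may differ):

```python
from multiprocessing import Process

# Inside the Processor class:
def run(self, jobs):
    p = Process(target=self._crawl, args=[jobs])
    p.start()
    # Drain the queue *before* joining: per the multiprocessing docs, a child
    # process that has put items on a queue may not exit cleanly (and join()
    # or a later get() can block forever) until those items are consumed.
    results = self.results.get()
    p.join()
    p.terminate()
    return results
```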
@jschnurr Well, Scrapyscript combines the results of multiple calls into a single return object, which I then wanted to store in the database, in order to split the scraping and database actions in the code.

However, Scrapyscript was unable to end jobs when its response was tied to outside objects, so I added database-storing logic to each call and removed all data from the “return” option of Scrapyscript.
I would either try to edit the logic of those Twisted calls so they make deep copies of the data, which would enable it to kill the process normally, or just remove the `return` functionality.

Update: I don’t have the time to test this now, but the problem could be related to using Scrapy pipelines. If we test the example from the project readme file, but instead add some basic code in `pipelines.py` that returns the data instead of the spider itself, do we get the hanging process problem or does it work normally?

+1
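Regarding the `pipelines.py` experiment mentioned above, a minimal version could look like this (hypothetical class name, purely for testing the scenario):

```python
# myproject/pipelines.py -- collect items in a pipeline instead of relying
# on scrapyscript's return value (illustrative only).
#
# Enable in settings.py with something like:
#   ITEM_PIPELINES = {"myproject.pipelines.CollectItemsPipeline": 300}
class CollectItemsPipeline:
    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Hand the data off here (e.g. write it to a database) instead of
        # returning it through scrapyscript.
        pass
```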