scrapy-splash: scrapy-splash recursive crawl using CrawlSpider not working
Hi!
I have integrated scrapy-splash into my CrawlSpider via the process_request of my rules, like this:
def process_request(self, request):
    request.meta['splash'] = {
        'args': {
            # set rendering arguments here
            'html': 1,
        }
    }
    return request
The problem is that the crawl only renders the URLs at the first depth. I also wonder how I can get the response even when it has a bad HTTP code or is a redirected response.
Thanks in advance,
About this issue
- State: open
- Created 8 years ago
- Reactions: 2
- Comments: 36 (2 by maintainers)
I also got the same issue here today and found that CrawlSpider does a response type check in the _requests_to_follow function.
However, responses generated by Splash are SplashTextResponse or SplashJsonResponse. Because of that check, a Splash response won't have any requests to follow.
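For reference, a sketch of that override might look like this (the body mirrors recent Scrapy versions, where process_request also receives the response; the only real change is the widened isinstance check):

from scrapy.http import HtmlResponse
from scrapy_splash import SplashJsonResponse, SplashTextResponse

# inside your CrawlSpider subclass:
def _requests_to_follow(self, response):
    # Accept Splash responses as well as plain HtmlResponse; otherwise
    # no links are extracted and the crawl stops after the first depth.
    if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
        return
    seen = set()
    for rule_index, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            request = self._build_request(rule_index, link)
            yield rule.process_request(request, response)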
@MontaLabidi Your solution worked for me.
My code follows the same pattern, and it works perfectly for me.
So I encountered this issue and solved it by overriding the type check as suggested above.
But you also have to avoid using SplashRequest in your process_request method to create the new Splash requests; just add splash to your scrapy.Request's meta. The scrapy.Request returned from the _requests_to_follow method carries attributes in its meta (such as the index of the rule that generated it) that CrawlSpider uses for its logic, so you don't want to build a totally different request with SplashRequest in your request wrapper. Just add splash to the already built request, like so:
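Something along these lines should work (the Splash args mirror the ones from the original post):

# inside your CrawlSpider subclass:
def use_splash(self, request):
    # Keep the request that _requests_to_follow already built (its meta
    # carries the rule index); just attach the Splash settings to it.
    # On Scrapy >= 1.7 this hook also receives a response argument,
    # as pointed out further down in the thread.
    request.meta['splash'] = {
        'args': {
            'html': 1,
        }
    }
    return request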
and reference it in your Rule:
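For example (the LinkExtractor settings and callback name are placeholders; process_request='use_splash' refers to the method above):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    Rule(LinkExtractor(),           # restrict with allow=... as needed
         callback='parse_item',
         follow=True,
         process_request='use_splash'),
)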
process_request="use_splash"the _requests_to_follow will apply the process_request to every built request, thats what worked for my CrawlSpider Hope that helps!@sp-philippe-oger could you please show the whole file? In my case the crawl spider won’t call the redefined _requests_to_follow and as a consequence still stops after the first page…
Hi, I have found a workaround which works for me. Instead of using a scrapy request:

yield scrapy.Request(page_url, self.parse_page)

simply append this Splash prefix to the URL:

yield scrapy.Request("http://localhost:8050/render.html?url=" + page_url, self.parse_page)

The localhost port may depend on how you built the Splash docker image.
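If page_url can itself contain query parameters, it is safer to URL-encode it before appending it to the prefix; a small helper along these lines (the name splash_get is just illustrative) keeps that in one place:

import scrapy
from urllib.parse import quote

SPLASH_RENDER = "http://localhost:8050/render.html?url="  # match your Splash host/port

def splash_get(page_url, callback):
    # URL-encode the target so query strings in page_url survive the prefixing
    return scrapy.Request(SPLASH_RENDER + quote(page_url, safe=''), callback)

and then yield splash_get(page_url, self.parse_page) from the callback.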
@dwj1324
I tried to debug my spider with PyCharm and set a breakpoint at if not isinstance(response, HtmlResponse): . That code was never reached when SplashRequest was used instead of scrapy.Request. What worked for me is to add this to the callback parsing function:
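One common shape of that idea is to follow links manually from the callback, re-wrapping each one in a SplashRequest; a sketch (the callback name and the bare LinkExtractor are placeholders):

from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest

# inside the spider:
def parse_page(self, response):
    # ... extract items from the rendered page here ...

    # Follow links by hand instead of relying on CrawlSpider's
    # _requests_to_follow; each followed page is rendered by Splash again.
    for link in LinkExtractor().extract_links(response):
        yield SplashRequest(link.url, callback=self.parse_page)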
Following is a working crawler for scraping https://books.toscrape.com, tested with Scrapy version 2.9.0. For installing and configuring Splash, follow the README.
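A minimal sketch of such a spider, combining the pieces discussed in this thread (Splash args attached to every followed request plus the widened response type check), might look like the following; the books.toscrape.com selectors and item fields are illustrative, and scrapy-splash is assumed to be configured in settings.py per its README:

import scrapy
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashJsonResponse, SplashTextResponse


class BooksSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    rules = (
        # Book detail pages
        Rule(LinkExtractor(restrict_css='article.product_pod h3'),
             callback='parse_book', process_request='use_splash'),
        # Pagination
        Rule(LinkExtractor(restrict_css='li.next'),
             process_request='use_splash', follow=True),
    )

    def start_requests(self):
        # Render the start page through Splash as well
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'splash': {'args': {'html': 1}}})

    def use_splash(self, request, response):
        # Scrapy >= 1.7 passes the response to this hook as well
        request.meta['splash'] = {'args': {'html': 1}}
        return request

    def _requests_to_follow(self, response):
        # Same override as discussed above: accept Splash response types
        if not isinstance(response, (HtmlResponse, SplashJsonResponse,
                                     SplashTextResponse)):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('p.price_color::text').get(),
        }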
I had this problem too. Just use yield rule.process_request(r, response) in the last line of the overridden method.

Since Scrapy 1.7.0, the process_request callback also receives a response parameter, so you need to change def use_splash(self, request): to def use_splash(self, request, response):

@sp-philippe-oger don't worry, I actually realized my problem is with the LinkExtractor, not the scrapy/splash combo… thanks!
I use scrapy-splash together with scrapy-redis. RedisCrawlSpider can run this way too, but the same method needs to be rewritten, and some parameters have to be adjusted for your own setup.