scrapy-splash: scrapy-splash recursive crawl using CrawlSpider not working

Hi!

I have integrated scrapy-splash into my CrawlSpider via process_request in the rules, like this:

def process_request(self, request):
    request.meta['splash'] = {
        'args': {
            # set rendering arguments here
            'html': 1,
        }
    }
    return request

The problem is that the crawl only renders the URLs at the first depth. I also wonder how I can get the response even for a bad HTTP status code or a redirected response.

Thanks in advance,

About this issue

  • State: open
  • Created 8 years ago
  • Reactions: 2
  • Comments: 36 (2 by maintainers)

Most upvoted comments

I also ran into the same issue today and found that CrawlSpider does a response type check in its _requests_to_follow function:

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    ...

However, the responses generated by Splash are SplashTextResponse or SplashJsonResponse, so that check means a Splash response never yields any requests to follow.

@MontaLabidi Your solution worked for me.

This is how my code looks:


from scrapy.http import HtmlResponse
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashJsonResponse, SplashTextResponse


class MySuperCrawler(CrawlSpider):
    name = 'mysupercrawler'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    rules = (
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div/a'),
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div[@class="pages"]/li/a'),
            process_request="use_splash",
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//a[@class="product"]'),
            callback='parse_item',
            process_request="use_splash"
        )
    )

    def _requests_to_follow(self, response):
        if not isinstance(
                response,
                (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def use_splash(self, request):
        request.meta.update(splash={
            'args': {
                'wait': 1,
            },
            'endpoint': 'render.html',
        })
        return request

    def parse_item(self, response):
        pass

This works perfectly for me.

So I encountered this issue and solved it by overriding the type check as suggested:

def _requests_to_follow(self, response):
    if not isinstance(response, (HtmlResponse, SplashTextResponse)):
        return
    ...

But you also have to avoid using SplashRequest in your process_request method to create the new Splash requests; just add splash to the meta of the scrapy.Request. The scrapy.Request returned from _requests_to_follow carries attributes in its meta (such as the index of the Rule that generated it) that CrawlSpider relies on for its logic, so you don't want to generate a totally different request by using SplashRequest in your request wrapper. Just add splash to the already built request, like so:

def use_splash(self, request):
    request.meta.update(splash={
        'args': {
            'wait': 1,
        },
        'endpoint': 'render.html',
    })
    return request

Then add it to your Rule with process_request="use_splash"; _requests_to_follow will apply process_request to every built request. That's what worked for my CrawlSpider. Hope that helps!

@sp-philippe-oger could you please show the whole file? In my case the CrawlSpider won't call the redefined _requests_to_follow and, as a consequence, still stops after the first page…

Hi, I have found a workaround which works for me: instead of using a plain Scrapy request, yield scrapy.Request(page_url, self.parse_page), simply prepend the Splash render endpoint to the URL: yield scrapy.Request("http://localhost:8050/render.html?url=" + page_url, self.parse_page). The localhost port may depend on how you set up the Splash Docker container.
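
One caveat with this workaround (a minimal sketch, assuming Splash listens on localhost:8050): if page_url carries query parameters of its own, it should be percent-encoded before being appended, otherwise its query string is interpreted as extra parameters of render.html:

from urllib.parse import quote

SPLASH_RENDER = "http://localhost:8050/render.html?url="

def splash_url(page_url):
    # percent-encode the target URL (including '/' and '&') so its own
    # query string is not treated as parameters of render.html
    return SPLASH_RENDER + quote(page_url, safe='')

# usage inside a spider callback (hypothetical page_url variable):
# yield scrapy.Request(splash_url(page_url), self.parse_page)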

@dwj1324

I tried to debug my spider with PyCharm and set a breakpoint at if not isinstance(response, HtmlResponse):. That code was never reached when SplashRequest was used instead of scrapy.Request.

What worked for me is to add this to the callback parsing function:

def parse_item(self, response):
    """Parse response into item also create new requests."""

    page = RescrapItem()
    ...
    yield page

    if isinstance(response, (HtmlResponse, SplashTextResponse)):
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = SplashRequest(url=link.url, callback=self._response_downloaded,
                                  args=SPLASH_RENDER_ARGS)
                r.meta.update(rule=rule, link_text=link.text)
                yield rule.process_request(r)

The following is a working crawler for scraping https://books.toscrape.com, tested with Scrapy version 2.9.0. For installing and configuring Splash, follow the README.


import scrapy
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest, SplashTextResponse, SplashJsonResponse



class FictionBookScrapper(CrawlSpider):
    _WAIT = 0.1

    name = "fiction_book_scrapper"
    allowed_domains = ['books.toscrape.com']
    start_urls = ["https://books.toscrape.com/catalogue/category/books_1/index.html"]

    le_book_details = LinkExtractor(restrict_css=("h3 > a",))
    rule_book_details = Rule(le_book_details, callback='parse_request', follow=False, process_request='use_splash')

    le_next_page = LinkExtractor(restrict_css='.next > a')
    rule_next_page = Rule(le_next_page, follow=True, process_request='use_splash')

    rules = (
        rule_book_details,
        rule_next_page,
    )

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, args={'wait': self._WAIT}, meta={'real_url': url})

    def use_splash(self, request, response):
        request.meta['splash'] = {
            'endpoint': 'render.html',
            'args': {
                'wait': self._WAIT
            }
        }
        return request

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashTextResponse, SplashJsonResponse)):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [
                lnk
                for lnk in rule.link_extractor.extract_links(response)
                if lnk not in seen
            ]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def parse_request(self, response: scrapy.http.Response):
        self.logger.info(f'Page status code = {response.status}, url= {response.url}')

        yield {
            'Title': response.css('h1 ::text').get(),
            'Link': response.url,
            'Description': response.xpath('//*[@id="content_inner"]/article/p/text()').get()
        }
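
For reference, the settings.py configuration described in the scrapy-splash README (and assumed by this crawler) looks roughly like this, with Splash reachable at localhost:8050:

# settings.py -- scrapy-splash configuration per its README
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# optional, only needed if you use Scrapy's HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'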


Since Scrapy 1.7.0, the process_request callback also receives a response parameter, so you need to change def use_splash(self, request): to def use_splash(self, request, response):

It does not work; it throws the error use_splash() missing 1 required positional argument: 'response'.

I had this problem too. Just use yield rule.process_request(r, response) in the last line of the overridden method
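
Putting the two changes together, a minimal sketch of the override for Scrapy >= 1.7 (assuming the same imports as the earlier example) would be:

def _requests_to_follow(self, response):
    if not isinstance(
            response,
            (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = self._build_request(n, link)
            # Scrapy >= 1.7 also passes the response to process_request
            yield rule.process_request(r, response)

def use_splash(self, request, response):
    # the extra 'response' argument is required since Scrapy 1.7
    request.meta.update(splash={
        'args': {'wait': 1},
        'endpoint': 'render.html',
    })
    return request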

@sp-philippe-oger don’t worry, I actually realized my problem is with the LinkExtractor, not the scrapy/splash combo… thanks!

I use scrapy-splash and scrapy-redis.

RedisCrawlSpider can run, but you need to rewrite the following methods:

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse_m, endpoint='execute', dont_filter=True, args={
                'url': url, 'wait': 5, 'lua_source': default_script
            })

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def _build_request(self, rule, link):
        # parameter 'meta' is required !!!!!
        r = SplashRequest(url=link.url, callback=self._response_downloaded, meta={'rule': rule, 'link_text': link.text},
                          args={'wait': 5, 'url': link.url, 'lua_source': default_script})
        # This update may be redundant, since 'rule' and 'link_text' are already set in meta above.
        r.meta.update(rule=rule, link_text=link.text)
        return r

Some parameters will need to be adjusted for your own setup.
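
The default_script referenced above is not shown in the comment; purely as an illustration (not the author's actual script), a minimal Lua script for Splash's execute endpoint might be defined like this:

# hypothetical stand-in for the 'default_script' used above
default_script = """
function main(splash, args)
    -- load the page, wait for it to render, then return the HTML
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return splash:html()
end
"""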