scrapy: [bug?] while True in start_requests(self) makes scrapy unable to consume the yields

I’m doing

    def start_requests(self):
        while True:
            words = read_a_list_wanna_crawl()
            ips = get_a_ip_list()
            if words.count() > 0:
                for word, ip in zip(words, ips):
                    print('do while')
                    # processed_url is built from `word` (construction elided here)
                    yield scrapy.Request(processed_url, self.html_parse,
                                         meta={'proxy': ip})  # other meta keys elided

But when `zip(words, ips)` yields only one pair, Scrapy prints `do while` forever (an infinite loop) and never downloads any requests; when it yields more than one pair, Scrapy does not go into the infinite loop.

Is this a bug? Can Scrapy handle this?

PS (another way to solve this): is it possible to create a fake scrapy.Request() that doesn’t actually download anything but still runs its callback, so this kind of control flow can be finished inside Scrapy?
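
One possible sketch of such a “fake” request (not from this thread, just an assumption about how it could be approximated): a downloader middleware that short-circuits the download by returning a Response from process_request, so the request never hits the network but its callback still fires. The FakeResponseMiddleware name and the `fake` meta key are made up for illustration.

    # settings.py (hypothetical): enable the middleware
    # DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.FakeResponseMiddleware': 543}

    from scrapy.http import HtmlResponse

    class FakeResponseMiddleware:
        """Short-circuits requests marked with meta={'fake': True}."""

        def process_request(self, request, spider):
            if request.meta.get('fake'):
                # Returning a Response here skips the downloader entirely;
                # the request's callback receives this response directly.
                return HtmlResponse(url=request.url, body=b'', encoding='utf-8')
            return None  # let normal requests proceed to the downloader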

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 18 (11 by maintainers)

Most upvoted comments

A good asynchronous solution is to use the spider_idle signal to schedule in batches:

    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Call spider_idle() whenever the engine runs out of requests.
        crawler.signals.connect(spider.spider_idle, signals.spider_idle)
        return spider

    def start_requests(self):
        # self.batch() is a user-defined generator yielding the next batch of requests.
        yield from self.batch()

    def spider_idle(self):
        if self.done():  # user-defined: True when there is nothing left to crawl
            return

        # Feed the next batch directly to the engine's scheduler.
        for req in self.batch():
            self.crawler.engine.schedule(req, self)

        # Keep the spider alive until the next idle signal.
        raise DontCloseSpider

When you yield a request, Scrapy puts the request object into the scheduler’s queue, and the engine pulls queued requests for concurrent downloading once enough requests have accumulated or after a short interval.
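
For concreteness, here is one possible sketch of the user-defined batch() and done() helpers referenced above, assuming the same read_a_list_wanna_crawl() / get_a_ip_list() sources as in the original question; build_url() is a made-up placeholder for however processed_url is constructed:

    def batch(self):
        # Hypothetical helper: read the currently pending words/proxies and
        # turn them into requests; yields nothing when the list is empty.
        words = read_a_list_wanna_crawl()
        ips = get_a_ip_list()
        for word, ip in zip(words, ips):
            # dont_filter=True keeps the duplicate filter from dropping re-queued URLs.
            yield scrapy.Request(build_url(word), self.html_parse,
                                 meta={'proxy': ip}, dont_filter=True)

    def done(self):
        # Hypothetical helper: stop once there are no words left to crawl.
        return read_a_list_wanna_crawl().count() == 0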