scrapy: Scrapy should handle "invalid" relative URLs better

Currently Scrapy can’t extract links from the http://scrapy.org/ page correctly because URLs in the page header are relative to a non-existent parent: ../download/, ../doc/, etc. Browsers resolve these links as http://scrapy.org/download/ and http://scrapy.org/doc/, while response.urljoin, urlparse.urljoin and our link extractors resolve them as http://scrapy.org/../download/, etc. This results in 400 Bad Request responses.

urlparse.urljoin is not correct (or at least not modern) here. The URL Living Standard, which browsers follow, says:

If buffer is "..", remove url’s path’s last entry, if any, and then if c is neither "/" nor "\", append the empty string to url’s path.
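The step quoted above can be sketched in Python. This is a minimal approximation of browser-style resolution, not Scrapy or stdlib API: it resolves with urllib.parse.urljoin and then discards any ".." segments that would climb above the root. The helper names remove_dot_segments and safe_urljoin are mine.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def remove_dot_segments(path):
    # Drop "." segments and pop one real segment for each "..".
    # Excess ".." that would climb above the root is discarded,
    # which is what browsers effectively do per the URL Living Standard.
    output = []
    for seg in path.split('/'):
        if seg == '.':
            continue
        if seg == '..':
            if len(output) > 1:
                output.pop()
        else:
            output.append(seg)
    if path.rsplit('/', 1)[-1] in ('.', '..'):
        output.append('')  # keep the trailing slash implied by "." or ".."
    return '/'.join(output)

def safe_urljoin(base, ref):
    # Resolve as usual, then normalize the path browser-style.
    scheme, netloc, path, query, frag = urlsplit(urljoin(base, ref))
    return urlunsplit((scheme, netloc, remove_dot_segments(path), query, frag))
```

With this, safe_urljoin("http://scrapy.org/", "../download/") yields http://scrapy.org/download/, matching what a browser would request.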

About this issue

  • State: open
  • Created 9 years ago
  • Comments: 15 (10 by maintainers)

Most upvoted comments

The most evil spider ever: looks innocent, but doesn’t work for multiple reasons

import scrapy

class ScrapySpider(scrapy.Spider):
    name = 'scrapyspider'

    def start_requests(self):
        yield scrapy.Request("http://scrapy.org", self.parse_main)

    def parse_main(self, response):
        # Header links like "../download/" get joined to
        # "http://scrapy.org/../download/", which the server
        # rejects with 400 Bad Request.
        for href in response.xpath("//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), self.parse_link)

    def parse_link(self, response):
        print(response.url)
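A crude workaround for a spider like the one above, assuming (as on scrapy.org) that the offending hrefs are relative to the site root, is to strip the excess leading "../" segments before calling response.urljoin. The helper strip_excess_dotdots below is hypothetical, not part of Scrapy:

```python
import re

def strip_excess_dotdots(href):
    # Hypothetical helper: drop leading "../" segments that would climb
    # above the site root. Only valid when the base URL is the root page.
    return re.sub(r'^(\.\./)+', '', href)

# In parse_main above, the request line would then become:
#     yield scrapy.Request(response.urljoin(strip_excess_dotdots(href)),
#                          self.parse_link)
```

This only papers over the symptom for root-relative pages; a general fix needs the dot-segment handling the URL Living Standard describes.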