scrapy: Scrapy should handle "invalid" relative URLs better
Currently Scrapy can’t extract links from the http://scrapy.org/ page correctly because URLs in the page header are relative to a non-existent parent: ../download/, ../doc/, etc. Browsers resolve these links as http://scrapy.org/download/ and http://scrapy.org/doc/, while response.urljoin, urlparse.urljoin and our link extractors resolve them as http://scrapy.org/../download/, etc. This results in 400 Bad Request responses.
urlparse.urljoin is not correct (or not modern) here. The URL Living Standard for browsers says:
If buffer is "..", remove url’s path’s last entry, if any, and then if c is neither "/" nor "\", append the empty string to url’s path.
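To make the browser behaviour concrete, here is a minimal sketch of that path-shortening rule applied to an already-joined URL. The helper name `normalize_dot_segments` is hypothetical and is not part of Scrapy or the standard library. It only handles absolute paths and "/" separators, so it is an illustration of the rule rather than a full URL Standard parser or a proposed fix.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_dot_segments(url):
    """Collapse "." and ".." path segments without climbing above the root.

    Hypothetical helper for illustration only; it mimics the rule quoted
    above for URLs whose path starts with "/".
    """
    parts = urlsplit(url)
    path = parts.path
    if not path.startswith("/"):
        # Empty or relative paths are left untouched in this sketch.
        return url

    segments = path[1:].split("/")
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == "..":
            # Remove the last entry, if any; at the root there is nothing
            # to remove, so the ".." is simply dropped.
            if output:
                output.pop()
            if is_last:
                output.append("")  # keep the trailing slash
        elif segment == ".":
            if is_last:
                output.append("")
        else:
            output.append(segment)

    new_path = "/" + "/".join(output)
    return urlunsplit(parts._replace(path=new_path))


print(normalize_dot_segments("http://scrapy.org/../download/"))
```

Applied to the URL from this report, the ".." that points above the root is discarded and the result matches what browsers request. Newer Python 3 versions of urllib.parse.urljoin follow RFC 3986 and may already resolve such links this way, so a post-processing step like this mainly matters where the older urlparse.urljoin behaviour is in play.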
About this issue
- State: open
- Created 9 years ago
- Comments: 15 (10 by maintainers)
The most evil spider ever: looks innocent, but doesn’t work for multiple reasons