scrapy: Default downloader fails to get page

http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749

It looks like the default downloader, which is implemented on top of the Twisted library, can't fetch the URL above. I ran 'scrapy shell http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749' and got the following output.

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.17.0', 'scrapy')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 489, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 1207, in run_script
    execfile(script_filename, namespace, namespace)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/commands/shell.py", line 47, in run
    shell.start(url=url, spider=spider)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/shell.py", line 43, in start
    self.fetch(url, spider)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/shell.py", line 85, in fetch
    reactor, self._schedule, request, spider)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/threads.py", line 118, in blockingCallFromThread
    result.raiseException()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/python/failure.py", line 370, in raiseException
    raise self.type, self.value, self.tb
twisted.internet.error.ConnectionDone: Connection was closed cleanly.

However, both urllib2.urlopen and requests.get can download the page without any problem.
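
For reference, that comparison can be reproduced with a short script; the script below is illustrative only (the URL is the one from the report, the rest is not part of the issue):

# Hypothetical reproduction script (not from the issue): the same URL that
# fails under Scrapy's Twisted-based downloader fetches fine with both clients.
import urllib2
import requests

url = 'http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749'

print(len(urllib2.urlopen(url).read()))   # plain urllib2 fetch
print(requests.get(url).status_code)      # requests fetch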

About this issue

  • Original URL
  • State: open
  • Created 11 years ago
  • Reactions: 2
  • Comments: 17 (8 by maintainers)

Most upvoted comments

Seems like this last site sends some ASCII art with its headers:

$ curl -I https://spotless.tech
HTTP/1.1 200 sP0tL3sS sP0tlLesS (╯°□°)╯︵ ┻━┻
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
░░░░░░░░░░▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄░░░░░░░░░
░░░░░░░░▄▀░░░░░░░░░░░░▄░░░░░░░▀▄░░░░░░░
░░░░░░░░█░░▄░░░░▄░░░░░░░░░░░░░░█░░░░░░░
░░░░░░░░█░░░░░░░░░░░░▄█▄▄░░▄░░░█░▄▄▄░░░
░▄▄▄▄▄░░█░░░░░░▀░░░░▀█░░▀▄░░░░░█▀▀░██░░
░██▄▀██▄█░░░▄░░░░░░░██░░░░▀▀▀▀▀░░░░██░░
░░▀██▄▀██░░░░░░░░▀░██▀░░░░░░░░░░░░░▀██░
░░░░▀████░▀░░░░▄░░░██░░░▄█░░░░▄░▄█░░██░
░░░░░░░▀█░░░░▄░░░░░██░░░░▄░░░▄░░▄░░░██░
░░░░░░░▄█▄░░░░░░░░░░░▀▄░░▀▀▀▀▀▀▀▀░░▄▀░░
░░░░░░█▀▀█████████▀▀▀▀████████████▀░░░░
░░░░░░████▀░░███▀░░░░░░▀███░░▀██▀░░░░░░
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Server: Sp0tw3b
Date: Tue, 26 Oct 2021 12:07:07 GMT
Content-Type: text/html
Content-Length: 33015
Connection: keep-alive
Last-Modified: Tuesday, 26-Oct-2021 12:07:07 GMT
Cache-Control: no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0
Accept-Ranges: bytes

which makes Twisted choke on this line. There is no b":" in the received header, hence the ValueError:

>>> a, b = b"foobar".split(b":", 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: not enough values to unpack (expected 2, got 1)

AFAICT, these are not RFC-compliant headers: "Each header field consists of a name followed by a colon (":") and the field value" (RFC 2616, section 4.2).

I’ve written up a workaround here.
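
The workaround itself is behind that link; purely as an illustration of the idea, a lenient parser could skip header lines that lack a colon instead of raising (this sketch is not the linked workaround and not Twisted's code):

# Illustrative sketch only: tolerate malformed header lines (no b":")
# rather than letting the split raise ValueError.
def parse_headers_leniently(raw_lines):
    headers = {}
    for line in raw_lines:
        if b":" not in line:
            continue  # e.g. an ASCII-art line; ignore instead of crashing
        name, value = line.split(b":", 1)
        headers[name.strip().lower()] = value.strip()
    return headers

print(parse_headers_leniently([b"Server: Sp0tw3b", b"(some ASCII art here)"]))
# -> {b'server': b'Sp0tw3b'}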

same error with Twisted 21.7.0: scrapy shell https://spotless.tech/

I actually had a long list of URLs (around 15 000), and about 0.5% of them gave this error. When I ran it again with only the ones that had failed, the error disappeared 😃
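
Since the failure looks intermittent, one hedged suggestion (not confirmed in this thread) is to lean on Scrapy's built-in retrying instead of re-running the whole crawl:

# settings.py -- illustrative only; these are standard Scrapy settings, but
# whether the retry middleware catches this particular failure is an assumption.
RETRY_ENABLED = True
RETRY_TIMES = 5        # default is 2; retry flaky URLs a few more times
DOWNLOAD_TIMEOUT = 60  # allow slow responses more time before giving up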