scrapy: Default downloader fails to get page
http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749
It looks like the default downloader, implemented with the Twisted library, cannot fetch the URL above. I ran `scrapy shell "http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749"` and got the following output:
```
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.17.0', 'scrapy')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 489, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 1207, in run_script
    execfile(script_filename, namespace, namespace)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/commands/shell.py", line 47, in run
    shell.start(url=url, spider=spider)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/shell.py", line 43, in start
    self.fetch(url, spider)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/shell.py", line 85, in fetch
    reactor, self._schedule, request, spider)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/threads.py", line 118, in blockingCallFromThread
    result.raiseException()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/python/failure.py", line 370, in raiseException
    raise self.type, self.value, self.tb
twisted.internet.error.ConnectionDone: Connection was closed cleanly.
```
However, both `urllib2.urlopen` and `requests.get` download the page without any problem.
About this issue
- State: open
- Created 11 years ago
- Reactions: 2
- Comments: 17 (8 by maintainers)
Seems like this last site sends some ASCII art with its headers, which makes Twisted choke on this line. There is no `b":"` in the received header, hence the `ValueError`. AFAICT, these are not RFC-compliant headers: "Each header field consists of a name followed by a colon (":") and the field value" (RFC 2616, section 4.2).
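To illustrate why a strict parser rejects such responses, here is a minimal sketch of colon-based header splitting (an assumption about the general approach, not Twisted's actual code): any line without a colon raises `ValueError`, exactly as RFC 2616 section 4.2 would suggest.

```python
def parse_header_line(line: bytes):
    """Strictly parse one HTTP header line into (name, value).

    Raises ValueError for non-RFC-compliant lines, e.g. ASCII art
    that a server sneaks in among its headers.
    """
    if b":" not in line:
        raise ValueError("invalid header line: %r" % line)
    name, value = line.split(b":", 1)
    return name.strip(), value.strip()

# A well-formed header parses fine:
print(parse_header_line(b"Content-Type: text/html"))

# An ASCII-art line has no colon and is rejected:
try:
    parse_header_line(br"   /\_/\   <- not a header")
except ValueError as exc:
    print("rejected:", exc)
```

This mirrors the failure mode described above: the transport-level error Scrapy surfaces (`ConnectionDone`) hides the underlying parse failure.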
I’ve written up a workaround here.
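As a rough illustration of one possible fix (an assumption about the general idea, not necessarily what the linked workaround does), a lenient parser can simply skip header lines that lack a colon instead of raising:

```python
def parse_headers_leniently(raw: bytes) -> dict:
    """Parse a raw HTTP header block, silently skipping lines
    that have no colon (e.g. ASCII art from a misbehaving server)."""
    headers = {}
    for line in raw.split(b"\r\n"):
        if b":" not in line:
            continue  # tolerate non-RFC-compliant lines
        name, value = line.split(b":", 1)
        headers[name.strip().lower()] = value.strip()
    return headers

raw = b"Content-Type: text/html\r\n  ( ^_^ ) hello\r\nServer: demo"
print(parse_headers_leniently(raw))
```

The trade-off is that genuinely corrupted responses are accepted too, so this belongs behind an opt-in rather than as default behavior.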
Same error with Twisted 21.7.0: `scrapy shell https://spotless.tech/`
I actually had a long list of URLs (around 15,000), and about 0.5% gave this error. When I re-ran only the ones that had failed, the error disappeared 😃
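For intermittent failures like this, leaning on Scrapy's built-in retry middleware may be enough. A settings sketch (assuming standard Scrapy setting names; `twisted.internet.error.ConnectionDone` is among the exceptions `RetryMiddleware` retries by default):

```python
# settings.py fragment (assumption: standard Scrapy RetryMiddleware settings).
# RetryMiddleware is enabled by default and already retries requests that
# fail with ConnectionDone, so this only tunes how persistent it is.
RETRY_ENABLED = True  # the default
RETRY_TIMES = 3       # retry each failing request up to 3 times
```

This would not help with the ASCII-art headers discussed above, since those fail deterministically on every attempt.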