scrapy: SSL website. `twisted.internet.error.ConnectionLost`

Hi everybody! I get this error on both operating systems. This HTTPS site can't be downloaded via Scrapy (Twisted). I searched this issue board but couldn't find a solution.

Both: Debian 9 / Mac OS

$ scrapy shell "https://wwwnet1.state.nj.us/"
2017-09-07 16:23:02 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-09-07 16:23:02 [scrapy.utils.log] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2017-09-07 16:23:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-09-07 16:23:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-07 16:23:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-07 16:23:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-09-07 16:23:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-09-07 16:23:03 [scrapy.core.engine] INFO: Spider opened
2017-09-07 16:23:03 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://wwwnet1.state.nj.us/> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-09-07 16:23:03 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://wwwnet1.state.nj.us/> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-09-07 16:23:04 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://wwwnet1.state.nj.us/> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
Traceback (most recent call last):
  File "scrapy", line 11, in <module>
    sys.exit(execute())
  File "/lib/python3.5/site-packages/scrapy/cmdline.py", line 149, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/lib/python3.5/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/lib/python3.5/site-packages/scrapy/cmdline.py", line 156, in _run_command
    cmd.run(args, opts)
  File "/lib/python3.5/site-packages/scrapy/commands/shell.py", line 73, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "/lib/python3.5/site-packages/scrapy/shell.py", line 48, in start
    self.fetch(url, spider, redirect=redirect)
  File "/lib/python3.5/site-packages/scrapy/shell.py", line 115, in fetch
    reactor, self._schedule, request, spider)
  File "/lib/python3.5/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "/lib/python3.5/site-packages/twisted/python/failure.py", line 385, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]

macOS:

$ scrapy version -v
Scrapy    : 1.4.0
lxml      : 3.8.0.0
libxml2   : 2.9.4
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.18.0
Twisted   : 17.9.0rc1
Python    : 3.5.1 (default, Jan 22 2016, 08:54:32) - [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)]
pyOpenSSL : 17.2.0 (OpenSSL 1.1.0f  25 May 2017)
Platform  : Darwin-16.7.0-x86_64-i386-64bit

Debian 9:

$ scrapy version -v
Scrapy    : 1.4.0
lxml      : 3.8.0.0
libxml2   : 2.9.3
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.18.0
Twisted   : 17.9.0rc1
Python    : 3.4.2 (default, Oct  8 2014, 10:45:20) - [GCC 4.9.1]
pyOpenSSL : 17.2.0 (OpenSSL 1.1.0f  25 May 2017)
Platform  : Linux-3.16.0-4-amd64-x86_64-with-debian-8.7

macOS:

$ openssl s_client -connect wwwnet1.state.nj.us:443 -servername wwwnet1.state.nj.us
CONNECTED(00000003)
140736760988680:error:140790E5:SSL routines:ssl23_write:ssl handshake failure:s23_lib.c:177:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 336 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : 0000
    Session-ID: 
    Session-ID-ctx: 
    Master-Key: 
    Key-Arg   : None
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1504790705
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
---

Debian 9:

CONNECTED(00000003)
---
Certificate chain
 0 s:/C=US/ST=New Jersey/L=Trenton/O=New Jersey State Government/OU=E-Gov Services - wwwnet1.state.nj.us/CN=wwwnet1.state.nj.us
   i:/C=US/O=Symantec Corporation/OU=Symantec Trust Network/CN=Symantec Class 3 Secure Server SHA256 SSL CA
---
Server certificate
-----BEGIN CERTIFICATE-----
<cut out>
-----END CERTIFICATE-----
<cut out>
---
No client certificate CA names sent
---
SSL handshake has read 1724 bytes and written 635 bytes
---
New, TLSv1/SSLv3, Cipher is DES-CBC3-SHA
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1
    Cipher    : DES-CBC3-SHA
    Session-ID: 930F00007F5944DC3C6010F96E95E7FA63656EF5EA35508B055078CEC249DC38
    Session-ID-ctx:
    Master-Key: 27B02D427F006A57B121CCEFEAA7F33B870DE262848BB6F851242F48F051ABB77BA4ED06706766EE8EE55F6643C9FF55
    Key-Arg   : None
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1504790821
    Timeout   : 300 (sec)
    Verify return code: 21 (unable to verify the first certificate)
---
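The two s_client runs above suggest the server only completes a handshake for an old protocol/cipher combination (TLS 1.0 with DES-CBC3-SHA on the Debian box; the macOS OpenSSL fails outright). As a quick sanity check independent of Scrapy, a small Python sketch (the helper name `probe` is mine, not from the thread) can test which protocol versions the local client stack will negotiate with a host:

```python
import socket
import ssl

def probe(host, protocol, port=443, timeout=5):
    """Attempt a TLS handshake using the given ssl.PROTOCOL_* constant.

    Returns True if the handshake succeeds, False otherwise.
    Certificate checks are disabled because this is purely a
    protocol-negotiation diagnostic.
    """
    ctx = ssl.SSLContext(protocol)
    ctx.check_hostname = False       # diagnostic only: skip hostname checks
    ctx.verify_mode = ssl.CERT_NONE  # diagnostic only: skip cert verification
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except (ssl.SSLError, OSError):
        return False

if __name__ == "__main__":
    # e.g. probe("wwwnet1.state.nj.us", ssl.PROTOCOL_TLSv1) vs. PROTOCOL_TLS_CLIENT
    for name in ("PROTOCOL_TLSv1", "PROTOCOL_TLS_CLIENT"):
        proto = getattr(ssl, name, None)
        if proto is not None:
            print(name, probe("wwwnet1.state.nj.us", proto))
```

Note that `ssl.PROTOCOL_TLSv1` is deprecated in recent Python versions and only works if the underlying OpenSSL still allows TLS 1.0; a modern build that has dropped TLS 1.0 support will fail here just like Scrapy does.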

Thank you for your time.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 18 (6 by maintainers)

Most upvoted comments

This worked for me:

  • force TLS 1.0
  • use cryptography<2 (e.g. 1.9 in my case, which links against OpenSSL 1.0.x rather than 1.1)
$ scrapy version -v
Scrapy    : 1.4.0
lxml      : 3.8.0.0
libxml2   : 2.9.3
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.18.0
Twisted   : 17.5.0
Python    : 3.6.2 (default, Aug 24 2017, 10:48:24) - [GCC 6.3.0 20170406]
pyOpenSSL : 17.2.0 (OpenSSL 1.0.2g  1 Mar 2016)


$ pip freeze
asn1crypto==0.22.0
attrs==17.2.0
Automat==0.6.0
cffi==1.10.0
constantly==15.1.0
cryptography==1.9
cssselect==1.0.1
hyperlink==17.3.1
idna==2.6
incremental==17.5.0
lxml==3.8.0
parsel==1.2.0
pyasn1==0.3.3
pyasn1-modules==0.1.1
pycparser==2.18
PyDispatcher==2.0.5
pyOpenSSL==17.2.0
queuelib==1.4.2
Scrapy==1.4.0
service-identity==17.0.0
six==1.10.0
Twisted==17.5.0
w3lib==1.18.0
zope.interface==4.4.2

$ scrapy shell "https://wwwnet1.state.nj.us/" -s DOWNLOADER_CLIENT_TLS_METHOD=TLSv1.0
2017-09-07 17:45:49 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-09-07 17:45:49 [scrapy.utils.log] INFO: Overridden settings: {'DOWNLOADER_CLIENT_TLS_METHOD': 'TLSv1.0', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2017-09-07 17:45:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2017-09-07 17:45:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-07 17:45:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-07 17:45:49 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-09-07 17:45:49 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-09-07 17:45:49 [scrapy.core.engine] INFO: Spider opened
2017-09-07 17:45:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://wwwnet1.state.nj.us/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f24fb802ac8>
[s]   item       {}
[s]   request    <GET https://wwwnet1.state.nj.us/>
[s]   response   <200 https://wwwnet1.state.nj.us/>
[s]   settings   <scrapy.settings.Settings object at 0x7f24f314d9e8>
[s]   spider     <DefaultSpider 'default' at 0x7f24f24ba7b8>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> 

Using OpenSSL 1.1.0f (with cryptography==2.0.3) did not work for me, even when forcing TLS 1.0.
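For a persistent fix rather than the one-off -s flag used in the shell session above, the same setting can go in the project's settings.py (a sketch; DOWNLOADER_CLIENT_TLS_METHOD is the real Scrapy setting shown working above):

```python
# settings.py
# Pin the client TLS method to TLS 1.0, matching
#   scrapy shell "https://wwwnet1.state.nj.us/" -s DOWNLOADER_CLIENT_TLS_METHOD=TLSv1.0
# Accepted values include 'TLS' (negotiate), 'TLSv1.0', 'TLSv1.1', 'TLSv1.2'.
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.0'
```

As noted above, this only helps if the OpenSSL that pyOpenSSL/cryptography links against still supports TLS 1.0; with an OpenSSL build that has removed it, the handshake fails regardless of this setting.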

I tried all the suggestions above but still didn’t manage to fix this problem. URL: https://www.diariooficial.feiradesantana.ba.gov.br/

scrapy==2.0.0
Twisted==20.3.0
pyOpenSSL==19.1.0

Any words of wisdom are much appreciated. 🙏

@anapaulagomes you have to use TLSv1.0 and the RC4-MD5 cipher. The following command should work from the scraper's environment:

curl -v --tlsv1.0 --ciphers RC4-MD5 https://www.diariooficial.feiradesantana.ba.gov.br/

You can get there by compiling OpenSSL with SSLv3 support.
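On the Scrapy side, the equivalent of those curl flags can be sketched in settings.py (DOWNLOADER_CLIENT_TLS_CIPHERS is a real setting available since Scrapy 1.8; whether the RC4-MD5 cipher is actually usable depends on how the local OpenSSL was built, since modern builds drop RC4 unless compiled with legacy-cipher support):

```python
# settings.py
# Mirror `curl -v --tlsv1.0 --ciphers RC4-MD5 ...` for Scrapy's downloader.
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.0'   # force the TLS 1.0 handshake
DOWNLOADER_CLIENT_TLS_CIPHERS = 'RC4-MD5'  # OpenSSL cipher-list string
```

If OpenSSL rejects the cipher string, Scrapy will raise an error at startup, which at least confirms the local build lacks RC4.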

I had the same issue; in my case the solution was to set USER_AGENT in the settings.py file:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'