scrapy: Logging can't format stack trace with non-ascii chars on Python 2

Hi,

I experience the same issue as described in #1602. However, I’m not using Django.

The stats look like this:

{'downloader/request_bytes': 47621,
 'downloader/request_count': 103,
 'downloader/request_method_count/GET': 103,
 'downloader/response_bytes': 1162618,
 'downloader/response_count': 103,
 'downloader/response_status_count/200': 101,
 'downloader/response_status_count/302': 2,
 'dupefilter/filtered': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 9, 2, 3, 15, 748633),
 'httpcache/firsthand': 72,
 'httpcache/hit': 31,
 'httpcache/miss': 72,
 'httpcache/store': 72,
 'item_scraped_count': 48,
 'log_count/DEBUG': 215,
 'log_count/ERROR': 1,
 'log_count/INFO': 9,
 'memusage/max': 121434112,
 'memusage/startup': 69783552,
 'mongodb/item_stored_count': 48,
 'request_depth_max': 3,
 'response_received_count': 101,
 'scheduler/dequeued': 102,
 'scheduler/dequeued/memory': 102,
 'scheduler/enqueued': 102,
 'scheduler/enqueued/memory': 102,
 'spider_exceptions/AttributeError': 1,
 'start_time': datetime.datetime(2018, 3, 9, 2, 0, 52, 510449)}

But there’s no mention of the AttributeError (or any other error) in the log.

I’m executing the spiders on Scrapyd instances running in Docker containers. These are the first two lines of the log, showing the components’ versions:

2018-03-09 02:00:52 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: realestate)
2018-03-09 02:00:52 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.3.1, w3lib 1.18.0, Twisted 17.9.0, Python 2.7.12 (default, Nov 20 2017, 18:23:56) - [GCC 5.4.0 20160609], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g  2 Nov 2017), cryptography 2.1.4, Platform Linux-4.4.0-103-generic-x86_64-with-Ubuntu-16.04-xenial

There’s nothing special in the settings.py, but I can provide it if needed.

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 20 (9 by maintainers)

Most upvoted comments

Thanks for sharing the code that causes the issue, @tlinhart.

I managed to reproduce it with the following minimal spider:

# -*- coding: utf-8 -*-
import scrapy


class Spider(scrapy.Spider):
    name = 'bug3161'
    start_urls = ["https://scrapy.org"]

    def parse(self, response):
        u'příprav' in None.lower()
$ scrapy runspider myspider.py --logfile foo.log
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/logging/__init__.py", line 882, in emit
    stream.write(fs % msg.encode("UTF-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 404: ordinal not in range(128)
Logged from file scraper.py, line 158
$ scrapy runspider myspider.py
2018-03-16 15:54:21 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-03-16 15:54:21 [scrapy.utils.log] INFO: Versions: lxml 4.2.0.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.10 (default, Jul 15 2017, 17:16:57) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g  2 Nov 2017), cryptography 2.1.4, Platform Darwin-17.4.0-x86_64-i386-64bit
2018-03-16 15:54:21 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2018-03-16 15:54:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-03-16 15:54:21 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-16 15:54:21 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-16 15:54:21 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-16 15:54:21 [scrapy.core.engine] INFO: Spider opened
2018-03-16 15:54:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-16 15:54:21 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-16 15:54:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapy.org> (referer: None)
2018-03-16 15:54:21 [scrapy.core.scraper] ERROR: Spider error processing <GET https://scrapy.org> (referer: None)
Traceback (most recent call last):
  File "/Users/daniel/envs/test-scrapy-bug-JRmRrwNr/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/daniel/myspider.py", line 12, in parse
    u'příprav' in None.lower()
AttributeError: 'NoneType' object has no attribute 'lower'
2018-03-16 15:54:21 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-16 15:54:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 209,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 14583,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 16, 18, 54, 21, 734389),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'memusage/max': 48529408,
 'memusage/startup': 48529408,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/AttributeError': 1,
 'start_time': datetime.datetime(2018, 3, 16, 18, 54, 21, 180960)}
2018-03-16 15:54:21 [scrapy.core.engine] INFO: Spider closed (finished)

@ayushmankoul if you want to help, the above is a starting point. Thanks!

Hey @cathalgarvey @kmike, I am interested in solving this bug. Please help me fix it; any guidance would surely be helpful to me. Thank you!

@cathalgarvey I understand your point, of course, don’t take my upper note too seriously 😃

Hi @ayushmankoul, I only had to add # -*- coding: utf-8 -*- at the top of your example; other than that, it fixed the issue.

As a note, it is only reproducible on Python 2.

It works fine under Python 3.6.4

~$ scrapy runspider myspider.py --logfile py3.log
~$ cat py3.log
2018-03-16 19:17:24 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-03-16 19:17:24 [scrapy.utils.log] INFO: Versions: lxml 4.2.0.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.4 (default, Mar  9 2018, 23:15:03) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g  2 Nov 2017), cryptography 2.1.4, Platform Darwin-17.4.0-x86_64-i386-64bit
2018-03-16 19:17:24 [scrapy.crawler] INFO: Overridden settings: {'LOG_FILE': 'py3.log', 'SPIDER_LOADER_WARN_ONLY': True}
2018-03-16 19:17:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2018-03-16 19:17:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-16 19:17:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-16 19:17:24 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-16 19:17:24 [scrapy.core.engine] INFO: Spider opened
2018-03-16 19:17:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-16 19:17:24 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-16 19:17:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapy.org> (referer: None)
2018-03-16 19:17:24 [scrapy.core.scraper] ERROR: Spider error processing <GET https://scrapy.org> (referer: None)
Traceback (most recent call last):
  File "/Users/daniel/envs/scrapy3/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/daniel/myspider.py", line 12, in parse
    u'příprav' in None.lower()
AttributeError: 'NoneType' object has no attribute 'lower'
2018-03-16 19:17:24 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-16 19:17:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 209,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 14584,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 16, 22, 17, 24, 711239),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'memusage/max': 49725440,
 'memusage/startup': 49725440,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/AttributeError': 1,
 'start_time': datetime.datetime(2018, 3, 16, 22, 17, 24, 196146)}
2018-03-16 19:17:24 [scrapy.core.engine] INFO: Spider closed (finished)

The problem is clearly that the non-ASCII character in the stack trace can’t be formatted. It may be a bug in Twisted or in the Python logger.
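For illustration, here is a minimal sketch of that failure mode (not Scrapy’s actual code path). On Python 2, stream.write(fs % msg.encode("UTF-8")) mixes UTF-8-encoded bytes into a unicode format string, which triggers an implicit ASCII decode of those bytes; the equivalent explicit decode fails the same way as the logging traceback above:

```python
# u'příprav' encoded as UTF-8 contains the byte 0xc5 (first byte of 'ř'),
# which is exactly the byte the logging traceback complains about.
msg = u'příprav'.encode('utf-8')  # b'p\xc5\x99\xc3\xadprav'

try:
    # Equivalent to the implicit decode Python 2 performs when mixing
    # these bytes into a unicode string.
    msg.decode('ascii')
except UnicodeDecodeError as e:
    # e.object[e.start] is the offending byte: 0xc5
    print(hex(e.object[e.start]))
```

This also explains why Python 3 is unaffected: log records are unicode end to end, so no implicit ASCII decode ever happens.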

The relevant part of Scrapy that triggers the bug is at https://github.com/scrapy/scrapy/blob/6cc6bbb5fc5c102271829a554772effb0444023c/scrapy/core/scraper.py#L154-L159

where failure_to_exc_info() is quite simplistic: https://github.com/scrapy/scrapy/blob/6cc6bbb5fc5c102271829a554772effb0444023c/scrapy/utils/log.py#L20-L23
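For context, that helper just turns a Twisted Failure into a (type, value, traceback) tuple that logger.error() accepts via exc_info. A rough standalone sketch, with no Twisted involved (exc_info_from_exception is a hypothetical stand-in built from a plain exception), shows how that tuple drives traceback formatting inside the handler, which is where the Python 2 encoding error occurs:

```python
import io
import logging

# Hypothetical stand-in for failure_to_exc_info(): build an exc_info
# tuple (type, value, traceback) from an exception instance.
def exc_info_from_exception(exc):
    return (type(exc), exc, exc.__traceback__)

# Log to an in-memory stream so the formatted output can be inspected.
stream = io.StringIO()
logger = logging.getLogger('bug3161-demo')
logger.addHandler(logging.StreamHandler(stream))

try:
    u'příprav' in None.lower()  # same failing line as the minimal spider
except AttributeError as exc:
    # The handler formats the traceback text and writes it to the stream;
    # on Python 2 this write is where the UnicodeDecodeError was raised.
    logger.error('Spider error', exc_info=exc_info_from_exception(exc))

print('AttributeError' in stream.getvalue())  # True on Python 3
```

The formatted record contains both the non-ASCII source line and the AttributeError, so any byte/unicode mixing in the handler’s write path surfaces exactly here.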