scrapy: Document that Mailsender.send() returns a Deferred

Hi, I’m new to Scrapy and I want to send some emails after the spider closes, but I’m getting errors. Does anyone know why? I’m using Python 2.7 and Scrapy 1.5.1. Here is my code:

import scrapy
from scrapy import signals
from scrapy.mail import MailSender
from scrapy.utils.project import get_project_settings

class AlertSpider(scrapy.Spider):
    name = "alert"
    start_urls = ['http://www.test.com']
    mails = []

    def parse(self, response):
        # Do some work
        pass

    @classmethod
    def from_crawler(cls, crawler):
        spider = cls()
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        settings = get_project_settings()
        mailer = MailSender.from_settings(settings)
        # first e-mail
        mailer.send(to=["xxxx@gmail.com"], subject='subject1', body='body1')
        # second e-mail
        return mailer.send(to=["xxxx@gmail.com"], subject='subject2', body='body2')

I want to send two e-mails after the spider closes, but I get the errors below. (Incidentally, there is no problem if I send just one e-mail.)

File "C:\Software\Python27\lib\site-packages\twisted\internet\selectreactor.py", line 149, in _doReadOrWrite
  why = getattr(selectable, method)()
File "C:\Software\Python27\lib\site-packages\twisted\internet\tcp.py", line 243, in doRead
  return self._dataReceived(data)
File "C:\Software\Python27\lib\site-packages\twisted\internet\tcp.py", line 249, in _dataReceived
  rval = self.protocol.dataReceived(data)
File "C:\Software\Python27\lib\site-packages\twisted\protocols\tls.py", line 330, in dataReceived
  self._flushReceiveBIO()
File "C:\Software\Python27\lib\site-packages\twisted\protocols\tls.py", line 300, in _flushReceiveBIO
  self._flushSendBIO()
File "C:\Software\Python27\lib\site-packages\twisted\protocols\tls.py", line 252, in _flushSendBIO
  bytes = self._tlsConnection.bio_read(2 ** 15)
exceptions.AttributeError: 'NoneType' object has no attribute 'bio_read'

It seems that Twisted doesn’t close the I/O, but I can’t find any close method in the MailSender class. Has anyone else hit this error?

About this issue

  • State: open
  • Created 6 years ago
  • Comments: 15 (1 by maintainers)

Most upvoted comments

I have an email pipeline that sends email during process_item and I get the same 'NoneType' object has no attribute 'bio_read' error, with something like:

def process_item(self, item, spider):
  if meets_criteria:
    mailer.send(...)
  return item

Changing the function to async and using await seems to solve it for me, since mailer.send returns a Deferred:

async def process_item(self, item, spider):
  if meets_criteria:
    await mailer.send(...)
  return item

Not sure if this is the right way to solve it, but it seems to work for me.
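For what it’s worth, here is a runnable sketch of that pattern. StubMailer is a stand-in for scrapy.mail.MailSender so the sketch runs without Scrapy or an SMTP server (awaiting Deferreds in process_item relies on Scrapy’s coroutine support, added in Scrapy 2.0); the names EmailPipeline and StubMailer are made up for illustration:

```python
import asyncio

class StubMailer:
    """Stand-in for MailSender: records mail instead of talking to SMTP."""
    def __init__(self):
        self.sent = []

    async def send(self, to, subject, body):
        self.sent.append((tuple(to), subject, body))
        return "OK"

class EmailPipeline:
    def __init__(self):
        self.mailer = StubMailer()

    async def process_item(self, item, spider):
        if item.get("alert"):
            # Awaiting the send keeps the pipeline (and so the reactor)
            # alive until the mail has actually been handed off.
            await self.mailer.send(
                to=["xxxx@example.com"], subject="alert", body=str(item))
        return item

pipeline = EmailPipeline()
item = asyncio.run(pipeline.process_item({"alert": True}, spider=None))
```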

Same problem here. I tried to send emails in the “close_spider” method of the pipeline class, because I have several spiders and I don’t want to duplicate the email-sending code several times. After I replaced “mailer.send(…)” with “return mailer.send(…)”, the problem disappeared.
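A minimal sketch of that fix, with FakeMailer and FakeDeferred as stand-ins (hypothetical names) so the shape is testable without Twisted or a mail server; in real code the mailer comes from MailSender.from_settings(settings):

```python
class FakeDeferred:
    """Tiny stand-in for twisted.internet.defer.Deferred."""
    def __init__(self, result):
        self.result = result

    def addCallback(self, fn):
        self.result = fn(self.result)
        return self

class FakeMailer:
    """Stand-in for MailSender; send() returns a Deferred-like object."""
    def send(self, to, subject, body):
        return FakeDeferred({"to": tuple(to), "subject": subject, "body": body})

class EmailPipeline:
    def close_spider(self, spider):
        mailer = FakeMailer()
        # Returning the Deferred (instead of dropping it) is what lets
        # Scrapy wait for the send before shutting the reactor down.
        return mailer.send(to=["xxxx@example.com"],
                           subject="spider finished", body="all done")

d = EmailPipeline().close_spider(spider=None)
```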

self._send_mail(body,subject).addCallback(lambda x: x)

test_spider.py

import scrapy
from scrapy import signals
from scrapy.mail import MailSender
from scrapy.utils.project import get_project_settings

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    mails = []

    def __init__(self, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        pass

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(QuotesSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self):
        settings = get_project_settings()
        mailer = MailSender.from_settings(settings)
        mailer.send(to=["XX@gmail.com"], subject='subject2', body='body2')

Hello, it looks like the problem lies in the use of Twisted’s Deferred class in Scrapy.

MailSender.send() returns a Twisted Deferred object (see line 106 of the scrapy.mail module) with callbacks _sent_ok and _sent_failed for success and failure respectively (line 102 of scrapy.mail).

Using MailSender.send() in spider_closed produces logs where the spider is closed and then the mail is sent, which looks like expected behaviour:

2019-06-02 19:54:08 [scrapy.core.engine] INFO: Spider closed (finished)
2019-06-02 19:54:10 [scrapy.mail] INFO: Mail sent OK: To=['XXXX@gmail.com'] Cc=[] Subject="subject2" Attachs=0

However, you get the error in the traceback:

  bytes = self._tlsConnection.bio_read(2 ** 15)
builtins.AttributeError: 'NoneType' object has no attribute 'bio_read'

My explanation of the error: as far as I understand, the end of the Scrapy crawl triggers the Twisted reactor/main loop shutdown and disconnectAll() while the _sent_ok or _sent_failed callback has not yet executed. The callback then tries to communicate over the lost TLS connection.

The error itself is the result of TLSMemoryBIOProtocol.connectionLost(), triggered by the end of the crawl, where the attribute _tlsConnection is assigned None (see line 407 of twisted.protocols.tls). The line self._tlsConnection = None was added to Twisted in March 2018 (see the pull request for reference: https://github.com/twisted/twisted/pull/955). Before that pull request, the same use of MailSender.send() produced no error.

As a workaround, and based on my very limited knowledge of Twisted’s Deferred class and Scrapy, I can propose the following: one way to guarantee that the Twisted reactor/main loop is not shut down before MailSender.send() has finished with its callbacks is to return the resulting Deferred instance. See the example:

def spider_closed(self):
    settings = get_project_settings()
    mailer = MailSender.from_settings(settings)
    return mailer.send(to=["XXXX@gmail.com"], subject='subject2', body='body2')

In this case the reactor/main loop shutdown will wait for the send to complete.

You can see it from logs:

2019-06-02 20:00:20 [scrapy.core.engine] INFO: Closing spider (finished) 2019-06-02 20:00:22 [scrapy.mail] INFO: Mail sent OK: To=[‘XXXX@gmail.com’] Cc=[] Subject=“subject2” Attachs=0 2019-06-02 20:00:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 2019-06-02 20:00:22 [scrapy.core.engine] INFO: Spider closed (finished)

My question to the Scrapy owners, @Gallaecio: can we consider the workaround a fix and change the documentation for MailSender.send()? Or can someone keep digging into the Twisted world and propose some more substantial adjustments for using Deferred in Scrapy?

My question to Scrapy owners, @Gallaecio, can we consider the workaround as a fix and change documentation for MailSender.send() ?

It’s not a workaround but the correct usage of this function, or of other functions that return a Deferred instead of waiting until the action is done. It does indeed make sense to mention in the docs that you are supposed to wait for the Deferred instead of just calling the function and assuming it’s synchronous.