scrapy: Broad crawl possible memory leak

Hi,

I was running a broad crawl and noticed that a spider’s memory consumption keeps increasing. Stripping the spider down to its simplest form doesn’t help (memory still grows steadily). I also noticed that other spiders (with much lower crawl rates, CONCURRENT_REQUESTS = 16) don’t have this problem.

So I was wondering whether I’m misusing Scrapy or there is a real problem. A brief search of the issues didn’t turn up anything, so I went ahead and created an experimental spider for tests: https://github.com/rampage644/experimental

First, I’d like to know whether anyone has experienced memory problems with a high-rate crawl, or any other memory problem.
Second, I’d like to figure out why this simple spider leaks and whether we can do anything about it.

About this issue

State: closed
Created 8 years ago
Comments: 23 (9 by maintainers)

Most upvoted comments

@lopuhin, I’m using CONCURRENT_REQUESTS = 100 and am able to get 1200 rpm with 1 unit. On startup, top shows an RSS of 60M; within 30 minutes it grows to 300M.

rampage644 on Jun 24, 2016

I’ve done some more experiments to pin down what is causing the leak:

First, I removed all links that caused error messages in the log (because of downloader errors). The good news is that the peak memory footprint dropped from 440MB to 300MB (according to stats). The bad news is that the growth is still there. (The number of error entries in the log dropped from 20k to 2k.)
Second, a while ago I noticed that Request objects sometimes get stuck somewhere (according to prefs() output and live_refs info). There is a pattern here. pprint.pprint(map(lambda x: (x[0], time.time()-x[1]), sorted(rqs.items(), key=operator.itemgetter(1)))) prints Request objects sorted by their creation time. Once Request objects start staying alive, a group of them with nearly the same age (>60s) appears in the tracking dict. That can happen multiple times, i.e. there can be multiple groups.
Third finding: when I used guppy on a focused-crawl spider, it showed nothing interesting: str, tuple, dict. But here I get a bunch of Twisted objects staying in memory (maybe they are present in other spiders too, but here there are many more of them):

>>> hpy.heap()
Partition of a set of 1286140 objects. Total size = 176360928 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 456074  35 29027064  16  29027064  16 str
     1 261200  20 20719536  12  49746600  28 tuple
     2 205542  16 16970096  10  66716696  38 list
     3  17046   1 16640016   9  83356712  47 dict (no owner)
     4  14746   1 15453808   9  98810520  56 dict of twisted.internet.base.DelayedCall
     5   8403   1  8277960   5 107088480  61 dict of twisted.internet.defer.Deferred
     6   7685   1  8053880   5 115142360  65 dict of twisted.internet.tcp.Client
     7   7685   1  8053880   5 123196240  70 dict of twisted.internet.tcp.Connector
     8   7677   1  8045496   5 131241736  74 dict of twisted.web._newclient.HTTP11ClientProtocol
     9  51931   4  4154480   2 135396216  77 types.MethodType
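
For anyone who wants to repeat the check from the second finding, here is a minimal sketch of the same idea as the one-liner, written so it also runs on Python 3 (where map() returns an iterator and would not print the pairs). It assumes `refs` is any mapping of object to creation timestamp; in Scrapy that would presumably be scrapy.utils.trackref.live_refs[Request], but the code below does not import Scrapy. The helper names ages_sorted and group_by_creation, and the 1-second grouping window, are hypothetical choices for illustration.

```python
import time
from operator import itemgetter


def ages_sorted(refs, now=None):
    """Return (obj, age_in_seconds) pairs, oldest object first.

    `refs` is assumed to map each tracked object to its creation time
    (seconds since the epoch), as the tracking dict in the thread does.
    """
    now = time.time() if now is None else now
    by_creation = sorted(refs.items(), key=itemgetter(1))
    return [(obj, now - created) for obj, created in by_creation]


def group_by_creation(refs, window=1.0):
    """Cluster objects whose creation times are at most `window` seconds
    after the previous object's creation time (window is a guess)."""
    groups = []
    last = None
    for obj, created in sorted(refs.items(), key=itemgetter(1)):
        if last is not None and created - last <= window:
            groups[-1].append(obj)
        else:
            groups.append([obj])
        last = created
    return groups
```

Calling ages_sorted(refs) periodically and looking at the entries older than about 60 seconds, then passing the same mapping to group_by_creation, should make the "groups of stuck requests" pattern visible as a few large clusters.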

I’m going to try @kmike’s advice regarding the tracemalloc module.
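
A rough sketch of what a tracemalloc check could look like, assuming Python 3.4+ where tracemalloc is part of the standard library. It compares two snapshots and reports the source lines whose allocations grew the most. The helper name measure_growth and the `workload` callable are hypothetical; in a real crawl the two snapshots would more likely be taken some minutes apart while the spider is running, rather than around a single function call.

```python
import tracemalloc


def measure_growth(workload, limit=10, nframes=10):
    """Run `workload()` and return (top_stats, result).

    top_stats holds the `limit` largest allocation differences, grouped
    by source line, between a snapshot taken before the call and one
    taken after it. `result` is returned so the caller keeps it alive.
    """
    tracemalloc.start(nframes)
    try:
        before = tracemalloc.take_snapshot()
        result = workload()  # still referenced when the second snapshot is taken
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to() returns StatisticDiff objects, biggest change first
    return after.compare_to(before, "lineno")[:limit], result


def simulated_leak():
    # Stand-in for the crawl: roughly 1 MB that stays referenced
    return [bytes(1000) for _ in range(1000)]
```

Printing each entry of the returned list shows the file, line number, and size difference, which should point at the code that allocated the objects that are still alive.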

rampage644 on Jun 30, 2016