scrapy: Broad crawl possible memory leak
Hi,
I was doing a broad crawl and noticed constantly increasing memory consumption for one spider. Pruning my spider down to its simplest form doesn’t help here (memory still increases constantly).
I also noticed that other spiders (with much lower crawl rates, `CONCURRENT_REQUESTS = 16`) don’t have this problem.
So I was wondering whether I’m misusing Scrapy or there is a real problem. A brief search of existing issues didn’t turn up anything, so I went ahead and created an experimental spider for testing: https://github.com/rampage644/experimental
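Roughly, the pruned spider boils down to something like this (a minimal sketch with made-up names and a placeholder seed URL; the real code is in the repo above):

```python
# Minimal broad-crawl spider sketch: just follows links, never yields items.
# The class name and seed URL are placeholders; see the linked repo for the real code.
import scrapy
from scrapy.linkextractors import LinkExtractor


class BroadSpider(scrapy.Spider):
    name = 'broad'
    start_urls = ['http://example.com']  # placeholder seed

    def parse(self, response):
        # Follow every extracted link; no items are produced at all,
        # yet RSS keeps growing over time.
        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)
```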
- First, I’d like to know whether anyone else has experienced memory problems with high-rate crawls, or memory problems in general.
- Second, I’d like to figure out why this simple spider leaks and whether we can do anything about it.
About this issue
- State: closed
- Created 8 years ago
- Comments: 23 (9 by maintainers)
@lopuhin, I’m using a `CONCURRENT_REQUESTS` setting of 100 and am able to get 1200 rpm with 1 unit. On startup `top` shows a 60M RSS size; within 30 minutes it grows up to 300M.
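For context, this is the kind of broad-crawl configuration involved (a sketch: only `CONCURRENT_REQUESTS = 100` is the value quoted above, the other lines are typical broad-crawl tweaks from the Scrapy docs rather than the exact settings in the experimental repo):

```python
# settings.py -- illustrative broad-crawl settings.
# Only CONCURRENT_REQUESTS matches the value quoted above; the rest are
# common broad-crawl recommendations, not the exact experimental config.
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 8
REACTOR_THREADPOOL_MAXSIZE = 20
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 15
LOG_LEVEL = 'INFO'
```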
I’ve done some more experiments to pin down what is causing the leak:

- `Request` objects are stuck somewhere (according to `prefs()` output and `live_refs` info). There is a pattern here: `pprint.pprint(map(lambda x: (x[0], time.time()-x[1]), sorted(rqs.items(), key=operator.itemgetter(1))))` prints the request objects sorted by their creation time (a cleaned-up version is sketched right after this list). Once `Request` objects start staying alive, a group of them with roughly the same age (>60s) shows up in the tracking dict. That can happen multiple times, i.e. multiple such groups appear.
- `twisted` objects stay in memory (maybe they are present in the other spiders too, but here there are far more of them).
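For reference, a self-contained version of that snippet that can be pasted into the Scrapy telnet console; here `rqs` is the `live_refs` entry for plain `Request` objects (subclasses such as `FormRequest` are tracked under their own keys):

```python
# Paste into the Scrapy telnet console (telnet localhost 6023), where
# prefs() is also available. Request inherits from
# scrapy.utils.trackref.object_ref, so every live instance is recorded
# in live_refs[Request] as {request: creation_time}.
import operator
import pprint
import time

from scrapy.http import Request
from scrapy.utils.trackref import live_refs

# Copy the weak dict so garbage collection can't mutate it mid-iteration.
rqs = dict(live_refs[Request])

# Print (request, age in seconds) pairs, oldest requests first.
pprint.pprint(
    [(req, time.time() - created)
     for req, created in sorted(rqs.items(), key=operator.itemgetter(1))]
)
```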
I’m going to try @kmike’s advice regarding the `tracemalloc` module.
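Probably something along these lines (a rough sketch assuming Python 3’s stdlib `tracemalloc`; where exactly to hook the snapshots into the crawl, e.g. a spider signal or an extension, is still open):

```python
# Rough sketch: compare tracemalloc snapshots taken while the spider runs
# to see which allocation sites keep growing.
import tracemalloc

tracemalloc.start(25)          # keep up to 25 frames per allocation traceback
baseline = tracemalloc.take_snapshot()

# ... let the spider crawl for a while ...

current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, 'lineno')[:20]:
    print(stat)                # top 20 allocation sites by memory growth
```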