evennia: [BUG] Portal memory leak
Describe the bug
User jumpscare (unclear what their github handle is) reported their game’s Portal leaks memory over time (total machine memory usage):
The server has 4GB memory, so this represents ~500MB rise over 14 days.
Here’s after a Server reload (Portal remained up):
This behavior is not seen with an idle server, so it seems to have somethig to do with the session store. The server is confirmed to run with DEBUG=False.
To Reproduce
Unclear how to reproduce. This appears like a memory leak.
Expected behavior
The portal should not leak memory.
Environment, Evennia version, OS etc
Linux, remote machine.
Additional context
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 19 (10 by maintainers)
Upon the investigation of _send_nop as the only remaining reference, I think this is the hardest conclusion we’ve had.
Without my patch, this happens:
What you are seeing here is LoopingCalls stacking up (this is the reference to bound method _send_nop).
With my patch to telnet.py:
There is only ever n LoopingCalls for n sessions.
So this is definitely another leak! Whether it is the leak, I am not sure but it seems likely since we were pretty sure it was Session related.
I also verified that after the LoopingCall is appropriately dispatched, there are 0 remaining references to an old session. In my next old_session the debugger was the only reference, and I yoinked it straight out of memory with ctypes so the fact that I could do that does not mean that Python had not marked said memory as freed.
The only way to be more sure would I guess be to check
is_finalizedin the GC, but I’m comfortable enough giving it a shot.This now is 2 memory leaks I have found in the course of this investigation! Which is also why I’m not 100% confident this will fix the leak because I’ve fixed one leak before. But it’s worth some testing now with a patch I believe.
After reading whitenoise’s extensive debug information from last night, I’ve rerun my make-200-connections script after adding this to the end of
TelnetProtocol.connectionLost:While the memory used by the server and portal processes does go up every now and then if I’m spamming the connections really fast, it’s no longer going up with every connection, and at a slower pace (1 per 5 seconds) it not only stabilized, it even gave a small amount of the working memory back to the OS.
There was some hypothesizing that it has something specifically to do with creating sessions, so I did a small test to see if that holds water.
I spun up my default test server (which has no scripts or anything running, just the default Limbo, the superuser account and character, and a few random Objects) and let it idle for about half an hour, where it sat at this memory usage the whole time:
(that’s about 82MB for the portal and 80MB for the server, if I read it correctly)
Then, after that half an hour, I ran a script to connect/authenticate/quit via telnet 200 times, which took a minute or two. The memory usage then looked like this:
(that’s an increase of 3MB for each of the server and the portal)
A reload of the server (not portal) reset the server back down to where it had started:
It seems like pretty solid evidence for the session handling being the issue. I noticed that the memory went up on the portal with every connection attempt by relatively even amounts, as well - I didn’t track the specific numbers, but it looked like around 12-20KB I think? Which in retrospect matches the memory increase perfectly - if it’s like 14KB per session which never gets released, and I created 200 sessions, that’s a little under 3MB.
***edit: Minor amendment, but the logs do verify that the sessions were supposedly being removed on quitting, rather than me incorrectly somehow multisessioning. There are 200 attempts worth of logs like this:
Thanks @InspectorCaracal! I think your finding here might confirm my fear/suspicion that we are in fact dealing with multiple leaks.
I think it’s time for a PR–there are several areas where I introduced weakrefs in order to pare us down to the point where the _send_nop was remaining. I think that’s enough work that it should be a PR and we can have a portion of success here. There is more memory work to go, but perhaps those can be their own issues down the road.
Thank you @Griatch I was able to get the debugger working and that was a huge time-saver.
I may have figured out the problem!
First some notes:
Ok, so tonight as I was starting a Telnet session started out having around 20 references to it. I got it down to 6 by cleaning up some code, and then used objgraph to give myself a nice visual of the remaining. Of those, 4 were bound methods, so no worries there. 1 was handshake_done()–another method that just passes self, so that shouldn’t be it. The last one, however, was the Telnet server itself from Twisted.
The graph:
Here is what I believe the problem is:
This is the TelnetProtocol class, yes? Yes.
It is given to a Twisted ServerFactory:
Said factory is set up here:
Normally this would not be an issue. EXCEPT: notice that the server keeps a reference to the protocol, and then notice this:
In Evennia, the session object and the protocol object are the same. I believe this will cause it to never be released as long as the server is running.
This matches with: