evennia: [BUG] Portal memory leak

Describe the bug

User jumpscare (unclear what their GitHub handle is) reported that their game’s Portal leaks memory over time (total machine memory usage): [screenshot of memory-usage graph]

The server has 4GB of memory, so this represents a ~500MB rise over 14 days.

Here’s after a Server reload (Portal remained up):

[screenshot after Server reload]

This behavior is not seen with an idle server, so it seems to have something to do with the session store. The server is confirmed to run with DEBUG=False.

To Reproduce

Unclear how to reproduce; it appears to be a memory leak.

Expected behavior

The portal should not leak memory.

Environment, Evennia version, OS etc

Linux, remote machine.

Additional context

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 19 (10 by maintainers)

Most upvoted comments

After investigating _send_nop as the only remaining reference, I think this is the most solid conclusion we’ve reached so far.

Without my patch, this happens:

> /home/sheridan/code/evennia/evenv/src/evennia/evennia/server/portal/telnet.py(153)connectionMade()->None

(Pdb) import twisted
(Pdb) from pympler import muppy
(Pdb) all_objs = muppy.get_objects()
(Pdb) looping_calls = muppy.filter(all_objs, Type=twisted.internet.task.LoopingCall)
(Pdb) looping_calls
[LoopingCall<30>(TelnetProtocol._send_nop_keepalive, *(), **{}), LoopingCall<60>(_portal_maintenance, *(), **{})]
(Pdb) c

--Return--
> /home/sheridan/code/evennia/evenv/src/evennia/evennia/server/portal/telnet.py(153)connectionMade()->None
-> set_trace()
(Pdb) import twisted
(Pdb) from pympler import muppy
(Pdb) all_objs = muppy.get_objects()
(Pdb) looping_calls = muppy.filter(all_objs, Type=twisted.internet.task.LoopingCall)
(Pdb) looping_calls
[LoopingCall<60>(_portal_maintenance, *(), **{}), LoopingCall<30>(TelnetProtocol._send_nop_keepalive, *(), **{}), LoopingCall<30>(TelnetProtocol._send_nop_keepalive, *(), **{})]
(Pdb) import gc
(Pdb) gc.collect
<built-in function collect>
(Pdb) gc.collect()
0
(Pdb)

What you are seeing here is LoopingCalls stacking up (this is the reference to the bound method _send_nop).
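
To make the mechanism concrete, here is a minimal sketch, not Evennia’s actual code, of why a running LoopingCall pins its target: the reactor keeps the scheduled call alive, the LoopingCall keeps the bound method, and the bound method keeps the protocol/session instance. The class is an illustrative stand-in; nop_keep_alive and _send_nop_keepalive borrow the names seen in the patch and the pdb output.

from twisted.internet import task


class FakeTelnetProtocol:
    """Illustrative stand-in for the Portal's telnet protocol/session."""

    nop_keep_alive = None

    def connectionMade(self):
        # LoopingCall(self._send_nop_keepalive) stores the bound method, i.e. a
        # strong reference back to this instance; while the call is running,
        # the reactor keeps the LoopingCall alive, so the instance can never
        # be garbage-collected.
        self.nop_keep_alive = task.LoopingCall(self._send_nop_keepalive)
        self.nop_keep_alive.start(30, now=False)

    def _send_nop_keepalive(self):
        pass  # the real protocol would send IAC NOP here

    def connectionLost(self, reason=None):
        # Without this stop(), the reactor keeps rescheduling the call and
        # dead sessions stack up exactly as in the pdb output above.
        if self.nop_keep_alive and self.nop_keep_alive.running:
            self.nop_keep_alive.stop()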

With my patch to telnet.py:

    def connectionLost(self, reason):
        """
        this is executed when the connection is lost for whatever
        reason. it can also be called directly, from the disconnect
        method

        Args:
            reason (str): Motivation for losing connection.

        """
        self.sessionhandler.disconnect(self)
        if self.nop_keep_alive and self.nop_keep_alive.running:
            self.toggle_nop_keepalive()
        self.transport.loseConnection()

There are only ever n LoopingCalls for n sessions.

So this is definitely another leak! Whether it is the leak, I am not sure, but it seems likely, since we were pretty sure it was Session-related.

I also verified that after the LoopingCall is appropriately stopped, there are 0 remaining references to an old session. In my latest check of old_session the debugger was the only reference, and I yoinked it straight out of memory with ctypes, so the fact that I could still reach it does not mean that Python had not already marked that memory as freed.

The only way to be more certain would, I guess, be to check is_finalized in the GC, but I’m comfortable enough to give this a shot.
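
For anyone who wants to repeat that check, here is a rough sketch of the idea (old_session stands for whatever session object you grabbed in the debugger; it assumes the session type allows weak references, which normal Python classes do):

import gc
import weakref

# Hold only a weak reference so the check itself doesn't keep the session alive.
ref = weakref.ref(old_session)

# Anything still pointing at the session shows up here; expect the debugger's
# own frame to appear while inspecting interactively.
for referrer in gc.get_referrers(old_session):
    print(type(referrer), repr(referrer)[:120])

del old_session
gc.collect()
print("session collected:", ref() is None)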

That makes 2 memory leaks I have found in the course of this investigation! Which is also why I’m not 100% confident this will fix the leak, since I’ve already fixed one leak before. But it’s worth some testing now with a patch, I believe.

After reading whitenoise’s extensive debug information from last night, I’ve rerun my make-200-connections script after adding this to the end of TelnetProtocol.connectionLost:

        if self.nop_keep_alive and self.nop_keep_alive.running:
            self.nop_keep_alive.stop()

While the memory used by the Server and Portal processes does still go up every now and then if I spam connections really fast, it no longer goes up with every connection, and at a slower pace (one connection per 5 seconds) it not only stabilized but even gave a small amount of working memory back to the OS.
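
A script along those lines could look like the following sketch; the host, port, credentials, counts and timing are placeholders, not taken from the actual make-200-connections script:

import socket
import time

HOST, PORT = "127.0.0.1", 4000            # placeholder telnet address
USERNAME, PASSWORD = "admin", "password"  # placeholder credentials

for i in range(200):
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        sock.recv(4096)  # swallow the banner / initial telnet negotiation
        sock.sendall(f"connect {USERNAME} {PASSWORD}\n".encode())
        time.sleep(0.2)
        sock.recv(4096)
        sock.sendall(b"quit\n")
        time.sleep(0.2)
    if i % 20 == 0:
        print(f"{i} connections done")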

There was some hypothesizing that it has something specifically to do with creating sessions, so I did a small test to see if that holds water.

I spun up my default test server (which has no scripts or anything running, just the default Limbo, the superuser account and character, and a few random Objects) and let it idle for about half an hour, where it sat at the same memory usage the whole time: [screenshot] (that’s about 82MB for the portal and 80MB for the server, if I read it correctly)

Then, after that half an hour, I ran a script to connect/authenticate/quit via telnet 200 times, which took a minute or two. The memory usage then looked like this: [screenshot] (that’s an increase of 3MB for each of the server and the portal)

A reload of the server (not portal) reset the server back down to where it had started: [screenshot]

It seems like pretty solid evidence that the session handling is the issue. I also noticed that the memory went up on the portal by relatively even amounts with every connection attempt - I didn’t track the specific numbers, but it looked like around 12-20KB? Which in retrospect matches the memory increase perfectly: if it’s about 14KB per session which never gets released, and I created 200 sessions, that’s a little under 3MB.

edit: Minor amendment, but the logs do verify that the sessions were supposedly being removed on quitting, rather than me somehow incorrectly multisessioning. There are 200 attempts’ worth of logs like this:

2023-08-07 15:02:33 [SS] Authentication Success: admin(account 1) (IP: 127.0.0.1).
2023-08-07 15:02:33 [..] Logged in: admin(account 1) 127.0.0.1 (1 session(s) total)
2023-08-07 15:02:33 [..] Logged out: admin(account 1) 127.0.0.1 (0 sessions(s) remaining) (quit)
2023-08-07 15:02:34 [SS] Authentication Success: admin(account 1) (IP: 127.0.0.1).
2023-08-07 15:02:34 [..] Logged in: admin(account 1) 127.0.0.1 (1 session(s) total)
2023-08-07 15:02:34 [..] Logged out: admin(account 1) 127.0.0.1 (0 sessions(s) remaining) (quit)

Thanks @InspectorCaracal! I think your finding here might confirm my fear/suspicion that we are in fact dealing with multiple leaks.

I think it’s time for a PR. There are several areas where I introduced weakrefs in order to pare things down to the point where _send_nop was the only remaining reference. I think that’s enough work that it should be a PR, and we can bank a portion of the success here. There is more memory work to go, but perhaps those can be their own issues down the road.

Thank you @Griatch, I was able to get the debugger working and that was a huge time-saver.

I may have figured out the problem!

First some notes:

  • Posting OS memory numbers is actually not meaningful in this scenario, because Python’s memory manager doesn’t necessarily give memory back to the OS right away even after marking it as free. You have to get the information from Python’s memory manager directly; tracemalloc, going directly into the frames, ctypes, etc., proved more useful in figuring out what was going on (see the tracemalloc sketch after this list).
  • Similarly, OS-level allocation cannot be trusted as Python’s memory manager is doing its own thing with arenas and such.
  • objgraph is amazing
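
As a concrete example of asking Python’s allocator instead of the OS, here is a minimal tracemalloc sketch. It would have to run inside the Portal process itself (for instance from a pdb session or a maintenance hook), and nothing in it is taken from the actual debugging session:

import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation

before = tracemalloc.take_snapshot()
# ... drive the connect/authenticate/quit loop against the Portal here ...
after = tracemalloc.take_snapshot()

# The biggest growers between the two snapshots, grouped by source line.
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)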

Ok, so tonight the Telnet session I was inspecting started out having around 20 references to it. I got it down to 6 by cleaning up some code, and then used objgraph to give myself a nice visual of the remaining ones. Of those, 4 were bound methods, so no worries there. One was handshake_done(), another method that just passes self, so that shouldn’t be it. The last one, however, was the Telnet server itself from Twisted.

The graph: backref-graph-sessid-1 (objgraph backreference image)
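
For reference, a graph like that can be produced with objgraph along these lines (the session variable and filename are illustrative, and rendering the .png requires graphviz to be installed):

import objgraph

objgraph.show_backrefs(
    [session],                # the Portal session/protocol under inspection
    max_depth=4,              # how far back to walk the referrers
    filename="backref-graph-sessid-1.png",
)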

Here is what I believe the problem is:

TELNET_PROTOCOL_CLASS = "evennia.server.portal.telnet.TelnetProtocol"

This is the TelnetProtocol class, yes? Yes.

It is given to a Twisted ServerFactory:

    @classmethod
    def forProtocol(cls, protocol, *args, **kwargs):
        """
        Create a factory for the given protocol.

        It sets the C{protocol} attribute and returns the constructed factory
        instance.

        @param protocol: A L{Protocol} subclass

        @param args: Positional arguments for the factory.

        @param kwargs: Keyword arguments for the factory.

        @return: A L{Factory} instance wired up to C{protocol}.
        """
        factory = cls(*args, **kwargs)
        factory.protocol = protocol
        return factory
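
For context, the per-connection instances then come from the factory’s buildProtocol, which in Twisted’s default Factory is roughly the following (paraphrased, so treat it as a sketch): every new connection gets an instance of the class stored in factory.protocol, and that instance gets a back-reference to the factory.

    def buildProtocol(self, addr):
        # Called once per incoming connection: instantiate the class stored
        # in self.protocol and give the instance a reference to this factory.
        p = self.protocol()
        p.factory = self
        return p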

Said factory is set up here:

if TELNET_ENABLED:

    # Start telnet game connections

    from evennia.server.portal import telnet

    _telnet_protocol = class_from_module(settings.TELNET_PROTOCOL_CLASS)

    for interface in TELNET_INTERFACES:
        ifacestr = ""
        if interface not in ("0.0.0.0", "::") or len(TELNET_INTERFACES) > 1:
            ifacestr = "-%s" % interface
        for port in TELNET_PORTS:
            pstring = "%s:%s" % (ifacestr, port)
            factory = telnet.TelnetServerFactory()
            factory.noisy = False
            factory.protocol = _telnet_protocol
            factory.sessionhandler = PORTAL_SESSIONS
            telnet_service = internet.TCPServer(port, factory, interface=interface)
            telnet_service.setName("EvenniaTelnet%s" % pstring)
            PORTAL.services.addService(telnet_service)

            INFO_DICT["telnet"].append("telnet%s: %s" % (ifacestr, port))

Normally this would not be an issue. EXCEPT: notice that the server keeps a reference to the protocol, and then notice this:

class TelnetProtocol(Telnet, StatefulTelnetProtocol, _BASE_SESSION_CLASS):

In Evennia, the session object and the protocol object are the same. I believe this will cause it to never be released as long as the server is running.

This matches with:

  • The fact that this only happens on the Portal; this infrastructure does not exist on the Server side, to my knowledge.
  • The size of a session at the end of the Portal’s connect() is right at 33 KB, which is roughly the size of the leak we are seeing.
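
One way to test this hypothesis directly would be to ask objgraph for a reference chain from a module-level root down to a session that should already be gone; if a chain exists, whatever sits in the middle of it is what keeps the session alive. This is only a sketch: dead_session stands for a session grabbed in the debugger after its client has quit, and the filename is illustrative.

import gc

import objgraph

gc.collect()  # clear anything that is merely waiting for a collection pass

# Walk referrers from the supposedly-dead session up to a module-level root.
chain = objgraph.find_backref_chain(dead_session, objgraph.is_proper_module)
objgraph.show_chain(chain, filename="session-backref-chain.png")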