evennia: [BUG] Corrupt log locks server

Describe the bug

This is related to the old bug where the AMP sends garbage data in a loop between the portal and server. It seems when this information is dumped to the server.log, it locks down the entire process: the server can no longer be gracefully shut down, so it requires a kill -9 to stop every time. Interestingly, even trying to tail the log can freeze up a shell sometimes, at least when ssh’d in on MacOS. Using file on the log says it’s now a binary rather than text file.

Example output:

0-02-10 08:52:16-05 [::]     return Int16StringReceiver.dataReceived(self, data)
20-02-10 08:52:16-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/basic.py", line 766, in dataReceived
20-02-10 08:52:16-05 [::]     self.stringReceived(packet)
20-02-10 08:52:16-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/basic.py", line 878, in stringReceived
20-02-10 08:52:16-05 [::]     self.state = statehandler(string)
20-02-10 08:52:16-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/amp.py", line 2459, in proto_key
20-02-10 08:52:16-05 [::]     self.boxReceiver.ampBoxReceived(self._currentBox)
20-02-10 08:52:16-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/amp.py", line 1058, in ampBoxReceived
20-02-10 08:52:16-05 [::]     self._answerReceived(box)
20-02-10 08:52:16-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/amp.py", line 984, in _answerReceived
20-02-10 08:52:16-05 [::]     question = self._outstandingRequests.pop(box[ANSWER])
20-02-10 08:52:16-05 [::] KeyError: b'5825'
20-02-10 08:52:16-05 [EE] Discarded incoming partial data: _ask3b3_commandMsgPortal2Server
20-02-10 08:52:16-05 [EE] packed_dataFxÚk`�jÁ¾�BµSzXJR+J¦ÄNÑ�²3SrR§$ÖNÑèaMÎMÉL�âÍÖÃ�_P���W<¥vJijqÛ=²Yc
20-02-10 08:52:17-05 [::] Traceback (most recent call last):
20-02-10 08:52:17-05 [::]   File "/home/tehom/arx/evennia/evennia/server/portal/amp.py", line 315, in dataReceived
20-02-10 08:52:17-05 [::]     super(AMPMultiConnectionProtocol, self).dataReceived(data)
20-02-10 08:52:17-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/amp.py", line 2406, in dataReceived
20-02-10 08:52:17-05 [::]     return Int16StringReceiver.dataReceived(self, data)
20-02-10 08:52:17-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/basic.py", line 766, in dataReceived
20-02-10 08:52:17-05 [::]     self.stringReceived(packet)
20-02-10 08:52:17-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/basic.py", line 878, in stringReceived
20-02-10 08:52:17-05 [::]     self.state = statehandler(string)
20-02-10 08:52:17-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/amp.py", line 2459, in proto_key
20-02-10 08:52:17-05 [::]     self.boxReceiver.ampBoxReceived(self._currentBox)
20-02-10 08:52:17-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/amp.py", line 1058, in ampBoxReceived
20-02-10 08:52:17-05 [::]     self._answerReceived(box)
20-02-10 08:52:17-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/amp.py", line 984, in _answerReceived
20-02-10 08:52:17-05 [::]     question = self._outstandingRequests.pop(box[ANSWER])
20-02-10 08:52:17-05 [::] KeyError: b'5825'
20-02-10 08:52:17-05 [EE] Discarded incoming partial data: _ask3b3_commandAdminPortal2Server
20-02-10 08:52:17-05 [EE] packed_data
20-02-10 08:52:17-05 [EE] xÚM�ÏJÃ@Æ[�6�¶">
20-02-10 08:52:17-05 [EE] -bí1MB(êF¢E/Öì¶·Ù�Ý�a
20-02-10 08:52:17-05 [EE] ÆgóuÜþQ:��o~³3óÞþ~j66q39yÃSpdÉ+ªsY 4��+e£�âÚÐ++©e&EúÂ_:���k�2V/0
20-02-10 08:52:17-05 [EE] ¹ÃáØ=w£
20-02-10 08:52:17-05 [EE] KÕ9C�M+�æ'°³� ÐÚÒv]Ð%7�#äbÁY�ø        V¹¥N&�"Õ¹qDÞÏ×£åù}8Î�,Tét�«üYì³îÛ«9ë���
20-02-10 08:52:17-05 [EE] ¼jÀÑÿ"sAj³\7$~
20-02-10 08:52:17-05 [EE] LI�f
20-02-10 08:52:17-05 [EE] =?»DèÝùI�$ô�01
20-02-10 08:52:17-05 [EE] LÉíì'³È¨Vâ=�Ç&±_Ç�×ë�U+^¥»�Õõº¿�3�Þ
20-02-10 08:52:18-05 [::] Traceback (most recent call last):
20-02-10 08:52:18-05 [::]   File "/home/tehom/arx/evennia/evennia/server/portal/amp.py", line 315, in dataReceived
20-02-10 08:52:18-05 [::]     super(AMPMultiConnectionProtocol, self).dataReceived(data)
20-02-10 08:52:18-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/amp.py", line 2406, in dataReceived
20-02-10 08:52:18-05 [::]     return Int16StringReceiver.dataReceived(self, data)
20-02-10 08:52:18-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/basic.py", line 766, in dataReceived
20-02-10 08:52:18-05 [::]     self.stringReceived(packet)
20-02-10 08:52:18-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/basic.py", line 878, in stringReceived
20-02-10 08:52:18-05 [::]     self.state = statehandler(string)
20-02-10 08:52:18-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/amp.py", line 2459, in proto_key
20-02-10 08:52:18-05 [::]     self.boxReceiver.ampBoxReceived(self._currentBox)
20-02-10 08:52:18-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/amp.py", line 1058, in ampBoxReceived
20-02-10 08:52:18-05 [::]     self._answerReceived(box)
20-02-10 08:52:18-05 [::]   File "/home/tehom/arx/arxvenv/lib/python3.8/site-packages/twisted/protocols/amp.py", line 984, in _answerReceived
20-02-10 08:52:18-05 [::]     question = self._outstandingRequests.pop(box[ANSWER])
20-02-10 08:52:18-05 [::] KeyError: b'5825'
20-02-10 08:52:18-05 [EE] Discarded incoming partial data: _ask3b3_commandAdminPortal2Server
20-02-10 08:52:18-05 [EE] packed_dataTxÚ=POOÂ0
20-02-10 08:52:18-05 [EE]
20-02-10 08:52:18-05 [EE]òoxÐ/áÅ%9�®n
20-02-10 08:52:18-05 [EE] [»�"x"
20-02-10 08:52:18-05 [EE] ▒��l��OF�
20-02-10 08:52:18-05 [EE] <ÔoàwòëØÚËû½_ßï½ä½V¿¾Ë¥Ý
20-02-10 08:52:18-05 [EE] �§/ê\6ø6Íb±â▒��­<ÍsÍ�XĬmÆ_ðõü!}VòX¤ëM*�¬ÅI�i­�íΠow»ûÒîô{Z²³H�N�µÅ�ÙXðÍf.V�©rÁÏç¬`[�ü»/×ñ}¾K¬#©��«dµË�k%­1�"�Ô�lbN���W³L5Ô
20-02-10 08:52:18-05 [EE] Ô§æ¿A¨K�§åê]¶öÇSì0O�J#�ÿ9»
20-02-10 08:52:18-05 [EE] ÛoÏrÙ¾¡D"��   Ò·UÆîB3Tc��¡É¤th �f�
20-02-10 08:52:18-05 [EE] ��o�º>c(
20-02-10 08:52:18-05 [EE] ºW==7¡�aHwRU²b�s
20-02-10 08:52:18-05 [EE] ®
20-02-10 08:52:18-05 [EE] S|ö�fóCñEñ¦ì_�T�§

To Reproduce

Steps to reproduce the behavior:

  1. Random error in AMP 2.try to shut down server from console
  2. Command freezes
  3. Have to kill -9 the process

Expected behavior

The game to handle the AMP error gracefully and not cause the process to lock up.

Environment, Evennia version, OS etc

If unsure, run evennia -v or get the first few lines of the about command in-game.

Additional context

Getting rid of the discarded partial data bit might solve it, since that seems like it could be what’s screwing up the log. Alternately, making shutdown not write to the log so that it can be independent of a buggy log might work, or wrapping logging during shutdown in a deferred with an errback, maybe? Something like that.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 34 (22 by maintainers)

Commits related to this issue

Most upvoted comments

This is the tested and working version of my suggestion to manually clean the buffer: https://pastebin.com/jsuEpTKN

The downside is that because it interacts with lower layers of twisted infrastructure directly, it may break if Twisted ever updates that code and you update Twisted.

Thanks a lot for the discussion and help to figure this one out (or at least mitigate it)!

I have still not been able to determine exactly what causes this error at its core, but based on @Zironic’s work, I did test and incorporate a patch for Evennia 1.0-dev (develop branch). I took the slightly different route of replacing stringReceived on the AMPMultiConnectionPrototocol (adding a try...except KeyError inside it) - this meant copying less code out of the twisted library and seems to work with the added benefit of not disconnecting the server from the portal at any time - so it should not run into Tehom’s extended error above. Using the example code to trigger this, this solution seems to work correctly, but I’ve never been able to reproduce the original error … so if you try out develop and find that this solution still has issues, comment to re-open this again.

So I can sort-of replicate this behavior the brute-force way, by modifying Twisted itself (if someone’s interested in playing around with it):

In twisted/protocols/amp.py, line 976:


    always_fail = False
    def _answerReceived(self, box):
        """
        An AMP box was received that answered a command previously sent with
        L{callRemote}.

        @param box: an AmpBox with a value for its L{ANSWER} key.
        """
        import random
        if self.always_fail or random.random() < 0.01:
            box[ANSWER] = "random_dummybox"
            self.always_fail = True

        question = self._outstandingRequests.pop(box[ANSWER])
        question.addErrback(self.unhandledError)
        question.callback(box)

So there is a 1% chance the error emerges, and once it does, it will keep failing. It’s crude but it mimics the behavior and shows a few things:

  • The repeating error is not a retry-loop. It comes from new messages trying to pass over the connection and keep failing (on Arx this is presumably users entering commands, causing the errors to pile up). This is why the message size is varying in the tracebacks. It has nothing to do with multi-part message sending either.
  • As a test, I added a simple exception counter in amp.py’s AMPMultiConnectionProtocol.dataReceived that re-raises the KeyError if there are more than X errors raised in a row. This resets the connection and stops the failing loop, at least in this case (since it’s a new protocol instance with only 1% chance of failing in my test).
  • Raising an exception is not ideal though, since this leads to the Server reloading, which is not what I want. Ideally I want to close the connection and re-open it, letting the Server (which is the AMP-client) auto-re-connect normally. I’ll need to investigate if one can actually do this cleanly.