salt: salt-master crashes every few days

Description of Issue/Question

Every 4 or 5 days salt-master stops responding and we have to restart it.

Setup

We have a number of salt-syndics that connect to this salt-master. The syndics run a very old version (2014.7.5). We get a fair number of “Salt minion claiming to be <redacted> has attempted to communicate with the master and could not be verified” messages in the logs, but functionally everything works until salt-master crashes.

Steps to Reproduce Issue

2016-06-22 08:31:39,001 [salt.master                                          ][ERROR   ][7815] Unable to decrypt token: Unable to decrypt message
2016-06-22 08:31:39,006 [salt.master                                          ][ERROR   ][7815] Salt minion claiming to be <redacted> has attempted to communicate with the master and could not be verified
2016-06-22 08:31:39,006 [salt.master                                          ][WARNING ][7815] Minion id <redacted> is not who it says it is!
2016-06-22 08:31:39,455 [salt.transport.ipc                                   ][ERROR   ][7786] Exception occurred in Subscriber while handling stream: Already reading
2016-06-22 08:31:39,456 [salt.utils.process                                   ][ERROR   ][7786] An un-handled exception from the multiprocessing process 'Reactor-6' was caught:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/utils/process.py", line 613, in _run
    return self._original_run()
  File "/usr/lib/python2.6/site-packages/salt/utils/reactor.py", line 215, in run
    for data in self.event.iter_events(full=True):
  File "/usr/lib/python2.6/site-packages/salt/utils/event.py", line 620, in iter_events
    data = self.get_event(tag=tag, full=full, match_type=match_type)
  File "/usr/lib/python2.6/site-packages/salt/utils/event.py", line 579, in get_event
    ret = self._get_event(wait, tag, match_func, no_block)
  File "/usr/lib/python2.6/site-packages/salt/utils/event.py", line 484, in _get_event
    raw = self.subscriber.read_sync(timeout=wait)
  File "/usr/lib/python2.6/site-packages/salt/transport/ipc.py", line 654, in read_sync
    return ret_future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.6/site-packages/salt/transport/ipc.py", line 631, in _read_sync
    raise exc_to_raise  # pylint: disable=E0702
AssertionError: Already reading
2016-06-22 08:31:39,475 [salt.master                                          ][ERROR   ][7814] Error in function _minion_event:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/master.py", line 1540, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.6/site-packages/salt/master.py", line 1306, in _minion_event
    self.masterapi._minion_event(load)
  File "/usr/lib/python2.6/site-packages/salt/daemons/masterapi.py", line 752, in _minion_event
    self.event.fire_event(event, event['tag'])  # old dup event
  File "/usr/lib/python2.6/site-packages/salt/utils/event.py", line 662, in fire_event
    self.io_loop.run_sync(lambda: self.pusher.send(msg))
  File "/usr/lib64/python2.6/site-packages/tornado/ioloop.py", line 444, in run_sync
    return future_cell[0].result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/ipc.py", line 389, in send
    yield self.stream.write(pack)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
StreamClosedError

The StreamClosedError exception repeats about 15 times within 2 or 3 seconds, and then the log stops completely.

Versions Report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

Salt-Master (master of masters that crashes every few days):

[root@salty salt]# salt --versions
Salt Version:
           Salt: 2016.3.1

Dependency Versions:
           cffi: Not Installed
       cherrypy: 3.2.2
       dateutil: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: 0.20.0
        libnacl: Not Installed
       M2Crypto: 0.20.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: 0.20.3
         Python: 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.5.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: 0.2.4
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: centos 6.8 Final
        machine: x86_64
        release: 2.6.32-642.1.1.el6.x86_64
         system: Linux
        version: CentOS 6.8 Final

On the Salt-syndics with older versions:

               Salt: 2014.7.5
             Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
             Jinja2: 2.2.1
           M2Crypto: 0.20.2
     msgpack-python: 0.4.6
       msgpack-pure: Not Installed
           pycrypto: 2.0.1
            libnacl: Not Installed
             PyYAML: 3.10
              ioflo: Not Installed
              PyZMQ: 14.3.1
               RAET: Not Installed
                ZMQ: 3.2.5
               Mako: Not Installed


Most upvoted comments

OK, so I think we have a better idea of what’s going on here.

What this looks like is that we’re pushing data onto the Tornado write queue in the EventPublisher and that memory is never being reclaimed. There are a couple of possible factors in play here:

  1. According to the Tornado BaseIOStream.write documentation (http://www.tornadoweb.org/en/stable/iostream.html?highlight=iostream#tornado.iostream.BaseIOStream.write), if a write is issued while a previous write is still in progress, the future returned by the previous write is orphaned.

  2. Tornado iostreams have an unlimited write buffer by default. Setting max_write_buffer_size when the stream is created, especially to a small size (like 1), will immediately cause exactly the stack traces seen above. Here’s a diff that should show this:

diff --git a/salt/transport/ipc.py b/salt/transport/ipc.py
index b6737c3..9b58963 100644
--- a/salt/transport/ipc.py
+++ b/salt/transport/ipc.py
@@ -509,6 +511,7 @@ class IPCMessagePublisher(object):
             stream = IOStream(
                 connection,
                 io_loop=self.io_loop,
+                max_write_buffer_size=1
             )
             self.streams.add(stream)
         except Exception as exc:
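
For reference, here is a minimal, standalone Tornado sketch (not Salt code) of the buffer-limit behaviour the diff relies on: with max_write_buffer_size unset the write buffer is unlimited, while with a limit configured, a write that would exceed it raises StreamBufferFullError. The socketpair below stands in for the IPC connection and is purely illustrative.

import socket

from tornado.iostream import IOStream, StreamBufferFullError

# A connected socket pair stands in for the master's IPC connection.
reader_sock, writer_sock = socket.socketpair()

# The default is max_write_buffer_size=None (unlimited), so unread event data
# simply accumulates in memory.  A tiny limit makes the failure immediate.
stream = IOStream(writer_sock, max_write_buffer_size=1)

try:
    stream.write(b'an event payload larger than one byte')
except StreamBufferFullError as exc:
    print('write rejected: %s' % exc)
finally:
    stream.close()
    reader_sock.close()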

I think what we likely want to do here is to set max_write_buffer_size dynamically, perhaps as some percentage of the memory on the machine, and also allow it to be set statically.
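
A rough sketch of that idea follows; the ipc_write_buffer option name, the psutil dependency, and the 2.5% figure are illustrative assumptions, not existing Salt configuration.

import psutil


def get_ipc_write_buffer_size(opts):
    '''
    Return a max_write_buffer_size for the publisher's IOStreams, or None
    to keep Tornado's default (unlimited) buffer.
    '''
    configured = opts.get('ipc_write_buffer', 0)
    if configured == 'dynamic':
        # Cap each subscriber's write buffer at ~2.5% of total system memory.
        return int(psutil.virtual_memory().total * 0.025)
    if configured:
        # Static override: an explicit size in bytes.
        return int(configured)
    return None

# IPCMessagePublisher could then create its streams roughly as:
#     IOStream(connection, io_loop=self.io_loop,
#              max_write_buffer_size=get_ipc_write_buffer_size(self.opts))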

Once the buffer fills, we’ll need to drop messages. We should log these drops, ideally batching the warnings together, as sketched below.
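
A sketch of what that could look like (again illustrative, not Salt code; the 60-second summary interval is an arbitrary choice):

import logging
import time

from tornado.iostream import StreamBufferFullError, StreamClosedError

log = logging.getLogger(__name__)


class DroppingPublisher(object):
    def __init__(self, streams):
        self.streams = streams          # a set of Tornado IOStream objects
        self.dropped = 0                # events dropped since the last summary
        self.last_report = time.time()

    def publish(self, pack):
        for stream in list(self.streams):
            try:
                stream.write(pack)
            except StreamBufferFullError:
                # Subscriber is not keeping up: drop this event for it.
                self.dropped += 1
            except StreamClosedError:
                self.streams.discard(stream)
        self._maybe_report()

    def _maybe_report(self):
        # Batch the warnings instead of logging one line per dropped event.
        if self.dropped and time.time() - self.last_report > 60:
            log.warning('Dropped %d events: subscriber write buffer full',
                        self.dropped)
            self.dropped = 0
            self.last_report = time.time()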

We may also wish to consider an option that lets users select ZMQ for the event bus (though we removed that entirely a while ago); some users may prefer the buffering algorithms already implemented there.