salt: Salt Tornado API: salt.exceptions.SaltReqTimeoutError: Message timed out

Description of Issue/Question

We just started using the salt API a few months ago and we cannot fix this issue. Currently, we are using Pepper to make the requests. I have created an “async” wrapper around it which runs a local_async command, polls with the jobs.exit_success runner for 30 seconds to a minute, then does a lookup_jid to get the results. Currently, the only thing we use this for is to retrieve all of the grains from the minions and store them in a database (just running grains.items non-batch mode). The output below does not contain the salt API logs from the successful local_async command and successful exit_success runners. It only contains the one that eventually fails.

When using the tornado API, about 2/3 of my “jobs” fail with the following message (trace logging on):

2019-05-16 06:08:28,354 [salt.utils.lazy  :104 ][DEBUG   ][148120] LazyLoaded jobs.exit_success
2019-05-16 06:08:28,354 [salt.transport.zeromq:136 ][DEBUG   ][148120] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'ch0pslop022_master', 'tcp://127.0.0.1:4506', 'clear')
2019-05-16 06:08:28,354 [salt.transport.zeromq:202 ][DEBUG   ][148120] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
2019-05-16 06:08:28,354 [salt.transport.zeromq:1084][DEBUG   ][148120] Trying to connect to: tcp://127.0.0.1:4506
2019-05-16 06:09:28,391 [salt.transport.zeromq:1139][DEBUG   ][148120] SaltReqTimeoutError, retrying. (1/3)
2019-05-16 06:10:28,451 [salt.transport.zeromq:1139][DEBUG   ][148120] SaltReqTimeoutError, retrying. (2/3)
2019-05-16 06:11:28,509 [salt.transport.zeromq:1139][DEBUG   ][148120] SaltReqTimeoutError, retrying. (3/3)
2019-05-16 06:12:28,569 [salt.transport.zeromq:1109][DEBUG   ][148120] Re-init ZMQ socket: Message timed out
2019-05-16 06:12:28,570 [salt.transport.zeromq:1084][DEBUG   ][148120] Trying to connect to: tcp://127.0.0.1:4506
2019-05-16 06:12:28,570 [salt.loaded.int.netapi.rest_tornado.saltnado:926 ][ERROR   ][148120] Unexpected exception while handling request:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/netapi/rest_tornado/saltnado.py", line 919, in disbatch
    chunk_ret = yield getattr(self, '_disbatch_{0}'.format(low['client']))(low)
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python3/dist-packages/salt/netapi/rest_tornado/saltnado.py", line 1114, in _disbatch_runner
    pub_data = self.saltclients['runner'](chunk)
  File "/usr/lib/python3/dist-packages/salt/runner.py", line 126, in cmd_async
    return mixins.AsyncClientMixin.cmd_async(self, reformatted_low)
  File "/usr/lib/python3/dist-packages/salt/client/mixins.py", line 505, in cmd_async
    return self.master_call(**low)
  File "/usr/lib/python3/dist-packages/salt/client/mixins.py", line 140, in master_call
    ret = channel.send(load)
  File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 64, in wrap
    ret = self._block_future(ret)
  File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 74, in _block_future
    return future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 331, in send
    ret = yield self._uncrypted_transfer(load, tries=tries, timeout=timeout)
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 320, in _uncrypted_transfer
    tries=tries,
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out

Setup

(Please provide relevant configs and/or SLS files (Be sure to remove sensitive info).) We have a 3 master multi-master setup. The master which this is running on is a 16 core 128 GB RAM physical host (plenty of hardware).

Steps to Reproduce Issue

(Include debug logs if possible and relevant.) See description of issue. May be complicated to reproduce but I can hand over more code if needed.

Versions Report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

[zbarr@saltmaster03: ~]$ salt --versions-report
Salt Version:
           Salt: 2018.3.4

Dependency Versions:
           cffi: Not Installed
       cherrypy: 3.5.0
       dateutil: 2.4.2
      docker-py: Not Installed
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.5.2 (default, Nov 23 2017, 16:37:01)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 17.1.2
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.2.5

System Versions:
           dist: Ubuntu 16.04 xenial
         locale: UTF-8
        machine: x86_64
        release: 4.4.0-119-generic
         system: Linux
        version: Ubuntu 16.04 xenial

Please advise. Thanks!

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Comments: 23 (8 by maintainers)

Commits related to this issue

Most upvoted comments

Still occurring on 3006.1 master with 3006.1 minions.

It’s a 8-core machine with 16 GB RAM. We have about 500 minions and the load average is typically around 1.2. worker_threads is set to 12 and sock_pool_size set to 15.