salt: [BUG] 3005 Salt Minion runs into timeout when uploading highstate report

Description

My Salt Minions run a self-scheduled highstate every hour. When my Master is under high load, some Minions occasionally run into the following error during that window:

[...]
  File "salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out

The Master log does not mention anything about an incoming connection. I think the Master cannot handle the requests due to its high load, and the Minion runs into a timeout while trying to upload its report. This happens to my Minions independent of their OS (see the reports below for Ubuntu 22.04 and Rocky Linux 8).

Setup

I have a single Master with around 20 Minions. Once a day (around 3am), the Master server runs some upgrades which result in a high load for ~1h. During this time, the described error occurs on multiple Minions.

Minion config

The Minion config, which triggers an automatic highstate every hour:

master: <my.master>
log_level: warning

autosign_grains:
  - autosign_key



grains: 
  'autosign_key': '<my key>'  
schedule:
  highstate:
    function: state.highstate
    minutes: 60

The state itself is irrelevant; it just adapts the motd. It runs fine, just not when the Master is under high load.

  • on-prem machine
  • VM (Virtualbox, KVM, etc. please specify): vSphere
Steps to Reproduce the behavior

Debug logs are attached to the version-specific details below, since there are two logs for the corresponding Minions.

Expected behavior

If a timeout occurs during the authentication with the Master, I would expect a log message about it and not a stack trace in the logs.

Versions Report

Master (CentOS 7)

salt --versions-report
Salt Version:
          Salt: 3005
 
Dependency Versions:
          cffi: 1.14.5
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.11.3
       libgit2: Not Installed
      M2Crypto: 0.35.2
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.20
      pycrypto: Not Installed
  pycryptodome: Not Installed
        pygit2: Not Installed
        Python: 3.6.8 (default, Nov 16 2020, 16:55:22)
  python-gnupg: Not Installed
        PyYAML: 5.4.1
         PyZMQ: 18.0.1
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.1.4
 
System Versions:
          dist: centos 7 Core
        locale: UTF-8
       machine: x86_64
       release: 3.10.0-1160.76.1.el7.x86_64
        system: Linux
       version: CentOS Linux 7 Core

Minion A (Ubuntu 22.04)

salt --versions-report
Salt Version:
          Salt: 3005rc1
 
Dependency Versions:
          cffi: 1.14.6
      cherrypy: 18.6.1
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.0
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.17
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
        Python: 3.9.12 (main, Jul 18 2022, 22:36:53)
  python-gnupg: 0.4.8
        PyYAML: 5.4.1
         PyZMQ: 23.2.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4
 
System Versions:
          dist: ubuntu 22.04 jammy
        locale: utf-8
       machine: x86_64
       release: 5.15.0-48-generic
        system: Linux
       version: Ubuntu 22.04 jammy

Full stack trace

/var/log/salt/minion
2022-10-10 03:00:02,350 [salt.crypt       :509 ][DEBUG   ][72494] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'Minion host name', 'tcp://<Minion ip>:4506')
2022-10-10 03:00:02,373 [salt.state       :763 ][DEBUG   ][72494] Gathering pillar data for state run
2022-10-10 03:00:02,373 [salt.pillar      :71  ][DEBUG   ][72494] Determining pillar cache
2022-10-10 03:00:02,374 [salt.crypt       :509 ][DEBUG   ][72494] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion host name', 'tcp://<minion ip>:4506')
2022-10-10 03:00:03,042 [salt.utils.process:1155][DEBUG   ][43847] Subprocess Schedule(name=__mine_interval, jid=20221010030002042109) cleaned up
2022-10-10 03:01:02,441 [salt.utils.schedule:874 ][ERROR   ][72494] Unhandled exception running state.highstate
Traceback (most recent call last):
  File "salt/utils/schedule.py", line 841, in handle_func
    ret["return"] = self.functions[func](*args, **kwargs)
  File "salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "salt/loader/lazy.py", line 1228, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "salt/loader/lazy.py", line 1243, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/opt/saltstack/salt/run/salt/modules/state.py", line 1102, in highstate
    st_ = salt.state.HighState(
  File "salt/state.py", line 4769, in __init__
    self.state = State(
  File "salt/state.py", line 768, in __init__
    self.opts["pillar"] = self._gather_pillar()
  File "salt/state.py", line 833, in _gather_pillar
    return pillar.compile_pillar()
  File "salt/pillar/__init__.py", line 353, in compile_pillar
    ret_pillar = self.channel.crypted_transfer_decode_dictentry(
  File "salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "salt/utils/asynchronous.py", line 131, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "salt/ext/tornado/ioloop.py", line 459, in run_sync
    return future_cell[0].result()
  File "salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "salt/channel/client.py", line 171, in crypted_transfer_decode_dictentry
    ret = yield self.transport.send(
  File "salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "salt/transport/zeromq.py", line 914, in send
    ret = yield self.message_client.send(load, timeout=timeout)
  File "salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "salt/transport/zeromq.py", line 624, in send
    recv = yield future
  File "salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out
2022-10-10 03:01:02,462 [salt.utils.event :315 ][DEBUG   ][72494] SaltEvent PUB socket URI: /var/run/salt/minion/minion_event_e80f3588f1_pub.ipc
2022-10-10 03:01:02,462 [salt.utils.event :316 ][DEBUG   ][72494] SaltEvent PULL socket URI: /var/run/salt/minion/minion_event_e80f3588f1_pull.ipc
2022-10-10 03:01:02,470 [salt.utils.event :821 ][DEBUG   ][72494] Sending event: tag = __schedule_return; data = {'cmd': '_return', 'id': '<Minion host name>', 'fun': 'state.highstate', 'fun_args': [], 'schedule': 'highstate', 'jid': 'req', 'pid': 72494, 'return': 'Unhandled exception running state.highstate', 'success': False, 'retcode': 254, '_stamp': '2022-10-10T03:01:02.470071'}
2022-10-10 03:01:02,474 [salt.minion      :2689][DEBUG   ][43847] Minion of '<Master host name>' is handling event tag '__schedule_return'
2022-10-10 03:01:02,475 [salt.minion      :2191][INFO    ][43847] Returning information for job: req
2022-10-10 03:01:02,477 [salt.transport.ipc:373 ][DEBUG   ][72494] Closing IPCMessageClient instance

Minion B (Rocky Linux 8)

salt --versions-report
Salt Version:
          Salt: 3005
 
Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: 2.6.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.10.1
       libgit2: Not Installed
      M2Crypto: 0.35.2
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: Not Installed
  pycryptodome: Not Installed
        pygit2: Not Installed
        Python: 3.6.8 (default, May 19 2021, 03:00:47)
  python-gnupg: Not Installed
        PyYAML: 3.12
         PyZMQ: 20.0.0
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.4
 
System Versions:
          dist: rocky 8.4 Green Obsidian
        locale: UTF-8
       machine: x86_64
       release: 4.18.0-305.3.1.el8_4.x86_64
        system: Linux
       version: Rocky Linux 8.4 Green Obsidian
 

Full stack trace

/var/log/salt/minion
2022-10-13 04:30:48,431 [salt.utils.schedule:874 ][ERROR   ][90713] Unhandled exception running state.highstate
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/schedule.py", line 841, in handle_func
    ret["return"] = self.functions[func](*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1228, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
    return callable(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1243, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/modules/state.py", line 1110, in highstate
    initial_pillar=_get_initial_pillar(opts),
  File "/usr/lib/python3.6/site-packages/salt/state.py", line 4778, in __init__
    initial_pillar=initial_pillar,
  File "/usr/lib/python3.6/site-packages/salt/state.py", line 768, in __init__
    self.opts["pillar"] = self._gather_pillar()
  File "/usr/lib/python3.6/site-packages/salt/state.py", line 833, in _gather_pillar
    return pillar.compile_pillar()
  File "/usr/lib/python3.6/site-packages/salt/pillar/__init__.py", line 355, in compile_pillar
    dictkey="pillar",
  File "/usr/lib/python3.6/site-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "/usr/lib/python3.6/site-packages/salt/utils/asynchronous.py", line 131, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
    return future_cell[0].result()
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3.6/site-packages/salt/channel/client.py", line 173, in crypted_transfer_decode_dictentry
    timeout=timeout,
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 914, in send
    ret = yield self.message_client.send(load, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 624, in send
    recv = yield future
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out


About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 30 (21 by maintainers)

Most upvoted comments

The OP agreed that the issue could be closed IF other users could not confirm it was still a problem or provide logs showing that it is.

I confirmed it’s still a problem and provided additional logs.

Here’s more evidence #64176 won’t fix it.

The TCP listener on the salt-master's :4506 port doesn't respond to a simple TCP connect/close attempt using nc:

$ ss -f inet -l sport = :4506 && nc -v -z localhost 4506
Netid            State             Recv-Q            Send-Q                        Local Address:Port                         Peer Address:Port            Process
tcp              LISTEN            4097              4096                                0.0.0.0:4506                              0.0.0.0:*
nc: connect to localhost port 4506 (tcp) failed: Connection timed out

Note: on a healthy salt-master this is nearly instantaneous, even though nc incorrectly reports “connection refused”:

ss -f inet -l sport = :4506 && nc -v -z localhost 4506
Netid            State             Recv-Q            Send-Q                        Local Address:Port                         Peer Address:Port            Process
tcp              LISTEN            0                 1000                                0.0.0.0:4506                              0.0.0.0:*
nc: connect to localhost port 4506 (tcp) failed: Connection refused
Connection to localhost 4506 port [tcp/*] succeeded!

Here’s a SaltReqTimeoutError: Message timed out generated from a salt-call test.ping - which I’m quite confident does not use any pillar functionality 😃

05/03/23 14:47:08,712 [INFO salt.minion:749 pid=35922] Master <redacted> could not be reached, trying next master (if any)
05/03/23 14:47:08,715 [TRACE salt.transport.zeromq:32 pid=35922] Event monitor done!
05/03/23 14:47:08,718 [DEBUG salt.utils.network:2307 pid=35922] "<redacted>" Not an IP address? Assuming it is a hostname.
05/03/23 14:47:10,847 [DEBUG salt.minion:232 pid=35922] Master URI: tcp://<redacted>:4506
05/03/23 14:47:10,847 [DEBUG salt.crypt:514 pid=35922] Re-using AsyncAuth for ('/etc/salt/pki/minion', '1436A5A7-EBAD-4E51-8F58-EDBBFAA287EC', 'tcp://<redacted>:4506')
05/03/23 14:47:10,850 [DEBUG salt.transport.zeromq:156 pid=35922] Generated random reconnect delay between '1000ms' and '11000ms' (3976)
05/03/23 14:47:10,851 [DEBUG salt.transport.zeromq:163 pid=35922] Setting zmq_reconnect_ivl to '3976ms'
05/03/23 14:47:10,851 [DEBUG salt.transport.zeromq:167 pid=35922] Setting zmq_reconnect_ivl_max to '11000ms'
05/03/23 14:47:10,852 [TRACE salt.transport.zeromq:32 pid=35922] Event monitor start!
05/03/23 14:47:10,853 [DEBUG salt.crypt:208 pid=35922] salt.crypt.get_rsa_key: Loading private key
05/03/23 14:47:10,854 [DEBUG salt.crypt:900 pid=35922] Loaded minion key: /etc/salt/pki/minion/minion.pem
05/03/23 14:47:10,869 [DEBUG salt.utils.event:315 pid=35922] SaltEvent PUB socket URI: /var/run/salt/minion/minion_event_afb89ddbfa_pub.ipc
05/03/23 14:47:10,869 [DEBUG salt.utils.event:316 pid=35922] SaltEvent PULL socket URI: /var/run/salt/minion/minion_event_afb89ddbfa_pull.ipc
05/03/23 14:47:10,870 [DEBUG salt.crypt:216 pid=35922] salt.crypt.get_rsa_pub_key: Loading public key
05/03/23 14:47:10,874 [TRACE salt.channel.client:32 pid=35922] ReqChannel send clear load={'cmd': '_auth', 'id': '1436A5A7-EBAD-4E51-8F58-EDBBFAA287EC', 'nonce': '8dd4b82184034f81af1a7b35e472ffb8', 'autosign_grains': {}, 'token': b"<redacted>", 'pub': '<redacted>'}
05/03/23 14:47:40,877 [TRACE salt.channel.client:32 pid=35922] Failed to send msg SaltReqTimeoutError('Message timed out')
05/03/23 14:47:40,880 [WARNING salt.crypt:764 pid=35922] SaltReqTimeoutError: Message timed out
05/03/23 14:47:40,881 [DEBUG salt.channel.client:313 pid=35922] Closing AsyncReqChannel instance

It took nearly 8 months to get someone to pay attention to this issue… please don’t make us open another one for the exact same problem because this was closed prematurely.

If the master is under high load then you will get timeouts, yes.

@tjyang your first example above is actually working: it first attempts to connect to the IPv6 localhost (::1), which doesn't have an open port, but then succeeds when connecting to the IPv4 localhost (Ncat: Connected to 127.0.0.1:4506.).
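A quick way to avoid that ambiguity is to probe each loopback family explicitly; a minimal sketch (the -4/-6 flags are available in common nc/ncat builds):

# Probe IPv4 and IPv6 loopback separately so a "Connection refused" on ::1
# is not mistaken for a problem with the IPv4 listener on :4506.
nc -4 -v -z 127.0.0.1 4506
nc -6 -v -z ::1 4506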

Another (perhaps the best) indication of this problem is a high Recv-Q value as shown in the ss (or netstat) command output, for example:

# netstat -tlnp | sed -n '2p;/:4506/p'
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp     4097      0 0.0.0.0:4506            0.0.0.0:*               LISTEN      3192059/salt-master

This shows my salt-master's :4506 (ret) listener with a Recv-Q of 4097 (it should be zero or very close to it).

My buddy ChatGPT explains Recv-Q quite well:

Recv-Q explanation

The “Recv-Q” column in the output of the Linux “netstat” command represents the amount of data (measured in bytes) that has been received by the network stack but has not yet been read by the application. In other words, it is the amount of data queued in the receive buffer, waiting to be processed by the application.

A non-zero value in the Recv-Q column might not necessarily indicate a problem, as it could be a transient situation where data is being received faster than the application can process it. However, if the Recv-Q value is consistently high or increasing over time, it may indicate an issue with the application, such as slow processing or being blocked due to other reasons.

To troubleshoot high or increasing Recv-Q values, you can take the following steps:

Identify the process: Use the “netstat” command with the “-p” flag to identify the process associated with the socket showing the high Recv-Q value. This will help you understand which application is experiencing the issue.

netstat -tulnp | grep <port_number>

Monitor application performance: Check the CPU and memory usage of the application to see if it is under heavy load, causing it to process incoming data slowly. You can use tools like “top” or “htop” to monitor system resource usage.

Check application logs: Inspect the application logs to look for any errors or indications of slow processing. This may help you identify the cause of the issue.

Increase buffer sizes: If the application is working as expected, but the Recv-Q is still high, consider increasing the socket receive buffer size using setsockopt() with the SO_RCVBUF option. This can help accommodate higher amounts of incoming data.

Optimize application code: If the application is struggling to keep up with incoming data, optimize its code to improve performance. This may involve refactoring or rewriting parts of the application to process data more efficiently.

Scale the application: If the application cannot be optimized further and is consistently receiving more data than it can handle, consider scaling the application horizontally by adding more instances or vertically by allocating more resources to the application.

Remember that occasional non-zero Recv-Q values are normal and not always a cause for concern. However, consistently high or increasing values warrant investigation and possible action to resolve the underlying issue.
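To catch the condition as it happens, here is a minimal sketch of a sampling loop; it assumes the ss output format shown above, where Recv-Q is the third column of the listener line:

# Sample the :4506 listener's Recv-Q every 10 seconds; a value that stays
# high or keeps growing (like the 4097 above) points at an overwhelmed master.
while true; do
    printf '%s ' "$(date '+%F %T')"
    ss -f inet -l 'sport = :4506' | awk 'NR > 1 { print "Recv-Q:", $3 }'
    sleep 10
done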

Also seeing this regularly after updating to 3005 (master and minion). It does not appear to affect 3004 minions talking to a 3005 master. It causes minions to lose their ‘connected’ status with the master, so the master cannot send commands to the minion. The minion, however, can still send commands to the master; e.g. the minion can run saltutil.sync_all just fine, but the same command run from the master produces a job that never returns.

Environment:

master: 3005.1 CentOS 7

minions: 3005.1 various (e.g. AlmaLinux 8, CloudLinux 8, CentOS 7, CloudLinux 7)

@rrrix The originator of the issue said it was okay to close, and the fix in #64176 resolves the saltutil.sync_all issue that @tjyang had. If memory serves me correctly, I believe the highstate will cause a refresh of grains, which at present will clear the pillar cache, which #64176 fixes.

Salt 3006.1 is coming very soon, hence I will leave this closed, but if 3006.1 does not resolve your issues, please open a new issue with your conditions to reproduce the problem. And yes, it is well known that rendering the pillar cache is time-consuming, which is why it is recommended to limit the amount of data in pillars.

@tjyang Your command to make the timeout occur

sudo time salt-call saltutil.sync_all;date

Suggests the upcoming fix https://github.com/saltstack/salt/pull/64176 which addresses an issue where a sync_all would clear the pillar cache, thus causing it to have to be rebuilt, which is a time-consuming operation. The fix should be available in 3006.1 which will be released soon.
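To check whether pillar recompilation is indeed what dominates the round trip on an affected minion, one rough comparison (pillar.items and saltutil.sync_all are standard salt-call functions; reading the timings this way is only a suggestion):

# Time a full pillar compile against the sync_all call quoted above; if
# pillar.items takes comparably long, the pillar cache rebuild is the likely cost.
time salt-call pillar.items > /dev/null
time salt-call saltutil.sync_all
date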

@bastian-src Thanks for the information, given @tjyang 's additional information, I believe https://github.com/saltstack/salt/pull/64176 fixes the issue encountered, hence closing this issue as you suggested.

I don’t see this now because we had to throw resources at the master to make it go away. I don’t believe it has gone away by other means.

@bastian-src

I use the command "sudo time salt-call saltutil.sync_all;date" to identify the timeout issue. Currently I don't see timeout messages on my salt masters.

  • multi master01 on 3005.1
local:
    ----------
    beacons:
    clouds:
    engines:
    executors:
    grains:
    log_handlers:
    matchers:
    modules:
    output:
    proxymodules:
    renderers:
    returners:
    sdb:
    serializers:
    states:
    thorium:
    utils:
3.62user 1.02system 0:10.67elapsed 43%CPU (0avgtext+0avgdata 67732maxresident)k
1544inputs+8outputs (0major+78001minor)pagefaults 0swaps
Wed May  3 05:52:33 CDT 2023
  • multi master02 on 3006.0
local:
    ----------
    beacons:
    clouds:
    engines:
    executors:
    grains:
    log_handlers:
    matchers:
    modules:
    output:
    proxymodules:
    renderers:
    returners:
    sdb:
    serializers:
    states:
    thorium:
    utils:
2.13user 0.39system 0:15.13elapsed 16%CPU (0avgtext+0avgdata 63628maxresident)k
1024inputs+8outputs (0major+45128minor)pagefaults 0swaps
Wed May  3 06:57:27 EDT 2023

@dmurphy18 and @cmcmarrow thanks a lot for your comments!

I created this issue because the error produced an unexpected failure in our tool, which it couldn't handle (have a look here). I've added a PR to enable it to handle the error, and afterwards I didn't check whether it actually occurs anymore.

That being said, I just checked the log files of several Minions, but none had the issue anymore. No update of any Salt package (Master/Minion) has happened since then, so I assume the Master simply hasn't been under such high load again.

In case it is still an issue for other users @afletch @tjyang @RobinWlund, maybe they can provide logs as @dmurphy18 described. If not, I think the issue can be closed.

Thanks @dmurphy18 and @cmcmarrow for the hints in your posts and for the extensive explanation!

Hi @bastian-src, do you have the pillar cache enabled? If so, #64081 might be your fix. Any and all highstates will clear the pillar cache when run, which takes up a lot of CPU and memory on the master.

Also, this issue which is being fixed, https://github.com/saltstack/salt/issues/64081, may be the reason for the timeouts as well: a refresh_grains will also clear the pillar data, which means it has to be recompiled - a time-consuming operation that can overwhelm a master if it has to do so for many minions.

@bastian-src Before digging heavily into this, I thought I'd check whether you have tried adjusting some configuration values to see if they help with the issues encountered.

Master config:

  • worker_threads - the default is 5; have you tried doubling this (or some other value) to see if it makes a difference?
  • gather_job_timeout - the default is 10; you could double it to see if it makes a difference.
  • timeout - the default is 5 seconds; can you try doubling it and see if it has any effect? In the past I have had to increase this value when building the classic packaging, which used Salt states and orchestration to generate OS VMs on AWS, build the various Salt releases, and then destroy the VMs used - which is similar to the use case here.

Minion config:

  • enable_fqdns_grains - prior to 3006.0 this defaulted to True, which could induce a delay; in 3006.0 the default is False. Try setting this to False, or use 3006.0 minions, to see if it helps the issue. The reverse DNS lookup could take up to 5 seconds, significantly affecting latency.
  • enable_gpu_grains - if you don't need GPU hardware grains, set this from the default True to False; it will speed up grains.
  • auth_timeout - the default is 60, which should be adequate, but you could double it to see if it has any effect.

A sketch with these doubled values follows the list.
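For reference, a minimal sketch of dropping in the doubled values from the lists above. The master.d/minion.d include directories are standard, but the tuning.conf file names and the exact numbers are only illustrative:

# Hypothetical drop-in with doubled values for the master options listed above.
cat > /etc/salt/master.d/tuning.conf <<'EOF'
worker_threads: 10
gather_job_timeout: 20
timeout: 10
EOF

# Hypothetical drop-in for the minion options listed above.
cat > /etc/salt/minion.d/tuning.conf <<'EOF'
enable_fqdns_grains: False
enable_gpu_grains: False
auth_timeout: 120
EOF

# Restart the services so the new values are picked up.
systemctl restart salt-master     # on the master
systemctl restart salt-minion     # on each minion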

If you could also provide much more of the log, perhaps I could then see what operations are occurring, which may help narrow down the operations that are causing the timeout. A log file with trace enabled in the minion's configuration:

# Default: 'warning'
log_level: trace

# The level of messages to send to the log file.
# One of 'garbage', 'trace', 'debug', 'info', 'warning', 'error', 'critical'.
# If using 'log_granular_levels' this must be set to the highest desired level.
# Default: 'warning'
log_level_logfile: trace

# The date and time format used in log messages. Allowed date/time formatting
# can be seen here: http://docs.python.org/library/time.html#time.strftime
log_datefmt: '%H:%M:%S'
log_datefmt_logfile: '%Y-%m-%d %H:%M:%S'

and emailed to me at damurphy@vmware.com would help. You should not post the log file here in case it contains sensitive information, and I advise redacting any sensitive information in the log file, if necessary.

I think adjusting 'timeout' and 'worker_threads' in the master config is most likely to have an effect, for the reasons stated above - especially worker_threads, since this occurs under load and you most probably need more worker_threads to handle the additional load.

@bastian-src Thanks for that, I'm just trying to duplicate the environment that has the issue, with respect to debugging and fixing it. Puppet - got a little startled 😃. I will see if I can duplicate similar conditions without having to install Foreman; otherwise, I get to learn a new tool 😃

Can someone from the Salt team let us know, if they can, what was done to fix this? In my case the whole salt-cloud post-configuration is failing because of this problem. I have even approached Red Hat support and they could not find the root cause, and this is happening with any OS I configure as the salt master. In short, it is becoming a blocker in my case.