salt: [BUG] 3005 Salt Minion runs into timeout when uploading highstate report
Description
My Salt Minions run a self-scheduled highstate every hour. When my Master is under high load, some Minions occasionally run into the following error during that time:
[...]
File "salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out
The Master log does not mention anything about an incoming connection. I think the Master cannot handle the requests due to its high load, and the Minion runs into a timeout while trying to upload its report. This happens to my Minions regardless of their OS (see the reports below: Ubuntu 22 and Rocky Linux).
Setup
I have a single Master with around 20 Minions. Once a day (around 3am), the Master server runs some upgrades which result in a high load for ~1h. During this time, the described error occurs on multiple Minions.
Minion config
The Minion config, which triggers an automatic highstate every hour:
master: <my.master>
log_level: warning
autosign_grains:
- autosign_key
grains:
'autosign_key': '<my key>'
schedule:
highstate:
function: state.highstate
minutes: 60
The state itself is irrelevant; it just adapts the motd. It runs fine, just not when the Master is under high load.
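For completeness, a minimal sketch of what such a state could look like (the file name and contents are illustrative only, not the actual state used here):

# motd.sls (illustrative)
/etc/motd:
  file.managed:
    - contents: |
        Managed by Salt - do not edit manually.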
- on-prem machine
- VM (Virtualbox, KVM, etc. please specify): vSphere
- classic packaging
- onedir packaging
- used bootstrap to install
Steps to Reproduce the behavior
Debug logs are attached to the version-specific details below, since there are two logs for the corresponding Minions.
Expected behavior: If a timeout occurs during authentication with the Master, I would expect a log message about it rather than a stack trace in the logs.
Versions Report
Master (CentOS 7)
salt --versions-report
Salt Version:
Salt: 3005
Dependency Versions:
cffi: 1.14.5
cherrypy: unknown
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 2.11.3
libgit2: Not Installed
M2Crypto: 0.35.2
Mako: Not Installed
msgpack: 0.6.2
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: 2.20
pycrypto: Not Installed
pycryptodome: Not Installed
pygit2: Not Installed
Python: 3.6.8 (default, Nov 16 2020, 16:55:22)
python-gnupg: Not Installed
PyYAML: 5.4.1
PyZMQ: 18.0.1
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.1.4
System Versions:
dist: centos 7 Core
locale: UTF-8
machine: x86_64
release: 3.10.0-1160.76.1.el7.x86_64
system: Linux
version: CentOS Linux 7 Core
Minion A (Ubuntu 22.04)
salt --versions-report
Salt Version:
Salt: 3005rc1
Dependency Versions:
cffi: 1.14.6
cherrypy: 18.6.1
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.0
libgit2: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: 2.17
pycrypto: Not Installed
pycryptodome: 3.9.8
pygit2: Not Installed
Python: 3.9.12 (main, Jul 18 2022, 22:36:53)
python-gnupg: 0.4.8
PyYAML: 5.4.1
PyZMQ: 23.2.0
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: ubuntu 22.04 jammy
locale: utf-8
machine: x86_64
release: 5.15.0-48-generic
system: Linux
version: Ubuntu 22.04 jammy
Full stack trace
/var/log/salt/minion
2022-10-10 03:00:02,350 [salt.crypt :509 ][DEBUG ][72494] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'Minion host name', 'tcp://<Minion ip>:4506')
2022-10-10 03:00:02,373 [salt.state :763 ][DEBUG ][72494] Gathering pillar data for state run
2022-10-10 03:00:02,373 [salt.pillar :71 ][DEBUG ][72494] Determining pillar cache
2022-10-10 03:00:02,374 [salt.crypt :509 ][DEBUG ][72494] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion host name', 'tcp://<minion ip>:4506')
2022-10-10 03:00:03,042 [salt.utils.process:1155][DEBUG ][43847] Subprocess Schedule(name=__mine_interval, jid=20221010030002042109) cleaned up
2022-10-10 03:01:02,441 [salt.utils.schedule:874 ][ERROR ][72494] Unhandled exception running state.highstate
Traceback (most recent call last):
File "salt/utils/schedule.py", line 841, in handle_func
ret["return"] = self.functions[func](*args, **kwargs)
File "salt/loader/lazy.py", line 149, in __call__
return self.loader.run(run_func, *args, **kwargs)
File "salt/loader/lazy.py", line 1228, in run
return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
File "salt/loader/lazy.py", line 1243, in _run_as
return _func_or_method(*args, **kwargs)
File "/opt/saltstack/salt/run/salt/modules/state.py", line 1102, in highstate
st_ = salt.state.HighState(
File "salt/state.py", line 4769, in __init__
self.state = State(
File "salt/state.py", line 768, in __init__
self.opts["pillar"] = self._gather_pillar()
File "salt/state.py", line 833, in _gather_pillar
return pillar.compile_pillar()
File "salt/pillar/__init__.py", line 353, in compile_pillar
ret_pillar = self.channel.crypted_transfer_decode_dictentry(
File "salt/utils/asynchronous.py", line 125, in wrap
raise exc_info[1].with_traceback(exc_info[2])
File "salt/utils/asynchronous.py", line 131, in _target
result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
File "salt/ext/tornado/ioloop.py", line 459, in run_sync
return future_cell[0].result()
File "salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "salt/channel/client.py", line 171, in crypted_transfer_decode_dictentry
ret = yield self.transport.send(
File "salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "salt/transport/zeromq.py", line 914, in send
ret = yield self.message_client.send(load, timeout=timeout)
File "salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "salt/transport/zeromq.py", line 624, in send
recv = yield future
File "salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out
2022-10-10 03:01:02,462 [salt.utils.event :315 ][DEBUG ][72494] SaltEvent PUB socket URI: /var/run/salt/minion/minion_event_e80f3588f1_pub.ipc
2022-10-10 03:01:02,462 [salt.utils.event :316 ][DEBUG ][72494] SaltEvent PULL socket URI: /var/run/salt/minion/minion_event_e80f3588f1_pull.ipc
2022-10-10 03:01:02,470 [salt.utils.event :821 ][DEBUG ][72494] Sending event: tag = __schedule_return; data = {'cmd': '_return', 'id': '<Minion host name>', 'fun': 'state.highstate', 'fun_args': [], 'schedule': 'highstate', 'jid': 'req', 'pid': 72494, 'return': 'Unhandled exception running state.highstate', 'success': False, 'retcode': 254, '_stamp': '2022-10-10T03:01:02.470071'}
2022-10-10 03:01:02,474 [salt.minion :2689][DEBUG ][43847] Minion of '<Master host name>' is handling event tag '__schedule_return'
2022-10-10 03:01:02,475 [salt.minion :2191][INFO ][43847] Returning information for job: req
2022-10-10 03:01:02,477 [salt.transport.ipc:373 ][DEBUG ][72494] Closing IPCMessageClient instance
Minion B (Rocky Linux 8)
salt --versions-report
Salt Version:
Salt: 3005
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.6.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 2.10.1
libgit2: Not Installed
M2Crypto: 0.35.2
Mako: Not Installed
msgpack: 0.6.2
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: Not Installed
pycryptodome: Not Installed
pygit2: Not Installed
Python: 3.6.8 (default, May 19 2021, 03:00:47)
python-gnupg: Not Installed
PyYAML: 3.12
PyZMQ: 20.0.0
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: rocky 8.4 Green Obsidian
locale: UTF-8
machine: x86_64
release: 4.18.0-305.3.1.el8_4.x86_64
system: Linux
version: Rocky Linux 8.4 Green Obsidian
Full stack trace
/var/log/salt/minion
2022-10-13 04:30:48,431 [salt.utils.schedule:874 ][ERROR ][90713] Unhandled exception running state.highstate
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/salt/utils/schedule.py", line 841, in handle_func
ret["return"] = self.functions[func](*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
return self.loader.run(run_func, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1228, in run
return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
return callable(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1243, in _run_as
return _func_or_method(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/modules/state.py", line 1110, in highstate
initial_pillar=_get_initial_pillar(opts),
File "/usr/lib/python3.6/site-packages/salt/state.py", line 4778, in __init__
initial_pillar=initial_pillar,
File "/usr/lib/python3.6/site-packages/salt/state.py", line 768, in __init__
self.opts["pillar"] = self._gather_pillar()
File "/usr/lib/python3.6/site-packages/salt/state.py", line 833, in _gather_pillar
return pillar.compile_pillar()
File "/usr/lib/python3.6/site-packages/salt/pillar/__init__.py", line 355, in compile_pillar
dictkey="pillar",
File "/usr/lib/python3.6/site-packages/salt/utils/asynchronous.py", line 125, in wrap
raise exc_info[1].with_traceback(exc_info[2])
File "/usr/lib/python3.6/site-packages/salt/utils/asynchronous.py", line 131, in _target
result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
return future_cell[0].result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/channel/client.py", line 173, in crypted_transfer_decode_dictentry
timeout=timeout,
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 914, in send
ret = yield self.message_client.send(load, timeout=timeout)
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 624, in send
recv = yield future
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 30 (21 by maintainers)
The OP agreed to close the issue if other users could not confirm it was still a problem or provide logs supporting that it is.
I confirmed it’s still a problem and provided additional logs.
Here’s more evidence that #64176 won’t fix it.
TCP Listener for the salt master
The :4506 port doesn’t respond to a simple TCP connect/close attempt using nc. Note: on a healthy salt-master this is nearly instantaneous, even though nc incorrectly reports “connection refused”. Here’s a SaltReqTimeoutError: Message timed out generated from a salt-call test.ping, which I’m quite confident does not use any pillar functionality 😃
It took nearly 8 months to get someone to pay attention to this issue… please don’t make us open another one for the exact same problem because this was closed prematurely.
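For reference, that kind of probe might look like the following (the hostname is a placeholder; adjust it to your master):

# Probe the master's ReqServer port. On a healthy master the connection is
# reported almost instantly. If your nc variant lacks -z, drop it and press
# Ctrl-C after the "Connected" line.
nc -vz -w 5 master.example.com 4506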
If the master is under high load then you will get timeouts, yes.
@tjyang your first example above is actually working: it attempts to connect to IPv6 localhost (::1), which doesn’t have an open port, but it succeeds when connecting to IPv4 localhost (Ncat: Connected to 127.0.0.1:4506.).
Another (perhaps the best) indication of this problem is a high Recv-Q value, as shown in the ss (or netstat) command output. For example, mine shows the salt-master PUB_RET :4506 Recv-Q at 4097 (it should be zero or very close to it).
My buddy ChatGPT explains Recv-Q quite well:
Recv-Q explanation
The “Recv-Q” column in the output of the Linux “netstat” command represents the amount of data (measured in bytes) that has been received by the network stack but has not yet been read by the application. In other words, it is the amount of data queued in the receive buffer, waiting to be processed by the application.
A non-zero value in the Recv-Q column might not necessarily indicate a problem, as it could be a transient situation where data is being received faster than the application can process it. However, if the Recv-Q value is consistently high or increasing over time, it may indicate an issue with the application, such as slow processing or being blocked due to other reasons.
To troubleshoot high or increasing Recv-Q values, you can take the following steps:
Identify the process: Use the “netstat” command with the “-p” flag to identify the process associated with the socket showing the high Recv-Q value. This will help you understand which application is experiencing the issue.
Monitor application performance: Check the CPU and memory usage of the application to see if it is under heavy load, causing it to process incoming data slowly. You can use tools like “top” or “htop” to monitor system resource usage.
Check application logs: Inspect the application logs to look for any errors or indications of slow processing. This may help you identify the cause of the issue.
Increase buffer sizes: If the application is working as expected, but the Recv-Q is still high, consider increasing the socket receive buffer size using the “setsockopt” function with the SO_RCVBUF option. This can help accommodate higher amounts of incoming data.
Optimize application code: If the application is struggling to keep up with incoming data, optimize its code to improve performance. This may involve refactoring or rewriting parts of the application to process data more efficiently.
Scale the application: If the application cannot be optimized further and is consistently receiving more data than it can handle, consider scaling the application horizontally by adding more instances or vertically by allocating more resources to the application.
Remember that occasional non-zero Recv-Q values are normal and not always a cause for concern. However, consistently high or increasing values warrant investigation and possible action to resolve the underlying issue.
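A hedged example of how one might check this on the master (standard ss filter syntax from iproute2; run as root so -p can show the owning process):

# List sockets on port 4506 with their Recv-Q/Send-Q and owning process;
# a persistently large Recv-Q on the salt-master sockets indicates it is
# not keeping up with incoming requests.
ss -tanp '( sport = :4506 )'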
Also seeing this regularly after updating to 3005 (master and minion). Does not appear to affect 3004 minions talking to a 3005 master. Causes minions to lose ‘connected’ status with the master, so the master cannot send commands to the minion. The minion however can still send commands to the master, e.g. the minion can run saltutil.sync_all just fine, but the same command run from the master produces a job that never returns.
Environment: master: 3005.1 (CentOS 7); minions: 3005.1 various (e.g. AlmaLinux 8, CloudLinux 8, CentOS 7, CloudLinux 7)
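For illustration, the asymmetry described above looks roughly like this (the minion ID is a placeholder):

# On an affected minion: returns fine
salt-call saltutil.sync_all

# On the master, targeting that minion: the job is published but never returns
salt 'affected-minion-id' saltutil.sync_all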
@rrrix The originator of the issue said it was okay to close, and the fix in #64176 resolves the saltutil.sync_all issue that @tjyang had. If memory serves me correctly, I believe the highstate will cause a refresh of grains, which at present clears the pillar cache, which #64176 fixes.
Salt 3006.1 is coming very soon, hence I will leave this closed; but if 3006.1 does not resolve your issues, please open a new issue with your conditions to reproduce the problem. And yes, it is well known that rendering the pillar cache is time-consuming, which is why it is recommended to limit the amount of data in pillars.
@tjyang Your command to make the timeout occur suggests the upcoming fix https://github.com/saltstack/salt/pull/64176, which addresses an issue where a sync_all would clear the pillar cache, forcing it to be rebuilt, which is a time-consuming operation. The fix should be available in 3006.1, which will be released soon.
@bastian-src Thanks for the information. Given @tjyang’s additional information, I believe https://github.com/saltstack/salt/pull/64176 fixes the issue encountered, hence closing this issue as you suggested.
I don’t see this now because we had to throw resources at the master to make it go away. I don’t believe it has gone away by other means.
@bastian-src
I use the command “sudo time salt-call saltutil.sync_all; date” to identify the timeout issue. Currently I don’t see the timeout message on my salt masters.
@dmurphy18 and @cmcmarrow thanks a lot for your comments!
I created this issue since the error produced an unexpected error in our tool, which it couldn’t handle (have a look here). I added a PR to enable it to handle the error, and afterwards I didn’t check whether it actually still occurs.
That being said, I just checked the log files of several Minions, but none had the issue anymore. No update of any Salt package (Master/Minion) happened since then - so, I assume maybe the Master hasn’t been under such high load again.
In case it is still an issue for other users @afletch @tjyang @RobinWlund, maybe they can provide logs as @dmurphy18 described. If not, I think the issue can be closed.
Thanks @dmurphy18 and @cmcmarrow for the hints in your posts and for the extensive explanation!
Hi @bastian-src, do you have the pillar cache enabled? If so, #64081 might be your fix. Any and all highstates will clear the pillar cache when run, which takes up a lot of CPU and memory on the master.
Also see this issue, which is being fixed: https://github.com/saltstack/salt/issues/64081. It may be the reason for the timeouts as well, since a refresh_grains will also clear the pillar data, which means it has to be recompiled, a time-consuming operation, leading to the master getting overwhelmed if it has to do so for many minions.
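For context, a hedged sketch of the master-side pillar cache settings being discussed (values are illustrative; the cache only helps while it is not constantly being invalidated by highstates or grain refreshes, as noted above):

# /etc/salt/master (illustrative values)
pillar_cache: True           # cache rendered pillar data per minion
pillar_cache_ttl: 3600       # seconds before the cached pillar is re-rendered
pillar_cache_backend: disk   # 'disk' (default) or 'memory'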
@bastian-src Before digging heavily into this, thought I’d check if you tried adjusting some configuration values to see if they help with the issues encountered:
Master config:
- worker_threads - default is 5; have you tried doubling this, or some other value, to see if it makes a difference?
- gather_job_timeout - default is 10; you could double it to see if it makes a difference.
- timeout - default is 5 seconds; can you try doubling it and see if it has any effect? In the past I have had to increase this value when building the classic packaging, which used Salt states and orchestration to generate OS VMs on AWS, build the various Salt releases, and then destroy the VMs used, which is similar to the use case here.
Minion config:
- enable_fqdns_grains - prior to 3006.0 this defaulted to True, which could induce a delay; in 3006.0 the default is False. Try setting it to False, or use 3006.0 minions, to see if it helps. The reverse DNS lookup could take up to 5 seconds, significantly affecting latency.
- enable_gpu_grains - if you don’t need GPU hardware grains, set this from the default True to False; it will speed up grain collection.
- auth_timeout - default is 60, which should be adequate, but you could double it to see if it has any effect.
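A hedged sketch of what those adjustments could look like (the doubled values are illustrative, not recommendations):

# /etc/salt/master (illustrative)
worker_threads: 10       # default 5
timeout: 10              # default 5 (seconds)
gather_job_timeout: 20   # default 10

# /etc/salt/minion (illustrative)
enable_fqdns_grains: False
enable_gpu_grains: False
auth_timeout: 120        # default 60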
If you could also provide more of the log, perhaps then I could see what operations are occurring; that may help narrow down the operations causing the timeout. A log file with trace enabled in the minion’s configuration, emailed to me at damurphy@vmware.com, would help. You should not post the log file here in case it contains sensitive information; please check that any sensitive information in the log file is redacted, if necessary.
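A minimal sketch of enabling trace-level logging on a minion for this purpose (trace is very verbose, so revert it after collecting the log):

# /etc/salt/minion (illustrative)
log_level_logfile: trace
log_file: /var/log/salt/minion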
I think adjusting ‘timeout’ and ‘worker_threads’ in the master config is most likely to have an effect, for the reasons stated above, especially worker_threads, since this occurs under load and more worker_threads are most probably needed to handle the additional load.
@bastian-src Thanks for that, just trying to duplicate the environment that has the issue, with respect to debugging and fixing it. Puppet - got a little startled 😃. I will see if I can duplicate similar conditions without having to install Foreman; if I can’t, I get to learn a new tool 😃
Can someone from the Salt team let us know, if they can, what was done to fix this? In my case the whole salt-cloud post-configuration is failing because of this problem. I have even approached Red Hat support and they could not find the root cause, and this happens with any OS I configure as the salt master. In short, it is becoming a blocker in my case.