salt: [BUG] intermittent connection between master and minion
Description
I am seeing a strange connection issue in my Salt setup. There are ~30 minions registered with the master. For a few of them, the master can no longer reach them after a while: salt '*' test.ping fails with the following error message:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230920213139507242
Here are a few observations:
- Restarting the salt-minion service helps, but the same minion loses its connection again after a while.
- salt-call test.ping works fine on the minion side, and other commands like salt-call state.apply also work fine. This indicates that minion-to-master communication is fine, but master-to-minion communication is not.
- Below is the error message I found in the minion log. I tried bumping up the timeout, e.g. salt '*' -t 600 test.ping, but it doesn't help.
2023-09-20 13:00:08,121 [salt.minion :2733][ERROR ][821760] Timeout encountered while sending {'cmd': '_return', 'id': 'minion', 'success': True, 'return': True, 'retcode': 0, 'jid': '20230920195941006337', 'fun': 'test.ping', 'fun_args': [], 'user': 'root', '_stamp': '2023-09-20T19:59:41.114944', 'nonce': '3b23a38761fc4e98a694448d36ac7f97'} request
Does anyone have any idea what's wrong here and how to debug this issue?
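For anyone hitting the same symptom, a minimal debugging sketch; it assumes the default ZeroMQ ports (4505 for the publish channel, 4506 for the return channel) and a systemd-managed salt-minion service, which may differ from this setup:

# On the master: which minions does the master currently consider up or down?
salt-run manage.status
salt-run manage.down

# On the affected minion: are both TCP connections to the master still established?
sudo ss -tnp | grep -E ':4505|:4506'

# On the affected minion: stop the service and run the minion in the foreground
# with debug logging to watch the reconnect and job-return traffic live
sudo systemctl stop salt-minion
sudo salt-minion -l debug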
Setup
- The minion was installed with sudo ./bootstrap-salt.sh -A <master-ip-address> -i $(hostname) stable 3006.3. There is no custom config on the minion.
- The master runs inside a container using the image saltstack/salt:3006.3. Master config:
nodegroups:
  prod-early-adopter: L@minion-hostname-1
  prod-general-population: L@minion-hostname-2
  release: L@minion-hostname-3
  custom: L@minion-hostname-4
file_roots:
  base:
    - <path/to/custom/state/file>
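For reference, nodegroups defined this way can be targeted from the CLI with the -N option, e.g. (using a nodegroup name from the config above):

salt -N prod-early-adopter test.ping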
state file:
pull_state_job:
  schedule.present:
    - function: state.apply
    - maxrunning: 1
    - when: 8:00pm

deploy:
  cmd.run:
    - name: '<custom-command-here>'
    - runas: ubuntu
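A small sketch for confirming that this schedule actually landed on the minions; it only assumes the pull_state_job entry defined above:

# apply the state, then list the schedule entries each minion ended up with
salt '*' state.apply
salt '*' schedule.list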
Versions Report
salt --versions-report
Salt Version:
Salt: 3006.3
Python Version:
Python: 3.10.4 (main, Apr 20 2022, 01:21:48) [GCC 10.3.1 20210424]
Dependency Versions:
cffi: 1.14.6
cherrypy: unknown
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.2
libgit2: Not Installed
looseversion: 1.0.2
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 22.0
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.9.8
pygit2: Not Installed
python-gnupg: 0.4.8
PyYAML: 6.0.1
PyZMQ: 23.2.0
relenv: Not Installed
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: alpine 3.14.6
locale: utf-8
machine: x86_64
release: 5.11.0-1022-aws
system: Linux
version: Alpine Linux 3.14.6
About this issue
- Original URL
- State: open
- Created 9 months ago
- Reactions: 3
- Comments: 19 (8 by maintainers)
This still seems to be an issue on 3006.7, even when the minion and master are the same version.
@darkpixel Yes, I have found the same thing and have the same workflow. Something just gets stuck and responses get lost somewhere. In my experience the minions are always still receiving events, however, as you say.