salt: Minions randomly becoming unresponsive, sometimes stuck at 100% CPU usage
Description of Issue/Question
I’ve been noticing that a number of my minions stop responding after being left alone for a while. This did not start with an upgrade of Salt, a new state, or any other change I can attribute it to. There does not appear to be any single action that causes the minions to stop responding; they just do.
Setup
I have over 80 minions, all identically configured (they are compute nodes in an HPC cluster). Starting from a point where every minion is up and responding, after a few days I begin to see this when running a state or doing anything else via Salt:
cn32:
Minion did not return. [No response]
cn66:
Minion did not return. [No response]
cn59:
Minion did not return. [No response]
cn68:
Minion did not return. [No response]
… with more and more minions developing the problem as time goes on. When I SSH to an affected system to investigate, I sometimes, but not always, find a salt-minion process stuck at 100% CPU. That is a major problem in my environment and a more pressing matter than being unable to communicate with a minion.
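For anyone comparing notes, spotting the unresponsive minions from the master and confirming the spinning process on a node can be done with something along these lines (the PID is a placeholder):

# salt-run manage.down
# salt '*' test.ping --timeout=10
# ps -C salt-minion -o pid,pcpu,etime,cmd
# strace -f -p <PID>

manage.down lists minions that no longer respond to the master, while ps and strace on the node show whether the salt-minion process is pegged and, at the syscall level, roughly what it is doing.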
Steps to Reproduce Issue
- Start a minion.
- Wait.
Versions Report
# salt --versions-report
Salt Version:
Salt: 2016.3.1
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 1.5
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.6
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 14.7.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5
System Versions:
dist: centos 7.2.1511 Core
machine: x86_64
release: 3.10.0-327.4.5.el7.x86_64
system: Linux
version: CentOS Linux 7.2.1511 Core
Please let me know what more you need. This is a major issue for me; at the moment SaltStack is unusable and is actively harming my systems (a minion stuck at 100% CPU means the scientific workloads on those machines can’t run properly).
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 15 (7 by maintainers)
We just saw this happen in production when a minion failed over. Is anybody who sees this running minions with multiple masters?
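In case it helps others compare setups: in a multi-master failover configuration the minion’s master setting is a list and master_type is set to failover, so whether an affected minion is running that way can be checked with something like this (assuming the default config path):

# grep -E -A 3 '^(master_type|master_alive_interval|master):' /etc/salt/minion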