salt: daily highstate fails after 2019.2 upgrade
After upgrading masters and minions to 2019.2, I observed the following behavior:
- the daily scheduled highstate fails to run UNLESS a pillar refresh has been run manually recently (within the last 12 hours or so), and even that doesn’t always make it run
- when the highstate does run and there are any failures, I receive 20 or so failure reports per minion via the returner, as opposed to just one with 2018.3.4; it appears that the minion retries state.apply over and over again for about two hours, after which it stops
This happens on two different masters.
Here is the pillar/schedule.sls for one of the masters:
schedule:
  highstate:
    function: state.highstate
    when: 4:30am
    splay: 600
    maxrunning: 1
    returner: highstate
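(In case it helps with reproduction: since the schedule comes from pillar, the sanity check I’d suggest on a minion that misses its run is roughly the following; the minion ID is just a placeholder.)

    salt 'some-minion' saltutil.refresh_pillar
    salt 'some-minion' pillar.get schedule
    salt 'some-minion' schedule.list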
Downgrading salt and salt-minion on the minions fixed the problem.
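(If anyone needs to do the same on CentOS 7, something like the following should roll a minion back to the previous packaged release; the exact result depends on which versions your yum repo still carries.)

    yum downgrade salt salt-minion
    systemctl restart salt-minion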
Here is one of the upgraded minions:
Salt Version:
    Salt: 2019.2.0
Dependency Versions:
    cffi: Not Installed
    cherrypy: Not Installed
    dateutil: 1.5
    docker-py: Not Installed
    gitdb: Not Installed
    gitpython: Not Installed
    ioflo: Not Installed
    Jinja2: 2.7.2
    libgit2: Not Installed
    libnacl: Not Installed
    M2Crypto: 0.31.0
    Mako: Not Installed
    msgpack-pure: Not Installed
    msgpack-python: 0.5.6
    mysql-python: 1.2.5
    pycparser: Not Installed
    pycrypto: 2.6.1
    pycryptodome: Not Installed
    pygit2: Not Installed
    Python: 2.7.5 (default, Apr 9 2019, 14:30:50)
    python-gnupg: Not Installed
    PyYAML: 3.11
    PyZMQ: 15.3.0
    RAET: Not Installed
    smmap: Not Installed
    timelib: Not Installed
    Tornado: 4.2.1
    ZMQ: 4.1.4
System Versions:
    dist: centos 7.6.1810 Core
    locale: UTF-8
    machine: x86_64
    release: 3.10.0-957.12.2.el7.x86_64
    system: Linux
    version: CentOS Linux 7.6.1810 Core
Here is the master:
Salt Version:
    Salt: 2019.2.0
Dependency Versions:
    cffi: 1.6.0
    cherrypy: Not Installed
    dateutil: 1.5
    docker-py: Not Installed
    gitdb: Not Installed
    gitpython: Not Installed
    ioflo: Not Installed
    Jinja2: 2.7.2
    libgit2: 0.26.3
    libnacl: Not Installed
    M2Crypto: 0.31.0
    Mako: Not Installed
    msgpack-pure: Not Installed
    msgpack-python: 0.5.6
    mysql-python: 1.2.5
    pycparser: 2.14
    pycrypto: 2.6.1
    pycryptodome: Not Installed
    pygit2: 0.26.4
    Python: 2.7.5 (default, Apr 9 2019, 14:30:50)
    python-gnupg: Not Installed
    PyYAML: 3.11
    PyZMQ: 15.3.0
    RAET: Not Installed
    smmap: Not Installed
    timelib: 0.2.4
    Tornado: 4.2.1
    ZMQ: 4.1.4
System Versions:
    dist: centos 7.6.1810 Core
    locale: UTF-8
    machine: x86_64
    release: 3.10.0-957.12.2.el7.x86_64
    system: Linux
    version: CentOS Linux 7.6.1810 Core
Thanks.
About this issue
- State: closed
- Created 5 years ago
- Comments: 38 (8 by maintainers)
Was not able to test today. Will try to have a test result by Monday.
Alright. I’m hoping this is somewhat helpful or at least might give us a direction to move in.
Recap: all of the minions attached to one master, and that master itself, have been upgraded to the latest version. Yesterday I set three minions, plus the master, to run in the foreground (roughly as sketched below).
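By "running in the foreground" I mean stopping the service and starting the daemon by hand with debug logging; a rough sketch of the idea, not my exact invocations:

    # on each test minion
    systemctl stop salt-minion
    salt-minion -l debug

    # on the master
    systemctl stop salt-master
    salt-master -l debug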
One of the states in my highstate ensures that the salt-minion service is running, so on the three minions I was running in the foreground, the highstate started the salt-minion service back up. Those boxes now look like this:
Drilling down a bit: this particular host has another scheduled job (not the highstate) that is supposed to run once every morning, but this morning it started and has kept running every few minutes. Here’s the pillar schedule file:
And here’s the debug data for what has been running on that minion every few minutes since this morning:
I don’t know (a) why that scheduled job keeps running every few minutes, (b) whether it could be caused by having more than one minion process running, or (c) whether this is related at all to the original problem this issue was opened for. On some other minions, which I am not running in the foreground, I can see a highstate run that repeated over and over again this morning, even though there were no failures in the run. The salt-minion logs on those minions are empty, so I don’t have any data to report from them. It seems like it could be a similar problem, though, to the one I /do/ have debug data for here, even though that one is not a highstate run.
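Regarding (b), a quick way to check for stray minion processes is just a process listing, e.g. (illustrative only):

    pgrep -af salt-minion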
Hope this helps somehow. Let me know what to try next. Thanks much!
Catching up on this and I might have been able to reproduce it. Need to do a bit more testing.
I’m going to try and reproduce the issue with what we got, I’ll let you know if I need more info, thank you!
I am now testing with a 24 hour interval (debug level logging).
Thanks @H20-17 for all the work.
I need to upgrade my master to the latest update, which I will do today.
I’ll upgrade a bunch of minions, too, and let you know what happens.
In my experience, running the highstate more often than every 8-12 hours keeps the bug from showing up, so I’m going to leave it at every 24 hours for a few days.
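For clarity, the 24-hour test schedule is interval-based rather than time-of-day based; roughly like this (the job name and returner are just carried over from the original pillar, not necessarily what anyone else needs):

    schedule:
      highstate:
        function: state.highstate
        hours: 24
        splay: 600
        maxrunning: 1
        returner: highstate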
Thanks, @Akm0d for putting this at the top of your list.
I’m running a highstate every 15 minutes now on a test minion and I’m failing to reproduce the issue (schedules from pillar on the master). When I was experiencing this I was running a highstate just once a week on every minion (also scheduled from pillar on the master). When it was happening, it would happen on just about every Windows minion and all it took was the one scheduled highstate. I will keep my test minion highstating all night (every 15 minutes). It’s possible that the issue was inadvertently resolved by other changes (other bug fixes, perhaps?).
I’ll let you guys know if I find anything. For the time being everything is working. (I’m now fully up to date with 2019.2.3 (py3) on both the master and the Windows minion.)
I will look into this too. I haven’t tried running scheduled highstates since 2019.2.0 because of this, but I’ll see what happens now.
Thanks, @Akm0d. Sorry for not clarifying: the errors were state.apply failures, not errors per se.
I should note that this has followed me into CentOS 7.7, and persists even with the latest 2019.2 versions.