salt: Inconsistent pillar data from git_pillar
Description of Issue/Question
I am using git_pillar as the only pillar provider, with "dynamic" environments (__env__ instead of a hardcoded branch name). The way I understand it, this setup allows you to fork a branch in your Git repo and change values in Pillar so that only minions in the same pillarenv see that change.
Unfortunately, the pillar representation is flapping between different branches. Here is a minion with environment: dev and pillarenv: dev to whose pillar I added a new key; its value occasionally disappears, whether the key is queried from the master or from the minion. In the dev branch this key exists with the value you see in the first two attempts; in another branch, production, the key doesn't exist.
(dev) root@node-3:~$ salt-call pillar.item ovs_bridges:br-ex-vdc
local:
    ----------
    ovs_bridges:br-ex-vdc:
        vlan430
(dev) root@node-3:~$ salt-call pillar.item ovs_bridges:br-ex-vdc
local:
    ----------
    ovs_bridges:br-ex-vdc:
        vlan430
(dev) root@node-3:~$ salt-call pillar.item ovs_bridges:br-ex-vdc
local:
    ----------
    ovs_bridges:br-ex-vdc:
I suspect this might be caused by a race between simultaneous pillar renderings on the master (since Git pillar data is saved on the master in the form of a single directory in which checkouts are performed), but have not verified it.
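To illustrate the suspicion, here is a toy sketch (plain Python, not Salt code): two renders share one checkout directory, each does checkout-then-read, and nothing keeps the directory pinned between those two steps.

import threading
import time

# Stand-in for the single shared git_pillar checkout directory on the master.
workdir = {"foo.sls": "from master"}

def render(env):
    # "checkout": switch the shared directory to this environment's branch
    workdir["foo.sls"] = "from %s" % env
    time.sleep(0.01)          # window in which another render may check out
    # "compile pillar": read whatever happens to be on disk right now
    print("%s render saw: %s" % (env, workdir["foo.sls"]))

threads = [threading.Thread(target=render, args=(env,)) for env in ("dev", "production")]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every so often the dev render prints "from production" (or vice versa).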
There is no pillar data merging configured, because I'd like pillar data between environments to be completely separate:
...
top_file_merging_strategy: same
pillar_source_merging_strategy: none
...
Maybe it would help if the Pillar repository on the Salt master were spread across separate directories (one per branch) instead of being checked out inside a single directory. Otherwise we have to be sure that every operation that can potentially perform a git checkout obeys a lock set by the process that is currently compiling the pillar.
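As a rough illustration of the per-branch-directory idea, here is a hypothetical helper (the layout and helper name are invented for this sketch, not Salt's actual cache structure):

import hashlib
import os

def env_cachedir(cachedir, remote_url, branch):
    # Key the working directory by (remote, branch) so checking out one branch
    # can never disturb another environment's files.
    repo_hash = hashlib.sha1(remote_url.encode("utf-8")).hexdigest()
    return os.path.join(cachedir, "git_pillar", repo_hash, branch)

print(env_cachedir("/var/cache/salt/master",
                   "ssh://git@git.domain.tld/group/salt.git", "dev"))
# -> /var/cache/salt/master/git_pillar/<sha1 of the remote URL>/dev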
Steps to Reproduce Issue
- Install Salt 2016.11.2 on both master and minion, and configure new-style Git ext_pillar with dynamic __env__ mapping. I'm using pygit2 right now, but the issue was identical while using gitpython:

  git_pillar_global_lock: False
  git_pillar_provider: pygit2
  git_pillar_privkey: /root/.ssh/id_rsa
  git_pillar_pubkey: /root/.ssh/id_rsa.pub
  ext_pillar:
    - git:
      - __env__ ssh://git@git.domain.tld/group/salt.git:
        - root: pillar

- Configure the minion to use the environment and pillarenv options, both initially set to the default environment base corresponding to the master branch.
- In the default branch, create a Top file for pillar that looks like this:

  {{ saltenv }}:
    node-3:
      - foo

- In the default branch, create a Pillar file foo.sls containing some data.
- Fork another branch, for example dev, off the default one, and modify foo.sls so that it has a superset of the default branch's data (to rule out merging issues).
- Spawn another minion, this time with environment and pillarenv set to dev.
- Run salt-call pillar.items on both of them, preferably in a "for" loop to achieve concurrency (see the sketch after this list). The results should be inconsistent between calls on the same minion.
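To get the concurrency mentioned in the last step, a throwaway script like the following can serve as the "for" loop; run it on both minions at the same time. It simply shells out to salt-call and reports whenever the answer for one key changes (the key below is the one from this report; substitute your own).

import subprocess

KEY = "ovs_bridges:br-ex-vdc"   # pillar key from this report; substitute your own
seen = set()

for attempt in range(50):
    # Query the key the same way as in the transcript above and keep the raw output.
    out = subprocess.check_output(["salt-call", "pillar.item", KEY]).decode("utf-8").strip()
    if out not in seen:
        print("attempt %d returned a new result:\n%s\n" % (attempt, out))
        seen.add(out)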
Versions Report
Master:
Salt Version:
Salt: 2016.11.2
Dependency Versions:
cffi: 1.5.2
cherrypy: Not Installed
dateutil: 2.4.2
gitdb: 0.6.4
gitpython: 1.0.1
ioflo: Not Installed
Jinja2: 2.8
libgit2: 0.24.0
libnacl: Not Installed
M2Crypto: Not Installed
Mako: 1.0.3
msgpack-pure: Not Installed
msgpack-python: 0.4.6
mysql-python: 1.3.7
pycparser: 2.14
pycrypto: 2.6.1
pygit2: 0.24.0
Python: 2.7.12 (default, Nov 19 2016, 06:48:10)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.2.0
RAET: Not Installed
smmap: 0.9.0
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4
System Versions:
dist: Ubuntu 16.04 xenial
machine: x86_64
release: 4.4.0-24-generic
system: Linux
version: Ubuntu 16.04 xenial
Minion:
Salt Version:
Salt: 2016.11.2
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.4.2
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.8
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: 0.9.1
msgpack-pure: Not Installed
msgpack-python: 0.4.7
mysql-python: 1.2.3
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 14.0.1
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.4
System Versions:
dist: Ubuntu 14.04 trusty
machine: x86_64
release: 3.19.0-58-generic
system: Linux
version: Ubuntu 14.04 trusty
About this issue
- State: closed
- Created 7 years ago
- Comments: 81 (65 by maintainers)
Commits related to this issue
- git_pillar: Avoid checkout to improve performance and fix inconsistency Fixes: #39420 This also add several features from gitfs: - git_pillar_saltenv_whitelist / git_pillar_saltenv_blacklist - git_p... — committed to sathieu/salt by sathieu 4 years ago
- git_pillar: Avoid checkout to improve performance and fix inconsistency Fixes: #39420 This also add several features from gitfs: - git_pillar_saltenv_whitelist / git_pillar_saltenv_blacklist - git_p... — committed to Vaarlion/salt by sathieu 4 years ago
What is actually missing here that could be preventing progress? Does anyone know if the patch is still needed and/or functional with Salt 3004?
One of the ideas I've had about this is to change it so that each dynamic env gets its own git folder to cache data, rather than one single git folder whose branch is switching constantly as requests for different envs come in.
that would
OK, I’m fairly certain I know what is going on. The checkout happens here, while the Pillar compilation happens later (here). The checkout lock is only preventing a simultaneous attempt at checking out the repo, but once the checkout is complete, the lock is removed. So minion A checks out its branch, but between when minion A checks it out and compiles the pillar SLS, minion B has separately checked out a different branch.
To fix this, the locking must be more intelligent and last until Pillar compilation is complete. This means we need to have a way for minions which need the same branch to be able to keep the lock alive, so that we're not having each invocation of git_pillar block. So essentially, if any minions need the master branch at a given time, the lock must remain in place. This would probably entail creating lock files named in the format checkout.<branch_or_tag_name>.<minion_id>.lk instead of checkout.lk. The contextmanager which obtains the checkout lock should also be invoked from within the git_pillar code and not the checkout() func in salt.utils.gitfs.
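A rough sketch of what that keepalive locking could look like, following the checkout.<branch_or_tag_name>.<minion_id>.lk naming above; the helper is invented for illustration and is not Salt's implementation:

import contextlib
import glob
import os
import time

@contextlib.contextmanager
def hold_branch(lockdir, branch, minion_id):
    # Keep <branch> pinned in the shared working copy until pillar compilation ends.
    # Minions needing the same branch hold their own lock files concurrently;
    # a different branch must wait until all of them are gone.
    my_lock = os.path.join(lockdir, "checkout.%s.%s.lk" % (branch, minion_id))
    while True:
        others = [p for p in glob.glob(os.path.join(lockdir, "checkout.*.lk"))
                  if ".%s." % branch not in os.path.basename(p)]
        if not others:
            break                       # only our branch is active (or no locks at all)
        time.sleep(0.1)
    open(my_lock, "w").close()          # ignores the check-then-create race for brevity
    try:
        yield                           # caller checks out <branch> and compiles pillar here
    finally:
        os.remove(my_lock)

# with hold_branch("/var/cache/salt/master/git_pillar/<repo>", "dev", "node-3"):
#     checkout_branch("dev")
#     compile_pillar()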
https://github.com/saltstack/salt/pull/65017
Changes:
- ../salt/../hash/branch (Done)
- ../salt/../hash/branch/.git/.salt, fetch_on_fail, checkout(self, fetch_on_fail=True) (Done)
- clean_old_remotes (Done)
- __env__: when GitBase.fetch_remotes is called, the current dynamic branch will fetch and all its sister branches will have a fetch_request file placed in their Salt working dir. When a sister branch inits or goes to checkout, it will see the fetch_request and run fetch.
TODO in a later PR:
- Remove old remote branches/tags: clear_old_remotes() will remove any remote branch that has outlived its time to live.
I would appreciate it if you guys give it a try and/or take a look.
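For readers following along, a toy version of the fetch_request mechanism described above (the helper names and flag-file handling are illustrative, not the PR's actual code):

import os

def request_fetch_for_sisters(env_dirs, current_env):
    # After the current dynamic env fetches, drop a flag file in every sister
    # env's working dir so it knows to catch up on its next checkout.
    for env, workdir in env_dirs.items():
        if env != current_env:
            open(os.path.join(workdir, "fetch_request"), "w").close()

def checkout(workdir, fetch):
    # A sister env fetches only if it finds the flag, then clears it and
    # proceeds with its normal checkout.
    flag = os.path.join(workdir, "fetch_request")
    if os.path.exists(flag):
        fetch()
        os.remove(flag)
    # ... perform the actual checkout of this env's branch here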
I doubt it's related, but I was getting a "Template does not exist" error because pillar/__init__.py was computing the top cache as an empty string while using pillarenv in the minion config. I submitted a fix as #39516. Figured I'd mention it, as this is the only sort-of recent issue with "template does not exist" in it.
https://saltproject.io/security-announcements/2023-08-10-advisory/
https://github.com/saltstack/salt/commit/6120bcac2ee79b6e8b104612941432841eb0c8c3
I have the same idea. It takes more space but accelerates the process. The problem is how to clean up old branches. I think it is important to clear the directories of old branches after a certain period.
@Vaarlion nice catch
It would look like this issue only occurs on the master when running a salt-call. Other minions don't have that issue during a salt-call, and the master doesn't have the issue when I use salt to target it directly.
If someone picks this up, I hope this helps.
Worktrees weren’t the way to go. Adding a global lock on the ext_pillar function to make it atomic solves the problem.
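For illustration, a sketch of that approach, serializing the whole ext_pillar call behind one master-wide file lock; the lock path and wrapper are invented here, and this is not the actual patch:

import contextlib
import fcntl

@contextlib.contextmanager
def pillar_compile_lock(path="/var/cache/salt/master/git_pillar.compile.lk"):
    # One exclusive lock for the whole master: checkout plus render becomes
    # atomic with respect to every other git_pillar render.
    with open(path, "w") as lockfile:
        fcntl.flock(lockfile, fcntl.LOCK_EX)   # blocks until no other render holds it
        try:
            yield
        finally:
            fcntl.flock(lockfile, fcntl.LOCK_UN)

def ext_pillar(minion_id, pillar, *repos):
    with pillar_compile_lock():
        # check out the branch for this minion's pillarenv and render its pillar here
        return {}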
@peter-slovak we actually do this already for gitfs, using the Python bindings rather than the CLI. The trouble with this is that git_pillar still invokes the salt.pillar.Pillar class (used also for the conventional pillar_roots data) to parse the top file and load the pillar matches. A hybrid option, where the top file is parsed first and then the pillar matches are extracted directly from the refs, may work. But doing this separately for each minion may cause its own performance issues. I'll give this some more thought, and discuss with my fellow engineers.
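As a rough illustration of reading pillar data "directly from the refs" without a checkout (using GitPython, since it is already installed in this environment; the repository path and SLS path are placeholders):

import git  # GitPython

# Read a pillar SLS straight out of a branch's tree, without touching the
# shared working directory. Paths below are placeholders for this sketch.
repo = git.Repo("/var/cache/salt/master/git_pillar/<repo>")
commit = repo.commit("origin/dev")            # the branch backing the dev pillarenv
blob = commit.tree / "pillar" / "foo.sls"     # no checkout needed
print(blob.data_stream.read().decode("utf-8"))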
There is this error in the master log, but not explicitly tied to gitfs locking - although suspicious enough:
The other is a warning, presumably because git_pillar_global_lock is set to False (but setting it to True changes nothing in the behavior):
Besides that, I ran some tests on two minions:
- api-1 is in environment dev and has the pillar item role: internal set
- api-2 is in environment dev2 and has the pillar item role: foobar set
saltutil.refresh_pillar
These are the results:
So that definitely looks like a race. Now get this:
Unfortunately I don't have in-depth knowledge of how the master handles simultaneous pillar compilation for multiple minions, but my best guess is that some stray checkout doesn't adhere to the lock you mentioned. I'll do my best to look into the git_pillar implementation in the next few days and see if we can easily mimic the behavior of gitfs for state files, that is, having a separate directory for each dynamic environment. Thanks for any help until then.