salt: Inconsistent pillar data from git_pillar

Description of Issue/Question

I am using git_pillar as the only pillar provider, with “dynamic” environments (__env__ instead of a hardcoded branch name). The way I understand it, this setup allows you to fork a branch in your Git repo and change values in Pillar so that only minions in the same pillarenv see that change.

Unfortunately, the pillar representation is flapping between branches. Here is a minion with environment: dev and pillarenv: dev, to whose pillar I added a new key whose value occasionally disappears, whether I query the key from the master or from the minion. In the dev branch the key exists with the value you see in the first two attempts; in another branch, production, the key doesn’t exist.

(dev) root@node-3:~$ salt-call pillar.item ovs_bridges:br-ex-vdc                                                                                                                        
local:
    ----------
    ovs_bridges:br-ex-vdc:
        vlan430
(dev) root@node-3:~$ salt-call pillar.item ovs_bridges:br-ex-vdc
local:
    ----------
    ovs_bridges:br-ex-vdc:
        vlan430
(dev) root@node-3:~$ salt-call pillar.item ovs_bridges:br-ex-vdc
local:
    ----------
    ovs_bridges:br-ex-vdc:

I suspect this might be caused by a race between simultaneous pillar renderings on the master (since git_pillar data is stored on the master as a single directory in which checkouts are performed), but I have not verified it.

There is no pillar data merging configured, because I’d like pillar data to be completely separated between environments:

...
top_file_merging_strategy: same
pillar_source_merging_strategy: none
...

Maybe it would help if the pillar repository on the Salt master were spread across separate directories (one per branch) instead of being checked out inside a single directory. Otherwise, we have to be sure that every operation that can potentially perform a git checkout obeys a lock set by the process currently compiling the pillar.

Steps to Reproduce Issue

  • Install Salt 2016.11.2 on both the master and the minion, and configure new-style Git ext_pillar with dynamic __env__ mapping. I’m using pygit2 right now, but the issue was identical with gitpython:
git_pillar_global_lock: False
git_pillar_provider: pygit2
git_pillar_privkey: /root/.ssh/id_rsa
git_pillar_pubkey: /root/.ssh/id_rsa.pub

ext_pillar:
  - git:
    - __env__ ssh://git@git.domain.tld/group/salt.git:
      - root: pillar
  • Configure the minion with the environment and pillarenv options, both initially set to the default environment base corresponding to the master branch.
  • In the default branch, create a Top file for pillar that looks like this:
{{ saltenv }}:
  node-3:
    - foo
  • In the default branch, create a Pillar file foo.sls containing some data (a minimal example is sketched just after this list)
  • Fork another branch, for example dev, off the default one, and modify foo.sls so that it has a superset of default’s data (to rule out merging issues).
  • Spawn another minion, this time with environment and pillarenv set to dev
  • Run salt-call pillar.items on both of them, preferably in a “for” loop to achieve concurrency. The results should be inconsistent between calls on the same minion.
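
For illustration only (the actual pillar data isn’t shown in the issue; these keys are made up), a minimal foo.sls pair satisfying the steps above could look like this:

# pillar/foo.sls in the default (base) branch -- hypothetical example data
common_key: base-value

# pillar/foo.sls in the dev branch -- a superset of the default branch's data
common_key: base-value
dev_only_key: dev-value

With this in place, only minions running with pillarenv dev should ever see dev_only_key.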

Versions Report

Master:

Salt Version:
           Salt: 2016.11.2
 
Dependency Versions:
           cffi: 1.5.2
       cherrypy: Not Installed
       dateutil: 2.4.2
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: 0.24.0
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 1.0.3
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.3.7
      pycparser: 2.14
       pycrypto: 2.6.1
         pygit2: 0.24.0
         Python: 2.7.12 (default, Nov 19 2016, 06:48:10)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4
 
System Versions:
           dist: Ubuntu 16.04 xenial
        machine: x86_64
        release: 4.4.0-24-generic
         system: Linux
        version: Ubuntu 16.04 xenial

Minion:

Salt Version:
           Salt: 2016.11.2
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.4.2
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 0.9.1
   msgpack-pure: Not Installed
 msgpack-python: 0.4.7
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.0.1
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.4
 
System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.19.0-58-generic
         system: Linux
        version: Ubuntu 14.04 trusty

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 81 (65 by maintainers)

Most upvoted comments

What is actually missing here that could be preventing progress? Does anyone know if the patch is still needed and/or functional with Salt 3004?

One of the ideas I’ve had about this is to change it so that each dynamic env gets its own git folder to cache data, rather than one single git folder whose branch is switched constantly as requests for different envs come in.

That would:

  • a) speed up pillar lookups across dynamic envs, as it doesn’t need to swap between branches
  • b) stop the inconsistent data, as it will be looking at a git cache folder specific to that branch/ref (see the layout sketch below)
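
For illustration only (the concrete paths are assumptions modeled on the checkout cache path visible in the warning log further down), the difference would be roughly:

Current layout - one checkout per remote, with the branch flipping in place:
  /var/cache/salt/master/git_pillar/<hash>/

Proposed layout - one checkout per remote per dynamic env:
  /var/cache/salt/master/git_pillar/<hash>/master/
  /var/cache/salt/master/git_pillar/<hash>/dev/
  /var/cache/salt/master/git_pillar/<hash>/production/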

OK, I’m fairly certain I know what is going on. The checkout happens here, while the Pillar compilation happens later (here). The checkout lock is only preventing a simultaneous attempt at checking out the repo, but once the checkout is complete, the lock is removed. So minion A checks out its branch, but between when minion A checks it out and compiles the pillar SLS, minion B has separately checked out a different branch.

To fix this, the locking must be more intelligent and last until Pillar compilation is complete. This means we need a way for minions which need the same branch to keep the lock alive, so that we don’t end up with every invocation of git_pillar blocking. So essentially, if any minions need the master branch at a given time, the lock must remain in place. This would probably entail creating lock files named in the format checkout.<branch_or_tag_name>.<minion_id>.lk instead of checkout.lk. The contextmanager which obtains the checkout lock should also be invoked from within the git_pillar code and not the checkout() func in salt.utils.gitfs.
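
A rough sketch of that idea follows; this is not Salt’s actual code, and the lock directory, helper names, and the compile_pillar()/repo.checkout() calls in the usage comment are assumptions used only to show the shape of the locking:

import contextlib
import os

@contextlib.contextmanager
def checkout_lock(lock_dir, ref, minion_id):
    # Hypothetical per-ref, per-minion lock file, e.g. checkout.dev.node-3.lk.
    # Because every minion holds its own file, the ref stays "in use" until the
    # last minion that needs it has finished compiling its pillar.
    lock_file = os.path.join(lock_dir, 'checkout.{0}.{1}.lk'.format(ref, minion_id))
    with open(lock_file, 'w') as fh:
        fh.write(str(os.getpid()))
    try:
        yield
    finally:
        try:
            os.remove(lock_file)
        except OSError:
            pass

def ref_in_use(lock_dir, ref):
    # True while any minion still holds a lock for this ref; a checkout of a
    # different ref would have to wait until this returns False.
    prefix = 'checkout.{0}.'.format(ref)
    return any(name.startswith(prefix) and name.endswith('.lk')
               for name in os.listdir(lock_dir))

# Invoked from the git_pillar code rather than from checkout() itself, so the
# lock spans both the checkout and the subsequent pillar compilation:
#
# with checkout_lock(repo_lock_dir, 'dev', 'node-3'):
#     repo.checkout('dev')
#     compile_pillar()  # stand-in for the salt.pillar.Pillar work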

https://github.com/saltstack/salt/pull/65017

Changes:

  • Place each remote’s branch/tag in its own folder ../salt/../hash/branch - Done
    • This includes locks and links
  • Clear Minion and Master Cache on salt version change - Done
    • This is so people don’t need to manually clear salt cache when upgrading
  • Add salt working dir ../salt/../hash/branch/.git/.salt
    • This is so we decrease the chances of stepping on git
  • add fetch_on_fail to checkout(self, fetch_on_fail=True) - Done
    • This is so git_pillar does not need to worry about whether a fetch is required when working with dynamic branches
  • fix clean_old_remotes - Done
  • fix fetch for __env__
    • add fetch_request - when GitBase.fetch_remotes is called, the current dynamic branch will fetch and all of its sister branches will have a fetch_request file placed in their salt working dir. When a sister branch inits or goes to checkout, it will see the fetch_request and run a fetch

TODO later PR:

  • remove old remote branches/tags

    • Make a timestamp file every time a git provider object is made in a remote branch
    • clear_old_remotes() will remove any remote branch that has outlived its time to live.

    I would appreciate it if you could give it a try and/or take a look.

I doubt it’s related, but I was getting a Template does not exist error because pillar/__init__.py was computing the top cache as an empty string when pillarenv was used in the minion config. I submitted a fix as #39516. Figured I’d mention it, as this is the only sort-of recent issue with “template does not exist” in it.

I have the same idea. It takes more space but accelerates the process. The problem is how to clean up old branches; I think it is important to clear the directories of old branches after a certain period.

@Vaarlion nice catch

It would look like this issue only occurs on the master when running a salt-call. No other minion has that issue during a salt-call, and the master doesn’t have the issue when I use salt to target it directly.

If someone picks this up, I hope this helps.

Worktrees weren’t the way to go. Adding a global lock on the ext_pillar function to make it atomic solves the problem.
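
For reference, a minimal sketch of what a global lock around the ext_pillar work could look like, assuming a process-shared fcntl file lock; the lock path and the decorator are placeholders of my own, not what the actual change does:

import fcntl

# Assumed location; any path shared by all master worker processes would do.
GIT_PILLAR_LOCK = '/var/cache/salt/master/git_pillar/ext_pillar.lk'

def with_global_git_pillar_lock(compile_fn):
    # Wrap the whole checkout + compile step in one exclusive, process-shared
    # lock, so a request for another pillarenv cannot switch the branch
    # underneath the current compilation.
    def wrapper(minion_id, pillar, *repos):
        with open(GIT_PILLAR_LOCK, 'w') as lock_fh:
            fcntl.flock(lock_fh, fcntl.LOCK_EX)
            try:
                return compile_fn(minion_id, pillar, *repos)
            finally:
                fcntl.flock(lock_fh, fcntl.LOCK_UN)
    return wrapper

The obvious trade-off of such a sketch is that every git_pillar compilation is serialized across minions.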

@peter-slovak we actually do this already for gitfs, using the Python bindings rather than the CLI. The trouble with this is that git_pillar still invokes the salt.pillar.Pillar class (used also for the conventional pillar_roots data) to parse the top file and load the pillar matches. A hybrid option, where the top file is parsed first, and then the pillar matches are extracted directly from the refs, may work. But doing this separately for each minion may cause its own performance issues.

I’ll give this some more thought, and discuss with my fellow engineers.

There is this error in the master log, but not explicitly tied to gitfs locking - although suspicious enough:

2017-02-15 23:07:46,460 [salt.template    ][ERROR   ][27564] Template does not exist:

The other is a warning, presumably because git_pillar_global_lock is set to False (but setting it to True changes nothing in the behavior):

2017-02-15 23:07:44,920 [salt.utils.gitfs ][WARNING ][27568] git_pillar_global_lock is enabled and checkout lockfile /var/cache/salt/master/git_pillar/b4092923fbdfd694387b13fcfbb90712d13692542d47e70913049848f672b8a6/.git/checkout.lk is present for git_pillar remote '__env__ ssh://git@git.domain.tld/group/salt.git'. Process 27585 obtained the lock

Besides that, I ran some tests on two minions:

  • minion api-1 is in environment dev and has the pillar item role: internal set
  • minion api-2 is in environment dev2 and has the pillar item role: foobar set
  • pillar has been synced from master with saltutil.refresh_pillar

These are the results:

(dev) root@api-1:~$ for i in {1..20}; do salt-call pillar.get role | grep -q internal && echo 'Pillar correct' || echo 'Race lost!'; done
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Race lost!
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Race lost!
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Pillar correct
(dev2) root@api-2:~$ for i in {1..20}; do salt-call pillar.get role | grep -q foobar && echo 'Pillar correct' || echo 'Race lost!'; done
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Race lost!
Race lost!
Race lost!
Pillar correct
Race lost!
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Pillar correct
Pillar correct

So that definitely looks like a race. Now get this:

root@saltmaster:~$ salt 'api-*' saltutil.refresh_pillar
api-2:
    True
api-1:
    True
root@saltmaster:~$ salt 'api-*' pillar.item role  # this one is correct
api-1:
    ----------
    role:
        internal
api-2:
    ----------
    role:
        foobar
root@saltmaster:~$ salt 'api-*' saltutil.refresh_pillar
api-2:
    True
api-1:
    True
root@saltmaster:~$ salt 'api-*' pillar.item role
api-1:
    ----------
    role:
        foobar
api-2:
    ----------
    role:
        foobar

Unfortunately I don’t have in-depth knowledge of how the master handles simultaneous pillar compilation for multiple minions, but my best guess is that some stray checkout doesn’t adhere to the lock you mentioned. I’ll do my best to look into the git_pillar implementation in the next few days and see if we can easily mimic the behavior of gitfs for state files, that is, having a separate directory for each dynamic environment. Thanks for any help until then.