salt: Multiple returns from minions when multi-master is used.

Greetings,

I have looked at a number of issues like this one, here and on other sites, and have not seen a true solution. Most of them focus on the different symptoms rather than the underlying cause, so none of the existing suggestions has really fixed the problem. We first saw this issue in Salt 2014.1.10 and continue to see it in 2015.8.x.

So the basic problem: When you have multi-master configured, per the Salt documentation, you can get into a situation where each Salt command generates multiple returns for random minions. A simple example of this;

jwells@saltsyndic:~$ salt \* test.ping
random-minion-001: true
random-minion-005: true
random-minion-005: true
random-minion-003: true
random-minion-001: true
random-minion-002: true

When we run the same command from our Salt M-O-M (Master-Of-Masters) cluster, this effect is consistent for the Salt Syndic / Master boxes and more random for the minions.

Our original structure was;

1 x Salt M-O-M (Running Ubuntu, Salt Master, Salt Minion) – CNAME saltmaster
2 x Salt Syndic / Master (Running Ubuntu, Salt Master, Salt Minion, and Salt Syndic) – CNAME salt
~750 Salt Minion (Running Ubuntu and Salt Minion)

The Salt Minion nodes were configured to connect to both of the Salt Syndic / Master nodes using a single DNS CNAME and when that didn’t work we converted to a VIP (Same result and fail-over was unstable);

master: salt

We found a discussion somewhere that indicated that the minion config needed to use two different names (Same result, though it failed over gracefully);

master:
  - salt-001
  - salt-002

In another discussion we saw that it was caused by multiple instances of Salt Syndic running, so we created a script to periodically kill all instances and restart it. Same result.

Another discussion stated that it was because we didn’t have master_id set, or because we had order_masters, or because we didn’t have it set… At that point, we went to a single Salt M-O-M, and a single Salt Syndic / Master at each site.

We recently started the process of upgrading to Salt 2015.8.x and, due to other issues, we decided to go back to the multi-master configuration. We see that this issue is still present. 😦

Our setup;

2 x Salt M-O-M (Running Ubuntu, Salt Master, Salt Minion, and Gluster to share /etc/salt/pki/master)
2 x Salt Syndic / Master (Running Ubuntu, Salt Master, Salt Minion, Salt Syndic, and Gluster to share /etc/salt/pki/master)
N x Salt Minion (Running Ubuntu, Salt Minion)

The Salt M-O-M boxes are both running the same versions / configurations, as verified by Salt and md5sum. The relevant portions of the Salt Master configurations are (minion_id == saltmom-001 or saltmom-002 respectively);

order_masters: true
master_id: <minion_id value>

Just for completeness, here is the relevant portion of the Salt Minion configuration;

master:
  - saltmom-001
  - saltmom-002

The Salt Syndic / Master are near identical to the Salt M-O-M configuration, but they have an extra entry for Syndic (minion_id == salt-001 or salt-002 respectively);

order_masters: true
master_id: <minion_id value>

syndic_master:
  - saltmom-001
  - saltmom-002

And the relevant portion of the Salt Minion configuration;

master:
  - salt-001
  - salt-002

With just this configuration in place, if I do a ‘salt * test.ping’, from either of the Salt M-O-M boxes, I would expect to get back something like the following;

jwells@saltmom-001:~$ salt \* test.ping --output=yaml
saltmom-001: true
saltmom-002: true
salt-001: true
salt-002: true

Instead what I get back is something like;

jwells@saltmom-001:~$ salt \* test.ping --output=yaml
saltmom-001: true
salt-001: true
saltmom-002: true
salt-001: true
salt-002: true
salt-002: true

And Salt Master logs on the Salt M-O-M show something like;

2015-11-11 11:25:12,535 [salt.utils.job   ][INFO    ][12496] Got return from salt-002 for job 20151111112511812434
2015-11-11 11:25:12,577 [salt.utils.job   ][INFO    ][12513] Got return from salt-002 for job 20151111112511812434
2015-11-11 11:25:12,579 [salt.loaded.int.returner.local_cache][ERROR   ][12513] An extra return was detected from minion salt-002, please verify the minion, this could be a replay attack
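
For anyone trying to measure how often this happens, here is a minimal sketch that scans master log lines in the format quoted above and reports every (job, minion) pair that returned more than once. The function name find_duplicate_returns and the regular expression are my own assumptions based only on the two "Got return from" lines shown here; other Salt versions may log differently.

```python
import re
from collections import Counter

# Matches the master log lines quoted above, e.g.
# "... Got return from salt-002 for job 20151111112511812434"
RETURN_RE = re.compile(r"Got return from (?P<minion>\S+) for job (?P<jid>\d+)")


def find_duplicate_returns(log_lines):
    """Return {(jid, minion): count} for every pair seen more than once."""
    counts = Counter()
    for line in log_lines:
        match = RETURN_RE.search(line)
        if match:
            counts[(match.group("jid"), match.group("minion"))] += 1
    return {key: n for key, n in counts.items() if n > 1}


if __name__ == "__main__":
    sample = [
        "2015-11-11 11:25:12,535 [salt.utils.job   ][INFO    ][12496] "
        "Got return from salt-002 for job 20151111112511812434",
        "2015-11-11 11:25:12,577 [salt.utils.job   ][INFO    ][12513] "
        "Got return from salt-002 for job 20151111112511812434",
    ]
    for (jid, minion), n in find_duplicate_returns(sample).items():
        print(f"{minion} returned {n} times for job {jid}")
```

Feeding it the lines of the master log file (for example via open(path)) should give a per-job picture of which minions are doubling up.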

Now, if we add minions to the Salt Syndic / Master boxes, we will get the same duplicate responses on the Salt M-O-M, but not if we call it from the Salt Syndic / Master boxes. For completeness, their relevant configuration is;

master:
  - salt-001
  - salt-002

When we look at the Salt Master logs on the Salt Syndic / Master, we see the following;

2015-11-11 11:25:12,535 [salt.minion      ][INFO    ][27256] Returning information for job: 20151111112511812434
2015-11-11 11:25:12,571 [salt.minion      ][INFO    ][27256] Returning information for job: 20151111112511812434

So the Salt Syndic is returning the correct information, but if we look at the other Salt Syndic / Master’s syndic log file, we find the same jobs being returned there too. So the Salt M-O-M gets two copies of the same return, marks one of them as “An extra return”, and still displays it in the output.
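
To make the suspected flow concrete, here is a toy model of it. This is not Salt code: the classes Minion and Syndic and the function publish_from_mom are hypothetical, and the model simply assumes that a minion attached to both syndics answers every delivery of a job, with each answer relayed back up to the M-O-M.

```python
class Minion:
    def __init__(self, name):
        self.name = name

    def run(self, jid):
        # Assumed behaviour: no memory of earlier jids, so every
        # delivery of the same job is executed and answered.
        return (self.name, jid, True)


class Syndic:
    def __init__(self, name, minions):
        self.name = name
        self.minions = minions

    def forward(self, jid):
        # Pass the job to every attached minion and relay the returns upward.
        return [minion.run(jid) for minion in self.minions]


def publish_from_mom(jid, syndics):
    """Publish one job through every syndic and collect all returns."""
    returns = []
    for syndic in syndics:
        returns.extend(syndic.forward(jid))
    return returns


if __name__ == "__main__":
    shared = Minion("random-minion-005")  # connected to both syndics
    syndics = [Syndic("salt-001", [shared]), Syndic("salt-002", [shared])]
    for name, jid, result in publish_from_mom("20151111112511812434", syndics):
        print(f"{name}: {result}")
```

In this model a single publish yields one return per syndic path, which matches the doubled entries we see on the M-O-M. It does not explain the occasional "Minion did not return" entries.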

Finally, I should point out that this is not completely consistent. We do get the occasional entry like this;

jwells@saltmom-001:~$ salt \* test.ping --output=yaml
saltmom-001: true
salt-001: Minion did not return. [Not connected]
saltmom-002: true
salt-001: true
salt-002: true
salt-002: true

All of the nodes in the new environment have the same versions of Ubuntu packages, Python, and Salt;

jwells@saltmom:~$ sudo salt-run --versions-report
Salt Version:
             Salt: 2015.8.0

Dependency Versions:
         Jinja2: 2.7.2
       M2Crypto: Not Installed
           Mako: 0.9.1
         PyYAML: 3.11
          PyZMQ: 14.4.0
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.4
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.3.0
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: Not Installed
        timelib: Not Installed

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-66-generic
         system: Ubuntu 14.04 trusty

About this issue

State: closed
Created 9 years ago
Comments: 17 (10 by maintainers)

Commits related to this issue

Add some unit tests for the jid_queue functionality in minion.py Refs #35172 and #28785 — committed to rallytime/salt by deleted user 8 years ago

Most upvoted comments

Yes. The issue is that the minion doesn’t keep track of jids it has already executed. It just executes a job every time it receives one. So when you pub a job from the master of masters through two syndic masters, with a minion underneath connected to both, the minion receives two jobs, and so it runs and returns two jobs.

This is a fairly common use case, so we need to at some point add this filtering mechanism, I think. I’m pretty sure we already have an issue open for it somewhere, but I can’t find it off the top of my head.

basepi on Nov 18, 2015
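
The filtering mechanism described in the comment above could look roughly like the following. This is only a sketch of the idea, not Salt's implementation: the related commit title mentions a jid_queue in minion.py, but the class name JidFilter, the method should_run, the default size, and the bounded-queue design here are all assumptions made for illustration.

```python
from collections import deque


class JidFilter:
    """Remember recently seen job IDs and flag repeats.

    Hypothetical helper for illustration; the name, the default size and
    the bounded-queue design are assumptions, not Salt's implementation.
    """

    def __init__(self, max_size=1000):
        self._order = deque()   # jids in arrival order, oldest first
        self._seen = set()      # same jids, for O(1) membership checks
        self._max_size = max_size

    def should_run(self, jid):
        """Return True the first time a jid is seen, False on repeats."""
        if jid in self._seen:
            return False
        self._seen.add(jid)
        self._order.append(jid)
        # Forget the oldest jid once the bound is exceeded, so memory
        # use stays constant on a long-running minion.
        if len(self._order) > self._max_size:
            self._seen.discard(self._order.popleft())
        return True


if __name__ == "__main__":
    jid_filter = JidFilter()
    for jid in ["20151111112511812434", "20151111112511812434"]:
        action = "run" if jid_filter.should_run(jid) else "skip duplicate"
        print(jid, action)
```

A minion that consulted such a filter before executing would run a job once even if it arrived through both syndics. The bound is a trade-off: a jid that has been evicted would be executed again if it were redelivered later.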