salt: salt-run state.orchestrate fails because it tries to run a second task while a first is ongoing

To start off with, this problem seems somehow related to disk speed. We can reliably reproduce the problem on SATA-backed virtual machines, but not on SSD-backed virtual machines (two different OpenStack flavors on otherwise identical hardware). This is probably because the task takes longer to execute on a SATA disk; there’s a lot of deb installing going on, so that would make sense.

This is the relevant part of the output:

# salt-run state.orchestrate orchestration.sls
awingu_master:
----------
          ID: common
    Function: salt.state
      Result: True
     Comment: States ran successfully.
     Started: 12:50:39.093757
    Duration: 63217.649 ms
     Changes:   
----------
          ID: dns_server
    Function: salt.state
      Result: False
     Comment: Run failed on minions: awingu
              Failures:
                  awingu:
                      Data failed to compile:
                  ----------
                      The function "state.sls" is running as PID 6334 and was started at 2014, Nov 28 12:50:39.102732 with jid 20141128125039102732
     Started: 12:51:42.312406
    Duration: 630.701 ms

The error doesn’t always occur in that part of the orchestration; sometimes it happens in a later task, but always right after a task which takes a long time to run (several minutes).

When I check right after the error, I can see that the referenced job is in fact still running; issuing salt 'awingu' saltutil.running confirms this. The job will, in time, do its thing and finish properly. This job runs the “common” task preceding the dns_server task (which the dns_server task requires!).

So it’s pretty clear that the orchestrate runner deems the “common” task finished, while in reality it has not.

It looks like a salt execution sometimes returns prematurely, before the task has actually finished (I’ve noted this on one or two occasions), but apart from that being a bug, I would expect the salt runner to be smarter and check the job queue?
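
Something along these lines is roughly what I mean by “check the job queue” (just an untested sketch; wait_for_jid is a hypothetical helper built on salt.client.LocalClient and saltutil.find_job, not anything Salt ships):

# wait_for_jid.py - rough sketch, not a tested workaround
import time

import salt.client


def wait_for_jid(tgt, jid, poll=5, timeout=1800):
    '''Poll saltutil.find_job until the given jid is no longer running.'''
    client = salt.client.LocalClient()
    deadline = time.time() + timeout
    while time.time() < deadline:
        ret = client.cmd(tgt, 'saltutil.find_job', [jid])
        # find_job returns an empty dict per minion once the job has finished
        if ret and all(not data for data in ret.values()):
            return True
        time.sleep(poll)
    return False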

For us this is currently a major issue, so any recommendations on how to handle it would be extremely welcome. Even a dirty monkeypatch to apply on top of 2014.7.0 would mean the world to us!

About this issue

  • State: open
  • Created 10 years ago
  • Comments: 19 (12 by maintainers)

Most upvoted comments

ZD-4570

If the issue is related to a timeout in ‘saltutil.find_job’, it can be worked around by increasing ‘gather_job_timeout’ in the master config. We saw a similar issue in a virtual environment, where traversing the salt minion cache hangs in ‘find_job’. However, it also looks like a defect in the client/saltutil/orchestration code, because SaltStack considers the state.sls execution to have succeeded when it actually failed due to a timeout in ‘find_job’.
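
For example (the value here is only illustrative; pick one longer than your slowest ‘find_job’ round trip):

# /etc/salt/master
gather_job_timeout: 60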

I can reproduce the issue with a simple orchestration state:

# reproduce.sls
do-sleep-1:
  salt.state:
    - tgt: 'minion01'
    - sls:
      - sleep_120


do-sleep-2:
  salt.state:
    - tgt: 'minion01'
    - sls:
      - sleep_120
    - require:
      - salt: do-sleep-1

# sleep_120.sls
do sleep 120:
  cmd.run:
    - name: sleep 120s

I also hacked ‘find_job’ in the saltutil.py module to emulate a slow execution:

diff --git a/salt/modules/saltutil.py b/salt/modules/saltutil.py
index c0f0f2b..4ba4aa9 100644
--- a/salt/modules/saltutil.py
+++ b/salt/modules/saltutil.py
@@ -814,6 +814,9 @@ def find_job(jid):
     '''
     for data in running():
         if data['jid'] == jid:
+            if data['fun'] == 'state.sls':
+                import time
+                time.sleep(30)
             return data
     return {}

@@ -1004,6 +1007,8 @@ def _exec(client, tgt, fun, arg, timeout, expr_form, ret, kwarg, **kwargs):
             # do not wait for timeout when explicit list matching
             # and all results are there
             break
+
+    log.error("_exec(tgt=%s, fun=%s, arg=%s, timeout=%s) ret: %s", tgt, fun, arg, timeout, fcn_ret)
     return fcn_ret

Orchestration output:

root@salt:/srv/salt# salt-run state.orchestrate reproduce
[ERROR   ] _exec(tgt=minion01, fun=state.sls, arg=['sleep_120'], timeout=None) ret: {}
[ERROR   ] _exec(tgt=minion01, fun=state.sls, arg=['sleep_120'], timeout=None) ret: {'minion01': {'retcode': 1, 'ret': ['The function "state.sls" is running as PID 12453 and was started at 2016, Jun 30 15:51:43.751796 with jid 20160630155143751796'], 'out': 'highstate'}}
[ERROR   ] Run failed on minions: minion01
Failures:
    minion01:
        Data failed to compile:
    ----------
        The function "state.sls" is running as PID 12453 and was started at 2016, Jun 30 15:51:43.751796 with jid 20160630155143751796

salt_master:
----------
          ID: do-sleep-1
    Function: salt.state
      Result: True
     Comment: States ran successfully.
     Started: 15:51:43.688091
    Duration: 15310.313 ms
     Changes:   
----------
          ID: do-sleep-2
    Function: salt.state
      Result: False
     Comment: Run failed on minions: minion01
              Failures:
                  minion01:
                      Data failed to compile:
                  ----------
                      The function "state.sls" is running as PID 12453 and was started at 2016, Jun 30 15:51:43.751796 with jid 20160630155143751796
     Started: 15:51:58.998706
    Duration: 206.86 ms
     Changes:   

Summary for salt_master
------------
Succeeded: 1
Failed:    1
------------
Total states run:     2

As you can see, ‘saltutil._exec’ returns an empty (or incomplete) dictionary when ‘find_job’ times out.

I don’t know whether this is a bug in the client code or whether the orchestration handles such a result incorrectly, but it definitely does not look like expected behavior. Perhaps some of you can help me prepare a patch to fix this?
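
As a starting point, this is the kind of guard I have in mind (a sketch only, untested, and not the actual Salt code paths): retry when ‘_exec’ comes back empty instead of treating the empty dict as a finished run.

# sketch: retry the _exec call when it returns an empty dict, which (per the
# log above) is what happens when find_job hits the gather timeout
import time


def exec_with_retry(exec_fn, args, kwargs, retries=3, delay=10):
    ret = {}
    for _ in range(retries):
        ret = exec_fn(*args, **kwargs)
        if ret:  # non-empty: the minion actually returned data
            return ret
        time.sleep(delay)  # empty: the find_job poll likely timed out, try again
    return ret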

root@salt:/srv/salt# salt 'minion01' test.versions_report
minion01:
    Salt Version:
               Salt: 2016.3.1

    Dependency Versions:
               cffi: Not Installed
           cherrypy: Not Installed
           dateutil: Not Installed
              gitdb: Not Installed
          gitpython: Not Installed
              ioflo: Not Installed
             Jinja2: 2.7.2
            libgit2: Not Installed
            libnacl: Not Installed
           M2Crypto: Not Installed
               Mako: Not Installed
       msgpack-pure: Not Installed
     msgpack-python: 0.4.6
       mysql-python: Not Installed
          pycparser: Not Installed
           pycrypto: 2.6.1
             pygit2: Not Installed
             Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
       python-gnupg: Not Installed
             PyYAML: 3.11
              PyZMQ: 14.7.0
               RAET: Not Installed
              smmap: Not Installed
            timelib: Not Installed
            Tornado: 4.2.1
                ZMQ: 4.0.5

    System Versions:
               dist: centos 7.2.1511 Core
            machine: x86_64
            release: 3.10.0-327.22.2.el7.x86_64
             system: Linux
            version: CentOS Linux 7.2.1511 Core