salt: salt exit codes

Is there a reason why test.* returns False yet salt exits with bash exit code 0 instead of 1?

~ # salt some-minion file.access /var/run/reboot-required f; echo $?
some-minion:
    False
0

~ # salt some-other-minion file.access /var/run/reboot-required f; echo $?
some-other-minion:
    True
0

About this issue

  • State: closed
  • Created 10 years ago
  • Reactions: 18
  • Comments: 60 (39 by maintainers)

Most upvoted comments

+1

I’ve just raised/tested a few other similar issues that are particularly important to me. Searching for “zero exit” gives a list of issues related to the lack of error indication from the CLI.

Lack of usability

It effectively makes automation around Salt commands (the very reason Salt is used in the first place) very inconvenient. And it seems pervasive across Salt: many CLI commands, many functions called through the CLI, etc.

For example, the following are some use cases which currently require developing custom scripts to analyze Salt output just to get a Failed/Succeeded result (see the sketch after this list):

  • VM provisioning tools (such as Vagrant, with any of its providers) which use Salt to configure a new instance won’t detect failures.
  • Any Jenkins (or other CI platform) jobs which apply configuration or run other Salt jobs won’t detect failures.
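
Here is the kind of ad-hoc wrapper this forces people to write (a minimal sketch, assuming GNU grep and a hypothetical 'web*' target): it derives a usable exit code from the printed highstate summary, since salt itself exits 0.

#!/bin/sh
# Run a highstate and fail if the per-minion summary reports any
# failed states, or if the SLS failed to render (in which case no
# summary is printed at all); salt's own exit code can't be trusted.
out=$(salt 'web*' state.highstate 2>&1)
printf '%s\n' "$out"
if printf '%s\n' "$out" | grep -qE '^Failed: *[1-9]|Data failed to compile'; then
    exit 1
fi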

+1 This issue has caused great pain in our normal deployment process; it has already caused a few production live-site incidents. We have to apply ad-hoc detection of the deployment result, which should be unnecessary. For state.highstate we at least have the succeeded/failed summary, but for others such as state.apply or cmd.run we have to review all the output and check whether the command really succeeded or not.

I can’t believe this is treated as low priority. It was opened 2 years ago and is still open.

Using the Salt version below, the issue with the incorrect exit code is still present when you add the --batch-size and --batch-wait parameters:

Salt Version:
           Salt: 2019.2.0
 
Dependency Versions:
           cffi: 1.6.0
       cherrypy: unknown
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: 0.26.3
        libnacl: Not Installed
       M2Crypto: 0.31.0
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: 3.7.3
         pygit2: 0.26.4
         Python: 2.7.5 (default, Apr  9 2019, 14:30:50)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-957.5.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core

sudo salt -v -L minion-1,minion-2 --batch-size 50% --batch-wait 1 state.apply queue=True test-state

Even though there is an error:

jid:
    20190613124535690469
retcode:
    1
gc-euw1-salt-1:
    Data failed to compile:
----------
    Rendering SLS 'base:test-state' failed: Jinja syntax error: expected token 'end of statement block', got 'string'; line 1

the salt exit status is still 0:

echo $?
0

@saltstack/team-core Since Fluorine was released, this can be closed.

I would expect one range of retcodes for the specific tool and another range for failures in anything the tool manages. For example, 0-127 might represent status for the specific tool (viz. salt), while subprocesses/managed devices would get an automatic passthrough of their retcode+128. That way, if all you care about is pass/fail, you get it without the --retcode-passthrough option; if you want to distinguish a failure in the invoked tool from one in the managed devices, you test which range the retcode falls in. One thing to keep in mind is the 128+SIGNUM convention for processes killed by a signal. On POSIX the exit status is often only 8 bits rather than the full int, although newer calls that use waitid() have access to the full int (which most software uses by now).

I have no idea what Windows does.

We could certainly compress to a narrow subrange just for minion retcode pass-through.
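
The 8-bit point is easy to demonstrate from a shell, since the status a parent retrieves via wait() is only the low byte, and deaths by signal surface as 128+SIGNUM:

~ # sh -c 'exit 300'; echo $?
44
~ # sh -c 'kill -TERM $$'; echo $?
143

44 is 300 % 256, and 143 is 128+15 (SIGTERM).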

+1 How come there is still no way of knowing the exit status of a remote command (cmd.run)? salt-run jobs.print_job [jid] doesn’t return anything status-related… 😦
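
(A partial workaround, at least for shell commands: cmd.run_all returns a dictionary that includes the command’s retcode, so the status is at least visible in the job return. Illustrative output below; the pid is made up:)

~ # salt 'some-minion' cmd.run_all 'false'
some-minion:
    ----------
    pid:
        12345
    retcode:
        1
    stderr:
    stdout: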

@oliver-dungey As explained above, the retcode changes are in the Fluorine release (2019.2.0).

I think this is expected, as far as I know. To get the status code from an operation executed by salt-call you have to use --retcode-passthrough, unless something has changed.
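
For example, reusing the broken test-state from the report above (a sketch; the exact non-zero value depends on the failure, here the rendering error yields 1):

~ # salt-call state.apply test-state --retcode-passthrough > /dev/null 2>&1; echo $?
1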

Wait, what is the current status of retcodes for salt operations? It’s been 4 years since the issue was created…

@meggiebot Please include me when this is added to a sprint for Carbon. I would like to coordinate these changes with salt-api and Salt Enterprise so that all three are on the same page going forward.

This is a really serious compliance issue. How can I make sure, without parsing output, that all my states applied correctly or will apply correctly?

This not being addressed as a top priority is pure negligence. How is salt* usage supposed to be scripted without this bug being resolved? How do you test your executables if your exit codes don’t have a meaning?

It is also important to expose this to the salt-api. Here is the output from a 2015.8.8 salt command invocation of cmd.run 'false 1':

# salt 'web1' cmd.run 'false 1'
web1:
ERROR: Minions returned with non-zero exit code

While the API does not include the necessary info when sending to the / endpoint:

{
    "return": [
        {
            "web1": ""
        }
    ]
}

When you lookup the job via /jobs/JID you get slightly more info, but still nothing useful for determining exit status.

{
    "info": [
        {
            "Arguments": [
                "false 1"
            ],
            "Function": "cmd.run",
            "Minions": [
                "web1"
            ],
            "Result": {
                "web1": {
                    "return": ""
                }
            },
            "StartTime": "2016, Apr 11 15:59:42.643646",
            "Target": "web1",
            "Target-type": "glob",
            "User": "jenkins",
            "jid": "20160411155942643646"
        }
    ],
    "return": [
        {
            "web1": ""
        }
    ]
}

And this same issue occurs with the service execution module. Here is an example of a service failing to restart:

{
    "info": [
        {
            "Arguments": [
                "klsdklsd"
            ],
            "Function": "service.restart",
            "Minions": [
                "web1"
            ],
            "Result": {
                "web1": {
                    "return": false
                }
            },
            "StartTime": "2016, May 27 14:08:28.970138",
            "Target": "web1",
            "Target-type": "glob",
            "User": "jenkins",
            "jid": "20160527140828970138"
        }
    ],
    "return": [
        {
            "web1": false
        }
    ]
}
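
(For the API side, newer Salt releases accept a full_return flag in the lowstate, which to my knowledge is passed through to the client and wraps each minion’s return together with its retcode. A hedged sketch against rest_cherrypy, with a hypothetical master URL and auth token:)

curl -sSk https://salt-master:8000 \
    -H 'Accept: application/json' \
    -H 'X-Auth-Token: <token>' \
    -d client=local \
    -d tgt=web1 \
    -d fun=cmd.run \
    -d arg='false 1' \
    -d full_return=true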