salt: salt exit codes

Is there a reason why test.* returns False yet salt exits with bash exit code 0 instead of 1?

~ # salt some-minion file.access /var/run/reboot-required f; echo $?
some-minion:
    False
0

~ # salt some-other-minion file.access /var/run/reboot-required f; echo $?
some-other-minion:
    True
0

About this issue

  • State: closed
  • Created 10 years ago
  • Reactions: 18
  • Comments: 60 (39 by maintainers)

Most upvoted comments

+1

I’ve just raised/tested a few other similar issues that are particularly important to me. Searching for “zero exit” gives a list of issues related to the lack of error indication from the CLI.

Lack of usability

It effectively makes automation around Salt commands (the very reason Salt is used in the first place) very inconvenient. And it seems pervasive across Salt: many CLI commands, many functions called through the CLI, etc.

For example, the following are some use cases which currently require developing custom scripts to analyze Salt output just to get a Failed/Succeeded result (see the sketch after this list):

  • VM provisioning tools (such as Vagrant, with any of its providers) which use Salt to configure a new instance won’t detect failures.
  • Any Jenkins (or other CI platform) jobs which apply configuration or run other Salt jobs won’t detect failures.
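
Here is the kind of ad-hoc wrapper this forces people to write (a minimal sketch, assuming GNU grep and a hypothetical 'web*' target): it derives a usable exit code from the printed highstate summary, since salt itself exits 0.

#!/bin/sh
# Run a highstate and fail if the per-minion summary reports any
# failed states, or if the SLS failed to render (in which case no
# summary is printed at all); salt's own exit code can't be trusted.
out=$(salt 'web*' state.highstate 2>&1)
printf '%s\n' "$out"
if printf '%s\n' "$out" | grep -qE '^Failed: *[1-9]|Data failed to compile'; then
    exit 1
fi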

+1 This issue has caused great pain in our normal deployment process; it has already caused a few production live-site incidents. We have to apply ad-hoc detection of the deployment result, which should be unnecessary. For state.highstate we at least have the succeeded/failed summary, but for others such as state.apply or cmd.run we have to review all the output and check whether the command really succeeded or not.

I can’t believe this is treated as low priority. It was opened 2 years ago and is still open.

Using the Salt version below, the issue with the incorrect exit code is still present when you add the --batch-size and --batch-wait parameters:

Salt Version:
           Salt: 2019.2.0
 
Dependency Versions:
           cffi: 1.6.0
       cherrypy: unknown
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: 0.26.3
        libnacl: Not Installed
       M2Crypto: 0.31.0
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: 3.7.3
         pygit2: 0.26.4
         Python: 2.7.5 (default, Apr  9 2019, 14:30:50)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-957.5.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core

sudo salt -v -L minion-1,minion-2 --batch-size 50% --batch-wait 1 state.apply queue=True test-state

Even though there is an error:

jid:
    20190613124535690469
retcode:
    1
gc-euw1-salt-1:
    Data failed to compile:
----------
    Rendering SLS 'base:test-state' failed: Jinja syntax error: expected token 'end of statement block', got 'string'; line 1

the salt exit status is still 0:

echo $?
0

@saltstack/team-core Since Fluorine was released, this can be closed.

I would expect one range of retcodes for the specific tool and another range for failures in anything the tool manages. For example, 0-127 might represent status for the specific tool (viz. salt), while subprocesses/managed devices would get an automatic passthrough of their retcode+128. That way, if all you care about is pass/fail, you get it without the --retcode-passthrough option; if you want to distinguish a failure in the invoked tool from one in the managed devices, you test which range the retcode falls in. One thing to keep in mind is the 128+SIGNUM convention for processes killed by a signal. On POSIX the exit status is often only 8 bits rather than the full int, although newer calls that use waitid() have access to the full int (which most software uses by now).

I have no idea what Windows does.

We could certainly compress to a narrow subrange just for minion retcode pass-through.
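
The 8-bit point is easy to demonstrate from a shell, since the status a parent retrieves via wait() is only the low byte, and deaths by signal surface as 128+SIGNUM:

~ # sh -c 'exit 300'; echo $?
44
~ # sh -c 'kill -TERM $$'; echo $?
143

44 is 300 % 256, and 143 is 128+15 (SIGTERM).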

+1 How come there is still no way of knowing the exit status of a remote command (cmd.run)? salt-run jobs.print_job [jid] doesn’t return anything status-related… 😦
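
(A partial workaround, at least for shell commands: cmd.run_all returns a dictionary that includes the command’s retcode, so the status is at least visible in the job return. Illustrative output below; the pid is made up:)

~ # salt 'some-minion' cmd.run_all 'false'
some-minion:
    ----------
    pid:
        12345
    retcode:
        1
    stderr:
    stdout: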

@oliver-dungey As explained above, the retcode changes are in the Fluorine release (2019.2.0).

I think this is expected, as far as I know. To get the status code from an operation executed by salt-call you have to use --retcode-passthrough, unless something has changed.
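
For example, reusing the broken test-state from the report above (a sketch; the exact non-zero value depends on the failure, here the rendering error yields 1):

~ # salt-call state.apply test-state --retcode-passthrough > /dev/null 2>&1; echo $?
1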

Wait, what is the current status of retcodes for salt operations? It’s been 4 years since the issue was created…

@meggiebot Please include me when this is added to a sprint for Carbon. I would like to coordinate these changes with salt-api and Salt Enterprise so that all three are on the same page going forward.

This is a really serious compliance issue. How can I make sure, without parsing output, that all my states applied correctly or will apply correctly?

This not being addressed as a top priority is pure negligence. How is salt* usage supposed to be scripted without this bug being resolved? How do you test your executables if your exit codes don’t have a meaning?

It is also important to expose this to the salt-api. Here is the output from a 2015.8.8 salt command invocation of cmd.run 'false 1':

# salt 'web1' cmd.run 'false 1'
web1:
ERROR: Minions returned with non-zero exit code

While the API does not include the necessary info when sending to the / endpoint:

{
    "return": [
        {
            "web1": ""
        }
    ]
}

When you lookup the job via /jobs/JID you get slightly more info, but still nothing useful for determining exit status.

{
    "info": [
        {
            "Arguments": [
                "false 1"
            ],
            "Function": "cmd.run",
            "Minions": [
                "web1"
            ],
            "Result": {
                "web1": {
                    "return": ""
                }
            },
            "StartTime": "2016, Apr 11 15:59:42.643646",
            "Target": "web1",
            "Target-type": "glob",
            "User": "jenkins",
            "jid": "20160411155942643646"
        }
    ],
    "return": [
        {
            "web1": ""
        }
    ]
}

And this same issue occurs with the service execution module. Here is an example of a service failing to restart:

{
    "info": [
        {
            "Arguments": [
                "klsdklsd"
            ],
            "Function": "service.restart",
            "Minions": [
                "web1"
            ],
            "Result": {
                "web1": {
                    "return": false
                }
            },
            "StartTime": "2016, May 27 14:08:28.970138",
            "Target": "web1",
            "Target-type": "glob",
            "User": "jenkins",
            "jid": "20160527140828970138"
        }
    ],
    "return": [
        {
            "web1": false
        }
    ]
}
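
(For the API side, newer Salt releases accept a full_return flag in the lowstate, which to my knowledge is passed through to the client and wraps each minion’s return together with its retcode. A hedged sketch against rest_cherrypy, with a hypothetical master URL and auth token:)

curl -sSk https://salt-master:8000 \
    -H 'Accept: application/json' \
    -H 'X-Auth-Token: <token>' \
    -d client=local \
    -d tgt=web1 \
    -d fun=cmd.run \
    -d arg='false 1' \
    -d full_return=true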