salt: salt exit codes
Is there a reason why test.* returns False but exits with bash exit code 0 instead of 1?
~ # salt some-minion file.access /var/run/reboot-required f; echo $?
some-minion:
    False
0
~ # salt some-other-minion file.access /var/run/reboot-required f; echo $?
some-other-minion:
    True
0
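Until the CLI reports this itself, a wrapper script can make the decision. The following is a minimal sketch, not part of Salt. It assumes the command is run with `--out=json --static`, so that all minion returns arrive as one JSON object mapping minion IDs to return values. The helper names `failed_minions` and `main` are hypothetical. For file.access, False is a legitimate answer ("the file does not exist"), so treating it as a failure is a policy choice of this script.

```python
import json
import subprocess
import sys


def failed_minions(salt_json: str) -> list[str]:
    """Return the IDs of minions whose return value is exactly False.

    Assumes `salt_json` is the output of `salt ... --out=json --static`,
    i.e. one JSON object mapping minion ID -> return value.
    Only an exact False counts; any other value is treated as success here.
    """
    returns = json.loads(salt_json)
    return sorted(minion for minion, ret in returns.items() if ret is False)


def main(target: str, path: str) -> int:
    # Same check as in the question, but with machine-readable output.
    proc = subprocess.run(
        ["salt", target, "file.access", path, "f", "--out=json", "--static"],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        return proc.returncode  # salt itself failed; pass its code through
    failed = failed_minions(proc.stdout)
    for minion in failed:
        print(f"{minion}: False", file=sys.stderr)
    return 1 if failed else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```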
About this issue
- State: closed
- Created 10 years ago
- Reactions: 18
- Comments: 60 (39 by maintainers)
+1
I’ve just raised/tested a few other similar issues that are particularly important to me. Searching for “zero exit” returns a list of issues about the CLI not indicating errors.
Lack of usability
It effectively makes automation around Salt commands (the very reason Salt is used in the first place) very inconvenient. And the problem seems pervasive in Salt (many CLI commands, many functions called through the CLI, etc.).
For example, the following are some use cases that require custom scripts to analyze Salt output just to get a Failed/Succeeded result:
+1 This issue has caused great pain in our normal deployment process. It has actually caused a few production live-site issues already. We have to apply other ad-hoc detection of the deployment result, which should be unnecessary. For state.highstate we at least have a succeeded/failed summary, but for others such as state.apply or cmd.run we have to review all the output and check whether the command really succeeded.
I can’t believe this is treated as low priority. It was opened 2 years ago and is still open.
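For state runs, the per-state `result` field in the JSON output already carries the success information, so a script can derive pass/fail from it. This is a minimal sketch under stated assumptions: the input is the output of `salt ... state.apply --out=json --static`, and each minion maps either to a dict of state IDs whose values contain a `result` key, or to a list of error strings when the states could not be rendered. `state_failures` is a hypothetical helper name.

```python
import json


def state_failures(salt_json: str) -> dict[str, list[str]]:
    """Map each failing minion ID to the state IDs (or errors) that failed.

    Assumes `salt_json` comes from `salt ... state.apply --out=json --static`.
    """
    failures: dict[str, list[str]] = {}
    for minion, ret in json.loads(salt_json).items():
        if isinstance(ret, list):
            # Render/compile errors come back as a list of messages, not a dict.
            failures[minion] = [str(msg) for msg in ret]
        elif isinstance(ret, dict):
            # result is True on success, False on failure, and None in test
            # mode when changes would be made; only False counts as failure.
            bad = [sid for sid, st in ret.items()
                   if isinstance(st, dict) and st.get("result") is False]
            if bad:
                failures[minion] = sorted(bad)
        else:
            # Any other shape (e.g. a plain string) is unexpected; flag it.
            failures[minion] = [str(ret)]
    return failures
```

A wrapper can then exit with `1 if state_failures(stdout) else 0`.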
Using the Salt version below, the issue with the incorrect exit code is still present when you add the --batch-size and --batch-wait parameters:
sudo salt -v -L minion-1,minion-2 --batch-size 50% --batch-wait 1 state.apply queue=True test-state
Even though there is an error, salt’s exit status is 0.
@saltstack/team-core Since Fluorine was released, this can be closed.
I would expect a range of retcodes for the specific tool and another range for failures in anything the tool manages. For example, 0-127 might represent the status of the specific tool (viz. salt), while subprocesses/managed devices would have their retcode automatically passed through as retcode+128. That way, if all you care about is pass/fail, you get it without the --retcode-passthrough option. If you want to distinguish between a failure in the invoked tool and a failure in managed devices, you could test which range the retcode falls in. One thing to keep in mind is that shells already report a process killed by a signal as 128+SIGNUM, which overlaps this range. Also, on POSIX the exit status is often only 8 bits rather than a full int, although newer calls that use waitid() have access to the full int (which covers most software by now). I have no idea what Windows does.
We could certainly compress to a narrow subrange just for minion retcode pass-through.
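The proposed split can be made concrete with a small encode/decode pair. This is a hypothetical sketch of the idea discussed above, not Salt’s actual behavior: codes 0..127 describe the CLI itself, and 128..255 carry a minion retcode compressed into 7 bits so that the total fits in an 8-bit POSIX exit status. The names `encode_exit`, `decode_exit`, `TOOL_MAX` and `PASSTHROUGH_BASE` are all made up for illustration.

```python
TOOL_MAX = 127          # hypothetical: codes 0..127 describe the salt CLI itself
PASSTHROUGH_BASE = 128  # hypothetical: codes 128..255 carry a minion retcode


def encode_exit(tool_status: int, minion_retcode: int) -> int:
    """Combine the CLI's own status and a minion retcode into one 8-bit code."""
    if not 0 <= tool_status <= TOOL_MAX:
        raise ValueError("tool_status must be in 0..127")
    if tool_status != 0:
        return tool_status  # a failure of the tool itself takes precedence
    if minion_retcode == 0:
        return 0
    # Compress into 1..127: negative codes become 1, large codes are clamped.
    return PASSTHROUGH_BASE + min(max(minion_retcode, 1), TOOL_MAX)


def decode_exit(code: int) -> tuple[str, int]:
    """Split an exit code back into ("ok" | "tool" | "minion", value)."""
    if code == 0:
        return ("ok", 0)
    if code >= PASSTHROUGH_BASE:
        return ("minion", code - PASSTHROUGH_BASE)
    return ("tool", code)
```

The signal caveat shows up directly here: a salt process killed by SIGKILL is reported by the shell as 137, which `decode_exit` would misread as minion retcode 9.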
+1 How come there is still no way of knowing the exit status of a remote command (cmd.run)? salt-run jobs.print_job [jid] doesn’t return anything status-related… 😦
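One way to get at the remote status is to call `cmd.run_all` instead of `cmd.run`. `cmd.run_all` returns a dict containing `pid`, `retcode`, `stdout` and `stderr` for each minion. The sketch below assumes the output of `salt ... cmd.run_all '<command>' --out=json --static`. The helper name `nonzero_retcodes` and the `-1` sentinel for unusable returns are hypothetical choices.

```python
import json


def nonzero_retcodes(salt_json: str) -> dict[str, int]:
    """Return {minion: retcode} for every minion whose command failed.

    Assumes `salt_json` is the output of
    `salt ... cmd.run_all '<command>' --out=json --static`.
    """
    bad: dict[str, int] = {}
    for minion, ret in json.loads(salt_json).items():
        if not isinstance(ret, dict) or "retcode" not in ret:
            bad[minion] = -1  # sentinel: no usable return (e.g. no answer)
        elif ret["retcode"] != 0:
            bad[minion] = ret["retcode"]
    return bad
```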
@oliver-dungey As explained above, the retcode changes are in the Fluorine release (2019.2.0).
I think this is expected, as far as I know. To get the status code of an operation executed by salt-call, you have to use --retcode-passthrough (if nothing has changed since).
Wait, what is the current status of retcodes for salt operations? It’s been 4 years since the issue was created…
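With salt-call, `--retcode-passthrough` makes the process exit with the retcode of the executed function instead of the salt-call binary’s own retcode, so a calling script can rely on the process status alone. This minimal sketch runs a command and returns its exit status unchanged. The helper name `run_checked` is hypothetical, and the `state.apply` invocation is only an example target.

```python
import subprocess
import sys


def run_checked(cmd: list[str]) -> int:
    """Run `cmd`, echo its output, and return its exit status unchanged."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    sys.stdout.write(proc.stdout)
    sys.stderr.write(proc.stderr)
    return proc.returncode


if __name__ == "__main__":
    # Intended use on a minion: salt-call --retcode-passthrough state.apply
    status = run_checked(["salt-call", "--retcode-passthrough", "state.apply"])
    print("OK" if status == 0 else f"FAILED (exit {status})")
    sys.exit(status)
```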
@meggiebot Please include me when this is added to a sprint for Carbon. I would like to coordinate these changes with salt-api and Salt Enterprise so that all three are on the same page going forward.
+1
This is a really serious compliance issue. How can I make sure, without parsing output, that all my states applied correctly or will apply correctly?
Not addressing this as a top priority is pure negligence. How do you think salt* usage would be scripted without this bug being resolved? How do you test your executables if your exit codes have no meaning??
It is also important to expose this through salt-api. Here is the output from a 2015.8.8 salt command invocation of cmd.run 'false 1':
The API, however, does not include the necessary info when sending to the / endpoint:
When you look up the job via /jobs/JID, you get slightly more info, but still nothing useful for determining the exit status.
The same issue occurs with the service execution module. Here is an example of a service failing to restart: