elvish: SIGPIPE should not be considered a failure for data-producers in a pipeline
This is perhaps debatable, and one could argue that it’s a bug in most Unix commands that currently exist, but I think SIGPIPE should not be considered an error.
The rationale is this: It’s not uncommon to create pipelines that will terminate one or more of their commands with SIGPIPE - and typically this does not mean that anything has failed, but rather that one of the processes in the pipeline has stopped consuming values, and the program that was producing values for that process terminated as a result. I think it’s fair to argue that these producers should not return an error status in this case, but the usual (default, in fact!) behavior is to terminate on an unhandled SIGPIPE, or catch the signal and terminate with the equivalent error code.
So for instance, this produces a SIGPIPE:
`e:sort --random-sort /usr/share/dict/american-english | e:head -n 10` (produce a list of random words)
“head” reads the first 10 lines of input and then terminates, breaking the pipe. “sort” then receives SIGPIPE (or possibly equivalent information via another means, depending on the implementation) the next time it writes to its output. Being unable to write out more data, it terminates. This is not an “exception”al condition, but rather a fundamental part of how stream processing works in shell pipelines.
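The same shutdown sequence can be observed directly with a small Python sketch (a demonstration on a Unix system; `yes` and `head` stand in for the producer and consumer, since `yes` is guaranteed to outlive `head`):

```python
import signal
import subprocess

# "yes" writes lines forever; "head" reads one line and exits, closing the
# read end of the pipe. The next write by "yes" then raises SIGPIPE.
p1 = subprocess.Popen(["yes"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["head", "-n", "1"], stdin=p1.stdout,
                      stdout=subprocess.DEVNULL)
p1.stdout.close()  # drop the parent's copy of the read end
p2.wait()
p1.wait()

# In the subprocess API, a negative return code means "killed by that signal".
print(p1.returncode == -signal.SIGPIPE)  # True: the producer died of SIGPIPE
print(p2.returncode)                     # 0: head exited normally
```

Note that the consumer exits with status 0; only the producer carries the SIGPIPE status, which is exactly the case under discussion.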
On the flip side of this argument: a “graceful” shutdown of a pipeline is not the only condition that can produce a SIGPIPE. A program could fail with SIGPIPE due to a purely internal error, or due to a connection loss, etc. This is why I consider my argument “debatable” and say “it could be considered a bug in the programs called by the shell” - If the SIGPIPE occurs in a scenario where it should not be treated as an error, then arguably “e:sort” and so on should not terminate with a non-zero exit code. I think it’s a fair argument that people should simply recognize this and capture errors or use try/catch when running a pipeline that could reasonably be expected to SIGPIPE. But it’s very typical for SIGPIPE to simply indicate that a pipeline has shut down. I don’t think there is a set of criteria that can be applied to reliably distinguish between a “pipeline SIGPIPE” and an “internal error SIGPIPE” - the sequence in which processes terminate isn’t a reliable indicator because SIGPIPE is triggered by the consumer closing its input, which could happen before termination - and if we said “SIGPIPEs aren’t exceptions if they’re generated by producers in a pipeline” there’s always the chance that we’re suppressing some true internal failure.
(I think treating process exit codes as exceptions is a good idea, though a challenging one to resolve against a tradition in which exit codes mostly don’t matter…)
About this issue
- State: closed
- Created 4 years ago
- Comments: 60 (42 by maintainers)
Commits related to this issue
- Treat SIGPIPE termination as a successful exit Resolves #952 — committed to krader1961/elvish by krader1961 4 years ago
- Detect and suppress SIGPIPE caused by the next command in a pipeline exiting early. This addresses #952. — committed to elves/elvish by xiaq 3 years ago
Getting back to this again.
First let me respond to the comments that are most relevant for the resolution of the issue.
I claimed that “it is not uncommon for networking programs to crash with SIGPIPE”, so I went digging a bit. I thought that this was possible for the openssh client and curl, since I’ve seen both printing “broken pipe” and exiting when the network connection drops. But it turns out that both programs actually ignore SIGPIPE (openssh, curl). What’s responsible for the error message is most likely the return value from `write` being `EPIPE` (and the canonical string message for `EPIPE` in GNU libc is “broken pipe”).

I checked a few more well-known programs. OpenBSD netcat, socat and GNU wget all ignore SIGPIPE, and so do BIND 9 and its utilities (nslookup, dig, etc.).
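What a program that ignores SIGPIPE sees instead can be shown in a few lines of Python (CPython ignores SIGPIPE at startup, so a write to a broken pipe surfaces as an `EPIPE` error rather than a signal):

```python
import errno
import os

# With SIGPIPE ignored, write() to a pipe whose read end is closed fails
# with EPIPE - the errno whose canonical message in GNU libc is "broken pipe".
r, w = os.pipe()
os.close(r)  # the "reader" goes away
try:
    os.write(w, b"data")
except BrokenPipeError as e:
    print(e.errno == errno.EPIPE)  # True
```

This is the code path openssh and curl take: the failed `write` is reported through ordinary error handling, and the shell sees whatever exit status the program chooses.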
However, busybox’s wget does not ignore SIGPIPE, and it can plausibly be used in a pipeline, for example to download a file and extract it without saving it to a temporary file, with something like `busybox wget -O - $url | tar xzf -`.

The `busybox wget` command can be killed by SIGPIPE either because of a network error, or because the next command in the pipeline exits earlier than it does. This is where an additional check on whether the read end of the pipe is actually closed makes a difference: it can distinguish the two cases.

In fact, searching for “SIGPIPE” in busybox’s /networking directory indicates that it’s a measured choice, rather than an oversight, to not ignore SIGPIPE in its wget implementation. Almost all the server programs ignore SIGPIPE because they don’t want a single client to crash the whole server; the only exception is `dnsd`, and that’s because busybox’s dnsd only supports UDP, and only TCP sockets can generate SIGPIPE. On the other hand, none of the client programs ignore SIGPIPE, because for them the default behavior of SIGPIPE (terminating the program) is a reasonable strategy for handling a closed connection.

I don’t question your credentials here, but I’d take your conclusion with a grain of salt.
Scripts in traditional shells tend to have a rather “meh” approach towards any kind of error handling. A “casual” automation script probably doesn’t run with `set -e` or `set -o pipefail`, nor does it religiously check the exit code of every command. The default in traditional shell scripting is, really, to just keep going regardless of what has happened. If the script doesn’t do what is intended, eventually the sysadmin will notice and debug it. A typical debugging session usually involves examining the stderr of the script, rarely the exit code of each command. The latter is not made easy by traditional shells, because (1) the exit code is kept in a variable, `$?`, that gets overwritten immediately by the next command, and (2) unless you use bash, which has `$PIPESTATUS`, only the exit status of the last command in a pipeline is available. Even if there were a case where a networking program did exit with SIGPIPE, how would you notice?

I am speaking from my experience of course, and my assumptions of how UNIX sysadmins operate may not apply to your experience. If your experience does involve more rigorous approaches towards error handling in traditional shell scripts, I’d be very curious to hear about it and happy to be proven wrong.
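That invisibility is easy to demonstrate (a sketch assuming `sh`, `yes` and `head` are available; the bash half is skipped if bash isn’t installed):

```python
import shutil
import subprocess

# In a POSIX shell, $? after a pipeline reports only the LAST command's
# status, so the producer's SIGPIPE death (128 + 13 = 141) is invisible:
out = subprocess.run(["sh", "-c", "yes | head -n 1 >/dev/null; echo $?"],
                     capture_output=True, text=True)
print(out.stdout.strip())  # 0 - head's status; yes's 141 is lost

# bash's PIPESTATUS array does expose every stage's status:
if shutil.which("bash"):
    out = subprocess.run(
        ["bash", "-c", "yes | head -n 1 >/dev/null; echo ${PIPESTATUS[@]}"],
        capture_output=True, text=True)
    print(out.stdout.strip())  # e.g. "141 0"
```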
There is another reason I’d take your conclusion with a grain of salt: IIUC your experience was mostly in dealing with UNIX-like systems running in enterprise environments. I can only guess about your work environment, but enterprise setups usually mean stable network environments, and when the network does go down, one probably has more to worry about than a few shell scripts misbehaving.
Well, POSIX-like shells ignore any error from commands that are not the last in a pipeline, and the ignoring of SIGPIPE is a logical consequence of that. To emulate POSIX-like shells would be to suppress exceptions raised in any command that is not the last in the pipeline, and that’s a bad idea. Suppressing SIGPIPE only is an innovation.
So I maintain that the suppression of SIGPIPE should only happen when Elvish detects that the read end of the output pipe has indeed been closed. In fact, I claim that this is not just a better heuristic, it is a near-optimal heuristic.
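As an aside on mechanism: one plausible way to perform that check (a sketch, not necessarily Elvish’s actual implementation) is to poll the write end of the pipe, which the shell still holds. On Linux, `poll(2)` reports `POLLERR` on the write end of a pipe once the read end has been closed (see pipe(7)):

```python
import os
import select

# poll() the WRITE end of a pipe to learn whether the reader has gone away.
r, w = os.pipe()
poller = select.poll()
poller.register(w)

events = dict(poller.poll(0))
print(bool(events.get(w, 0) & select.POLLERR))  # False: reader still open

os.close(r)  # the reader goes away
events = dict(poller.poll(0))
print(bool(events.get(w, 0) & select.POLLERR))  # True: read end closed
```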
To see this, let’s consider the pipeline `network-cmd | filter-cmd`, where `network-cmd` is a networking program that can be killed by SIGPIPE either due to a closed TCP connection or due to `filter-cmd` closing the read end of its output. Now, when Elvish detects that it was killed by SIGPIPE, there are several possible situations:

1. The read end of the output has not been closed. Elvish does not suppress the SIGPIPE (and turns it into an exception). Because Elvish knows that the read end of the output has not been closed, the SIGPIPE must be caused by something else; this is always the correct behavior.

2. The read end of the output has been closed, meaning that `filter-cmd` has exited without consuming all of the output of `network-cmd`. Elvish suppresses the SIGPIPE, as if the command exited normally. There are two possibilities regarding what actually happened to `network-cmd`:

   2a. It is likely that the program was indeed killed by writing to the closed pipe. In this case, the behavior of Elvish is correct.

   2b. It is still possible that the SIGPIPE was actually caused by a closed TCP connection. However, in this case, it’s still true that `filter-cmd` has exited, so the output of the whole pipeline is already finalized and would not be impacted by whatever `network-cmd` would write out later. So even if `network-cmd` did run into some network error, it is safe to ignore that error since it’s irrelevant.

There are some remaining edge cases in 2b where Elvish’s handling (suppressing the SIGPIPE exit code) is not ideal. `network-cmd` may have some additional side effect besides writing to stdout - for example, logging the request and/or response headers to an additional file. When there is a network error, such side effects may be incomplete, and arguably an exception should be raised. However, it’s not possible to detect such side effects in general, and I’m happy to live with the false negatives in such edge cases.

Finally, I want to solve the problem in the context of not just UNIX external programs, but also builtin commands, which currently lack a SIGPIPE-like mechanism for terminating writers whose reader has gone away (#923 is relevant here).
Here’s the design I plan to implement:
- Introduce a new error type, say `ErrOutputReaderGone`, to denote that the reader of the output is gone;
- Change builtin output commands (`echo`, `put`, etc.) to raise `ErrOutputReaderGone` when they detect that the reader of the output is gone;
- When an external command gets terminated by SIGPIPE, check whether the read end of its output has indeed been closed, and if so, turn that into `ErrOutputReaderGone`;
- In a pipeline, if any command other than the last one raises `ErrOutputReaderGone`, silently swallow the exception.

I went ahead and created issue #1151 for elvish options to external commands. So please let the discussion continue there. I think @xiaq ought to merge the pull request from @krader1961 and close this issue; and then we can continue discussing my options proposal at a (much) more leisurely pace.
Not sure about the confusion - personally I think I’ve been pretty consistent on that point: when a consumer stops consuming, the producer may get SIGPIPE. As you say, it won’t always happen.
Generally I still lean in the direction that SIGPIPE should not be considered an error - because by convention, in many common cases it is not. Terminating on SIGPIPE is the default behavior and many programs do not override that. I also believe a shell needs to work with the programs it will be running. If it starts feeling like it’s working against them I think that’s a problem. And from that perspective, responding to a common Unix idiom by reporting an unhandled exception is not ideal to say the least.
But at the same time, I understand why Elvish would treat this condition as an exception, and why it might even be worth causing a little friction in order to push for something better. If a process exits in response to a signal, that means it got the signal and didn’t handle it, and as a result the default behavior kicked in - which for that signal is to terminate the process. If you make the integrity of your scripting environment a priority, then for a program to get halted by a signal is kind of a big deal. So from that perspective I can appreciate the argument that an uncaught signal should be treated as an error. On that view, every program should either catch or disable SIGPIPE and deal with broken connections through its own internal logic, so that, as you say, if a program did fail due to a broken pipe, the shell wouldn’t get a SIGPIPE status from it. Then the only time the shell would ever see a process exiting with SIGPIPE status would be if that program were broken - because we now define “falling back to the default handling of SIGPIPE and terminating on that signal” to mean the program is “broken”, since that kind of exit gives us no assurance that the program actually terminated gracefully and made a considered decision about whether it should return an error status.
The problem is that’s not the reality we’re dealing with here. There’s loads of software out there that’s coded to the default and exits on SIGPIPE. There are loads of programmers out there who will stubbornly refuse to change the behavior of their programs because they want to uphold Unix tradition or believe the Elvish approach to be overly pedantic. And there are users who won’t have patience for a shell that reports errors on a regular basis where there seemingly are none. To some extent I think the shell needs to work with software as it exists rather than pushing for it to work a different way - and from that perspective I’d say the pragmatic reality of the situation is that SIGPIPE is not generally an error.
I guess I’m repeating myself here and I don’t mean to, I’m just trying to think through all this.
There seem to be some misunderstandings about when and why the kernel sends SIGPIPE to a process. Perhaps the most important thing to realize is that just because the LHS process is not killed by SIGPIPE does not mean all the data it wrote was read.
Consider a command that writes 8KB or less of data before it terminates. Let’s call it `cmd1`. Why 8KB? Because I’m not aware of any UNIX-like OS that buffers less than 8KB - going all the way back to the early 1980s. If you do `cmd1 | cmd2`, is `cmd1` guaranteed not to receive SIGPIPE, even though all the data it writes fits in the pipe buffer? No! There are four scenarios:

1. `cmd2` reads all the data before it exits. No SIGPIPE is sent to `cmd1`. This is the usual case.
2. `cmd2` reads some, but not all, of the data before it exits. No SIGPIPE is sent to `cmd1`. This is the second most common case.
3. `cmd2` reads none of the data and exits, closing the read side of the pipe, after `cmd1` has written its data and exited. No SIGPIPE is sent to `cmd1`. All the data written by `cmd1` is simply discarded. This is rare but not unheard of.
4. `cmd2` reads nothing (or a fraction of the data) and exits before `cmd1` writes all its data. SIGPIPE is sent to `cmd1`. Any data already written by `cmd1` but not read by `cmd2` is discarded.

It should be obvious that it doesn’t matter how much data is buffered by the pipe. SysVR4 STREAMS-based pipes, for example, usually buffer 128KB. A larger buffer simply makes it less likely that the `cmd1` process will receive SIGPIPE. It does not change the probability that some (or all) of the data written by `cmd1` will never be read and will be silently discarded by the kernel.

Consider the `ps waux | grep -q pattern` example that caused me to change my mind. I don’t care if all of the `ps waux` output is read. More importantly, I don’t care if the `ps` process is killed by SIGPIPE because the `grep` found a match and exited before `ps` had written all its data.

Consider a canonical useless-use-of-cat example: `cat a_file | cmd2`. Here, again, `cmd2` only reads some of the data. We don’t care that the `cat` might receive SIGPIPE depending on how much data is in a_file, how much data the pipe buffers, and how much data `cmd2` reads before exiting. The example is equivalent to `cmd2 < a_file`. We only care about the exit status of `cmd2`.

I’ve been thinking about this a lot. In the thirty-five years I’ve been using, and supporting, UNIX I can’t think of a single instance where it was necessary for the shell to report that a pipe LHS process died from SIGPIPE. There have been some cases where the LHS process needed to clean up after receiving a SIGPIPE, but obviously the process has to install a SIGPIPE signal handler to do that. That signal handler either exits with a zero or non-zero status as appropriate for the situation. But note that the shell is then treating the explicit status as success or failure using the usual rules. The shell never even knows that the process handled SIGPIPE in that case.
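Scenarios 3 and 4 from the list above can be made concrete with a sketch (timing is forced with a `sleep`, and the only buffer-size assumption is that a short line fits in the pipe buffer):

```python
import signal
import subprocess

# Scenario 3: the consumer reads nothing, but the producer's output fits in
# the pipe buffer and the producer exits first - no SIGPIPE, data discarded.
p1 = subprocess.Popen(["sh", "-c", "echo hello"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["sh", "-c", "sleep 1"], stdin=p1.stdout)
p1.stdout.close()
print(p1.wait())  # 0: echo never noticed that nobody read its output
p2.wait()

# Scenario 4: the consumer exits while the producer is still writing - the
# producer's next write triggers SIGPIPE.
p3 = subprocess.Popen(["yes"], stdout=subprocess.PIPE)
p4 = subprocess.Popen(["head", "-n", "1"], stdin=p3.stdout,
                      stdout=subprocess.DEVNULL)
p3.stdout.close()
p4.wait()
print(p3.wait() == -signal.SIGPIPE)  # True
```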
It’s important to note that using bash’s `set -o pipefail` does not cause it to report an error if the LHS of a pipe fails due to SIGPIPE. It only causes it to report a failure for any non-zero status other than termination due to SIGPIPE. Elvish is different from every other shell with respect to treating SIGPIPE as an error.

Elvish does implement pipefail semantics. You can see this with a trivial experiment:
Well, I can see the value in taking the hard line that a SIGPIPE termination is a failure like any other signal exit. And I think it’s a fair point that it may amount to little more than a nuisance in the end. I think my main reservation is simply that while Elvish is taking its own direction, it’s still aiming to fill that “Unix shell” role as well - and as such, somehow, to some extent, it must play well with that existing software. I think if there’s too much friction there, if it’s too awkward or complicated to work with regular Unix tools in Elvish, that would be a real problem. I still feel like erroring on SIGPIPE is at the very least an unfortunate bit of friction, because causing SIGPIPEs is practically idiomatic in Unix shell, and it’s very common for programs to simply let themselves be terminated by the signal.
I don’t know that I have a good solution for how to treat SIGPIPE as an error and work and play well with pre-existing software. I think it’s a complicated problem that stems, quite simply, from making a Unix shell that’s very unlike a Unix shell. And I 100% support that mission, it’s something I’ve been interested in as well, for a long time. The way I see it, a new shell needs to interface well with regular, existing programs, or people just won’t use it - and that always seems to limit where I can go with a design. There’s always a compromise somewhere. But as much as it frustrates me, when I try to think about what the shell could be - I think finding a way to work well with existing stuff is absolutely vital, even if it’s at the cost of other design goals.
I don’t know why you’re talking about builtins. I even prefixed “sort” and “head” with “e:” so there would be no question in my example that I’m talking about external commands. The question is strictly whether the shell should raise an exception for this condition, which IMO is a very common, non-failure case with existing shell utilities. Was I not clear about the situation I’m describing? Let me try again:
If you have a Unix pipeline like this: `cmd1 | cmd2`

The data exchange between the two processes is almost unidirectional, from cmd1 to cmd2. But in fact some information from cmd2 does make its way back to cmd1: if cmd2 is not consuming the data fast enough, the pipe buffer will fill up. If cmd2 closes the pipe, cmd1 will see that the pipe is broken and can respond by shutting itself down. And SIGPIPE effectively makes this behavior the default: if a program tries to write to a pipe whose read end is closed, it will receive SIGPIPE and terminate.
The issue here is that this default behavior is found in all sorts of Unix utilities, and it’s not an error condition. But when you run a pipeline like that in elvish and the pipe breaks and the data producer terminates with SIGPIPE, elvish sees that the program exited with a non-zero result code and raises an exception:
While I think it makes sense to treat non-zero exit codes from external programs as exceptions, SIGPIPE is, IMO, an exception to that rule, because it usually does not indicate a failure. I understand this creates some disparity in the handling of exit codes, but overall I think it probably makes more sense, when an external tool terminates with SIGPIPE, to not treat it as an error.