ksh: Failure to launch external commands due to race condition in 'posix_spawn' usage
There has been a bug report of an exec format error when executing “cat TODO | while read line; do ls; done”.
Ref:
- https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=965072
- https://bugs.launchpad.net/ubuntu/+source/ksh/+bug/1887863
I was able to reproduce this issue on 93u+m 2020-07-14 and it appears to be an issue at least since 93u+ 2012-08-01.
$ cat TODO
1
2
$ cat TODO | while read line; do ls; done
ls: ls: cannot execute [Exec format error]
ls: ls: cannot execute [Exec format error]
Tagging @chrisbertin and @mirabilos - reporters of the issue.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 88
Commits related to this issue
- (TEST) Remove ksh's use of posix_spawn() for external commands ksh's use of posix_spawn(2) (via AST spawnveg(3)) to create a process in which to exec an external command involves race conditions, at ... — committed to McDutchie/ksh by McDutchie 4 years ago
- Use vfork instead of posix_spawn WARNING: This is an experimental fix for the race conditions reported at ksh93/ksh#79 and att/ast#468. There may be bugs that have not yet been caug... — committed to JohnoKing/ksh by JohnoKing 4 years ago
- Use vfork instead of posix_spawn WARNING: This is an experimental fix for the race conditions reported at ksh93/ksh#79 and att/ast#468. There may be bugs that have not yet been caug... — committed to JohnoKing/ksh by JohnoKing 4 years ago
- Fix race conditions running external commands with job control on When ksh is compiled with SHOPT_SPAWN (the default), which uses posix_spawn(3) or vfork(2) (via sh_ntfork()) to launch external comma... — committed to ksh93/ksh by McDutchie 4 years ago
- Failure to launch external commands due to posix_spawn race https://github.com/ksh93/ksh/issues/79 — committed to citrus-it/ast by citrus-it 4 years ago
- posix: Add terminal control setting support for posix_spawn Currently there is no proper way to set the controlling terminal through posix_spawn() in race free manner [1]. This forces shell implemen... — committed to zatrazz/glibc by zatrazz 3 years ago
- posix: Add terminal control setting support for posix_spawn Currently there is no proper way to set the controlling terminal through posix_spawn in race free manner [1]. This forces shell implementa... — committed to kraj/glibc by zatrazz 3 years ago
- Implement support for POSIX_SPAWN_TCSETPGROUP & fix spawnveg bugs Note that this is still a WIP. Current TODOs: - Error out if posix_spawnattr_tcsetpgrp_np is detected but posix_spawn is not. ... — committed to JohnoKing/ksh by JohnoKing 2 years ago
- Implement support for POSIX_SPAWN_TCSETPGROUP & fix spawnveg bugs This commit implements support for the glibc POSIX_SPAWN_TCSETPGROUP and posix_spawnattr_tcsetpgrp_np extension [1]. This was done wi... — committed to JohnoKing/ksh by JohnoKing 2 years ago
- Implement support for POSIX_SPAWN_TCSETPGROUP & fix spawnveg bugs This commit implements support for the glibc POSIX_SPAWN_TCSETPGROUP and posix_spawnattr_tcsetpgrp_np extensions[1]. This was done wi... — committed to JohnoKing/ksh by JohnoKing 2 years ago
- Implement support for POSIX_SPAWN_TCSETPGROUP & fix spawnveg bugs This commit implements support for the glibc POSIX_SPAWN_TCSETPGROUP and posix_spawnattr_tcsetpgrp_np extensions[1]. This was done wi... — committed to JohnoKing/ksh by JohnoKing 2 years ago
- Implement support for POSIX_SPAWN_TCSETPGROUP (#438) This commit implements support for the glibc POSIX_SPAWN_TCSETPGROUP and posix_spawnattr_tcsetpgrp_np extensions[1]. This was done with the inte... — committed to ksh93/ksh by JohnoKing 2 years ago
- Support glibc 2.35's posix_spawn_file_actions_addtcsetpgrp_np(3) This commit implements support for the glibc 2.35 posix_spawn_file_actions_addtcsetpgrp_np(3) extension[2][3], updating spawnveg(3) to... — committed to ksh93/ksh by JohnoKing 2 years ago
Here are some new tests on my Mac laptop.
With ksh’s racy use of
posix_spawn
:With ksh’s
vfork
fallback, after applying the patch based on OpenSUSE’s:…and, for reference, the classic
fork
/exec
that ksh2020 reverted to:@JohnoKing’s Linux tests and my own Mac tests (above) suggest that
vfork
performs just as well asposix_spawn
(if there is any difference, it’s lost in the margin of error), and it doesn’t have these problems. So in terms of performance, this is not going to be a significant loss at all.I just merge this patch [1] and it will be included on glibc 2.35.
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=342cc934a3bf74ac618e2318d738f22ac93257ba
I’ll note that there is currently a patch that adds a
POSIX_SPAWN_TCSETPGROUP
flag to glibc: https://patchwork.sourceware.org/project/glibc/patch/20211223173003.1037286-1-adhemerval.zanella@linaro.orgIf that patch gets merged into glibc’s upstream codebase then perhaps we could use that flag to safely avoid usage of fork() in interactive shells.
So, after this slightly desperate megathread, could the fix really be this trivial? Please test…
@hyenias, yes – disabling
SHOPT_SPAWN
(by adding-DSHOPT_SPAWN=0
to the compiler option) will still revert to the classic fork/exec mechanism, just as it does now. The patch causes the ASTspawnveg()
function to use vfork, but disablingSHOPT_SPAWN
causesspawnveg()
to not be used at all.That’s what Apple has been doing all along for their /bin/ksh, which is how I know this. I found I had to remove
-DSHOPT_SPAWN=0
as a default compiler option fromcc.darwin
in order for vfork to be used.Is this 2020 bug relevant?
https://github.com/att/ast/pull/470 which leads to https://github.com/att/ast/issues/468
and the original ast ML discussion: https://www.mail-archive.com/ast-developers@research.att.com/msg00717.html
@aweeraman, hold off a little – this fix exposes a variant of #86 that only occurs with job control active. I’m working out a good way to fix it.
This is brilliant, @McDutchie @JohnoKing et al. Thanks for your persistence!
@chrisbertin and @mirabilos - this will be included in the next upload fairly shortly.
I think it makes sense now. If job control is not active, the terminal process group ID never needs to be set; child processes simply inherit the terminal group from the parent.
However, with job control active, each child process spawned from the main shell environment (not subshells) needs to be in its own terminal process group, or you can’t separate the jobs from each other. But you can’t do this with
posix_spawn
. Hence this bug.So it’s clear that this
posix_spawn
usage is incompatible with job control and that is why this is so trivial to fix/avoid.What I don’t understand is why it doesn’t fail consistently.
That patch is an effective fix for the race conditions, at least from my own testing.
I did receive the email. I also sent a reply.
I can confirm that the race condition is not reproducible in scripts with either
posix_spawn
orvfork
; it only occurs in interactive shells (tested on Linux and FreeBSD). This means we can keep usingposix_spawn
, but only in scripts.I’ve done some testing and the race condition still occurs on FreeBSD when
spawnveg
usesvfork
. It looks likefork
will have to be used there. ~as a reference the other shell that usesvfork
on Linux (dash) usesfork
on FreeBSD (going off of the output fromtruss
)~. The feature test will still need to be fixed though since it fails withfork
, even though the race condition doesn’t actually occur withfork
.The
ENOTTY
error makes sense: the argument totcgetpgrp(2)
is a file descriptor, and 2 is standard error. And standard error is not on a tty; iffe redirects standard error (to /dev/null, probably) so that negative test results don’t clutter the terminal with compiler errors.I know how I would deal with that in shell: I’d use a
2>/dev/tty
redirection. In C I’d probably just open /dev/tty usingopen(2)
and use the returned file descriptor instead of 2, something like this:Re the crash (which I can reproduce on Linux but not on the Mac), I think it makes some vague kind of sense that this would terminate an interactive shell, as you’re running
tcsetpgrp
on the terminal associated with the interactive shell, which I think could cause an inconsistent state. I could be off base there, but my testing says that it works when run non-interactively, like:I note that on Linux, $? is 0, but on the Mac and FreeBSD, it is 1… bad news?
The
posix_spawn
API does not have a standard way of setting the terminal group in a child process.POSIX_SPAWN_SETPGROUP
exists, but that doesn’t set the terminal group. From Glenn Fowler’s response to a similar question on the old mailing list:Section of code in
spawnveg
that usesPOSIX_SPAWN_SETPGROUP
: https://github.com/ksh93/ksh/blob/bd88cc7f4f7207b48104e37cca90848adf5cb10c/src/lib/libast/comp/spawnveg.c#L62-L63As far as extensions to
posix_spawn
are concerned, the only one I’m aware of that may fix this problem isPOSIX_SPAWN_TCSETPGROUP
, which is only present in QNX and BlackBerry OS: http://www.qnx.com/developers/docs/7.0.0/index.html#com.qnx.doc.neutrino.lib_ref/topic/p/posix_spawn.html https://developer.blackberry.com/native/reference/core/com.qnx.doc.neutrino.lib_ref/topic/p/posix_spawn.htmlI found the trigger for the freeze. It is the
redirect 2>/dev/null
at the end of the test, which suppresses the extra newline ksh prints at the end of an interactive session. Removing that fixes the freeze, but prints an extra newline while running the tests. But an explicitexit
also works to suppress that newline, and does not cause a freeze.Here is a new version of the regression test that avoids the freeze. Note that stderr needs to be on a terminal to trigger a test failure on ksh with
posix_spawn
usage.This is the code
spawnveg
runs in the child process afterfork
andvfork
: https://github.com/ksh93/ksh/blob/bd88cc7f4f7207b48104e37cca90848adf5cb10c/src/lib/libast/comp/spawnveg.c#L207-L225tcsetpgrp
andioctl
aren’t run when launching commands. Insertingerror(ERROR_warn(0)
in theif(m)
statements can be used to confirm it:This issue can be fixed for
vfork
andfork
by runningtcsetpgrp
(orioctl
) beforesetpgid
. Below is a quick patch that does that:After applying the above patch, I’m no longer able to reproduce the race condition with
fork
orvfork
whenSHOPT_SPAWN
is enabled (reproducer). edit: This only fixes the race condition on FreeBSD whenspawnveg
usesfork
.vfork
can’t be used on FreeBSD, although it works fine on Linux (see https://github.com/ksh93/ksh/issues/79#issuecomment-661426491).Since the thread is now quite long, here are direct links to the OpenSUSE patches: Disable
posix_spawn
(fallback tovfork
): https://build.opensuse.org/package/view_file/shells/ksh/ksh-qemu.patch?expand=1Disable
vfork
(combine with above patch to forcespawnveg
to usefork
): https://build.opensuse.org/package/view_file/shells/ksh/ksh93-disable-vfork.dif?expand=1I was reading through this issue’s content again. In it, you referenced a Tue, 14 Jun 2011 13:33:15 -0700 email from David Korn that stated:
Is it possible to create a shim program that performs these two lines of code then executes the original program? If so, does that resolve the issue, or would that be too messy and be slower than the fork solution? The shim program might even be able to be a builtin if possible. Perhaps an environment flag passed along to the subshell that tells ksh to fixup the terminal.
vfork() is a pre-VM, or, at least pre-COW, semi-kludge and I thought it would have been retired by now… IMHO…
Chris (chris.bertin@gmail.com or chris.bertin@protonmail.com)
On Sat, Jul 18, 2020 at 9:00 AM Martijn Dekker notifications@github.com wrote:
@JohnoKing: Thank you for your review of the vfork caveats. Would it be possible to introduce the vfork implementation so one could selectively compile using vfork, posix_spawn, or fork [something like
SHOPT_SPAWN
]? I think having the ability to alter the way ksh creates a child process may assist in confirming its use and/or troubleshooting in the future. It also might be helpful to have some feature flag identifying it.The
lib_poll_fd_2
feature test doesn’t need to be removed to fix theposix_spawn
related bugs. That change was likely made to fix something specific to QEMU.Some notes regarding ksh usage of
vfork
:The libast spawnveg function uses the shared memory of
vfork
to its advantage. There is aniffe
feature test for detecting this part ofvfork
behavior: https://github.com/ksh93/ksh/blob/37a9c34515c8c966447f75b10a5205934ca51d7d/src/lib/libast/features/lib#L228-L241The
sigcritical
function has a comment aboutvfork
, which is related to the first caveat in thevfork
man page.: https://github.com/ksh93/ksh/blob/37a9c34515c8c966447f75b10a5205934ca51d7d/src/lib/libast/misc/sigcrit.c#L167-L173The multi-threading caveat in the
vfork
man page isn’t relevant because ksh is a single-threaded program. There are multi-threaded ksh plugins, but those are incompatible with vmalloc, which ksh uses instead of standard malloc (see https://github.com/att/ast/issues/1449#issuecomment-581157955).If
posix_spawn
is removed it shouldn’t be done by removing theSHOPT_SPAWN
ifdef; OpenSUSE’s patch should be instead. Results from performance testing on my Linux machine (ksh is compiled withCCFLAGS=-O2
):Relevant
strace
output fromksh -c '/usr/bin/true; /usr/bin/true'
:The OpenSUSE patch causes
vfork
to be used instead ofposix_spawn
which appears to increase performance slightly while still fixing the exec format error. RemovingSHOPT_SPAWN
causes the OS implementation offork
to be used instead ofvfork
andposix_spawn
, which is slower (on Linuxfork
uses clone, but that doesn’t make up for the performance deficit).Pointing out, tho, that if you follow those 2020 bugs all the way through, they lead you back to the performance degradation that was an large part of the original ksh2020 kerfuffle.
On macOS, the compiler flags used by the Apple version (which I’ve incorporated into
src/cmd/INIT/cc.darwin
) disableSHOPT_SPAWN
. I can confirm that, if I change that flag to-DSHOPT_SPAWN=1
and recompile from scratch, I can reproduce this bug on the Mac.I can also reproduce this bug on FreeBSD, on which
SHOPT_SPAWN
is not disabled by default.Neither of these two systems use GNU libc, so it seems likely that this is a bug in ksh and not in any C library. Does anyone here have an idea how to fix it?
Alternatively, I wonder if there would be any disadvantage to simply removing the use of
posix_spawn
altogether.I can confirm that disabling
posix_spawn
causes this bug to go away. OpenSUSE disables it in their build of ksh with a patch titled ksh-qemu.patch, which implies there is another bug related toposix_spawn
that affects QEMU.