ksh: Failure to launch external commands due to race condition in 'posix_spawn' usage
There has been a bug report of an exec format error when executing “cat TODO | while read line; do ls; done”.
Ref:
- https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=965072
- https://bugs.launchpad.net/ubuntu/+source/ksh/+bug/1887863
I was able to reproduce this issue on 93u+m 2020-07-14 and it appears to be an issue at least since 93u+ 2012-08-01.
$ cat TODO
1
2
$ cat TODO | while read line; do ls; done
ls: ls: cannot execute [Exec format error]
ls: ls: cannot execute [Exec format error]
Tagging @chrisbertin and @mirabilos - reporters of the issue.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 88
Commits related to this issue
- (TEST) Remove ksh's use of posix_spawn() for external commands ksh's use of posix_spawn(2) (via AST spawnveg(3)) to create a process in which to exec an external command involves race conditions, at ... — committed to McDutchie/ksh by McDutchie 4 years ago
- Use vfork instead of posix_spawn WARNING: This is an experimental fix for the race conditions reported at ksh93/ksh#79 and att/ast#468. There may be bugs that have not yet been caug... — committed to JohnoKing/ksh by JohnoKing 4 years ago
- Use vfork instead of posix_spawn WARNING: This is an experimental fix for the race conditions reported at ksh93/ksh#79 and att/ast#468. There may be bugs that have not yet been caug... — committed to JohnoKing/ksh by JohnoKing 4 years ago
- Fix race conditions running external commands with job control on When ksh is compiled with SHOPT_SPAWN (the default), which uses posix_spawn(3) or vfork(2) (via sh_ntfork()) to launch external comma... — committed to ksh93/ksh by McDutchie 4 years ago
- Failure to launch external commands due to posix_spawn race https://github.com/ksh93/ksh/issues/79 — committed to citrus-it/ast by citrus-it 4 years ago
- posix: Add terminal control setting support for posix_spawn Currently there is no proper way to set the controlling terminal through posix_spawn() in race free manner [1]. This forces shell implemen... — committed to zatrazz/glibc by zatrazz 3 years ago
- posix: Add terminal control setting support for posix_spawn Currently there is no proper way to set the controlling terminal through posix_spawn in race free manner [1]. This forces shell implementa... — committed to kraj/glibc by zatrazz 3 years ago
- Implement support for POSIX_SPAWN_TCSETPGROUP & fix spawnveg bugs Note that this is still a WIP. Current TODOs: - Error out if posix_spawnattr_tcsetpgrp_np is detected but posix_spawn is not. ... — committed to JohnoKing/ksh by JohnoKing 2 years ago
- Implement support for POSIX_SPAWN_TCSETPGROUP & fix spawnveg bugs This commit implements support for the glibc POSIX_SPAWN_TCSETPGROUP and posix_spawnattr_tcsetpgrp_np extension [1]. This was done wi... — committed to JohnoKing/ksh by JohnoKing 2 years ago
- Implement support for POSIX_SPAWN_TCSETPGROUP & fix spawnveg bugs This commit implements support for the glibc POSIX_SPAWN_TCSETPGROUP and posix_spawnattr_tcsetpgrp_np extensions[1]. This was done wi... — committed to JohnoKing/ksh by JohnoKing 2 years ago
- Implement support for POSIX_SPAWN_TCSETPGROUP & fix spawnveg bugs This commit implements support for the glibc POSIX_SPAWN_TCSETPGROUP and posix_spawnattr_tcsetpgrp_np extensions[1]. This was done wi... — committed to JohnoKing/ksh by JohnoKing 2 years ago
- Implement support for POSIX_SPAWN_TCSETPGROUP (#438) This commit implements support for the glibc POSIX_SPAWN_TCSETPGROUP and posix_spawnattr_tcsetpgrp_np extensions[1]. This was done with the inte... — committed to ksh93/ksh by JohnoKing 2 years ago
- Support glibc 2.35's posix_spawn_file_actions_addtcsetpgrp_np(3) This commit implements support for the glibc 2.35 posix_spawn_file_actions_addtcsetpgrp_np(3) extension[2][3], updating spawnveg(3) to... — committed to ksh93/ksh by JohnoKing 2 years ago
Here are some new tests on my Mac laptop.
With ksh’s racy use of
posix_spawn:With ksh’s
vforkfallback, after applying the patch based on OpenSUSE’s:…and, for reference, the classic
fork/execthat ksh2020 reverted to:@JohnoKing’s Linux tests and my own Mac tests (above) suggest that
vforkperforms just as well asposix_spawn(if there is any difference, it’s lost in the margin of error), and it doesn’t have these problems. So in terms of performance, this is not going to be a significant loss at all.I just merge this patch [1] and it will be included on glibc 2.35.
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=342cc934a3bf74ac618e2318d738f22ac93257ba
I’ll note that there is currently a patch that adds a
POSIX_SPAWN_TCSETPGROUPflag to glibc: https://patchwork.sourceware.org/project/glibc/patch/20211223173003.1037286-1-adhemerval.zanella@linaro.orgIf that patch gets merged into glibc’s upstream codebase then perhaps we could use that flag to safely avoid usage of fork() in interactive shells.
So, after this slightly desperate megathread, could the fix really be this trivial? Please test…
@hyenias, yes – disabling
SHOPT_SPAWN(by adding-DSHOPT_SPAWN=0to the compiler option) will still revert to the classic fork/exec mechanism, just as it does now. The patch causes the ASTspawnveg()function to use vfork, but disablingSHOPT_SPAWNcausesspawnveg()to not be used at all.That’s what Apple has been doing all along for their /bin/ksh, which is how I know this. I found I had to remove
-DSHOPT_SPAWN=0as a default compiler option fromcc.darwinin order for vfork to be used.Is this 2020 bug relevant?
https://github.com/att/ast/pull/470 which leads to https://github.com/att/ast/issues/468
and the original ast ML discussion: https://www.mail-archive.com/ast-developers@research.att.com/msg00717.html
@aweeraman, hold off a little – this fix exposes a variant of #86 that only occurs with job control active. I’m working out a good way to fix it.
This is brilliant, @McDutchie @JohnoKing et al. Thanks for your persistence!
@chrisbertin and @mirabilos - this will be included in the next upload fairly shortly.
I think it makes sense now. If job control is not active, the terminal process group ID never needs to be set; child processes simply inherit the terminal group from the parent.
However, with job control active, each child process spawned from the main shell environment (not subshells) needs to be in its own terminal process group, or you can’t separate the jobs from each other. But you can’t do this with
posix_spawn. Hence this bug.So it’s clear that this
posix_spawnusage is incompatible with job control and that is why this is so trivial to fix/avoid.What I don’t understand is why it doesn’t fail consistently.
That patch is an effective fix for the race conditions, at least from my own testing.
I did receive the email. I also sent a reply.
I can confirm that the race condition is not reproducible in scripts with either
posix_spawnorvfork; it only occurs in interactive shells (tested on Linux and FreeBSD). This means we can keep usingposix_spawn, but only in scripts.I’ve done some testing and the race condition still occurs on FreeBSD when
spawnvegusesvfork. It looks likeforkwill have to be used there. ~as a reference the other shell that usesvforkon Linux (dash) usesforkon FreeBSD (going off of the output fromtruss)~. The feature test will still need to be fixed though since it fails withfork, even though the race condition doesn’t actually occur withfork.The
ENOTTYerror makes sense: the argument totcgetpgrp(2)is a file descriptor, and 2 is standard error. And standard error is not on a tty; iffe redirects standard error (to /dev/null, probably) so that negative test results don’t clutter the terminal with compiler errors.I know how I would deal with that in shell: I’d use a
2>/dev/ttyredirection. In C I’d probably just open /dev/tty usingopen(2)and use the returned file descriptor instead of 2, something like this:Re the crash (which I can reproduce on Linux but not on the Mac), I think it makes some vague kind of sense that this would terminate an interactive shell, as you’re running
tcsetpgrpon the terminal associated with the interactive shell, which I think could cause an inconsistent state. I could be off base there, but my testing says that it works when run non-interactively, like:I note that on Linux, $? is 0, but on the Mac and FreeBSD, it is 1… bad news?
The
posix_spawnAPI does not have a standard way of setting the terminal group in a child process.POSIX_SPAWN_SETPGROUPexists, but that doesn’t set the terminal group. From Glenn Fowler’s response to a similar question on the old mailing list:Section of code in
spawnvegthat usesPOSIX_SPAWN_SETPGROUP: https://github.com/ksh93/ksh/blob/bd88cc7f4f7207b48104e37cca90848adf5cb10c/src/lib/libast/comp/spawnveg.c#L62-L63As far as extensions to
posix_spawnare concerned, the only one I’m aware of that may fix this problem isPOSIX_SPAWN_TCSETPGROUP, which is only present in QNX and BlackBerry OS: http://www.qnx.com/developers/docs/7.0.0/index.html#com.qnx.doc.neutrino.lib_ref/topic/p/posix_spawn.html https://developer.blackberry.com/native/reference/core/com.qnx.doc.neutrino.lib_ref/topic/p/posix_spawn.htmlI found the trigger for the freeze. It is the
redirect 2>/dev/nullat the end of the test, which suppresses the extra newline ksh prints at the end of an interactive session. Removing that fixes the freeze, but prints an extra newline while running the tests. But an explicitexitalso works to suppress that newline, and does not cause a freeze.Here is a new version of the regression test that avoids the freeze. Note that stderr needs to be on a terminal to trigger a test failure on ksh with
posix_spawnusage.This is the code
spawnvegruns in the child process afterforkandvfork: https://github.com/ksh93/ksh/blob/bd88cc7f4f7207b48104e37cca90848adf5cb10c/src/lib/libast/comp/spawnveg.c#L207-L225tcsetpgrpandioctlaren’t run when launching commands. Insertingerror(ERROR_warn(0)in theif(m)statements can be used to confirm it:This issue can be fixed for
vforkandforkby runningtcsetpgrp(orioctl) beforesetpgid. Below is a quick patch that does that:After applying the above patch, I’m no longer able to reproduce the race condition with
forkorvforkwhenSHOPT_SPAWNis enabled (reproducer). edit: This only fixes the race condition on FreeBSD whenspawnvegusesfork.vforkcan’t be used on FreeBSD, although it works fine on Linux (see https://github.com/ksh93/ksh/issues/79#issuecomment-661426491).Since the thread is now quite long, here are direct links to the OpenSUSE patches: Disable
posix_spawn(fallback tovfork): https://build.opensuse.org/package/view_file/shells/ksh/ksh-qemu.patch?expand=1Disable
vfork(combine with above patch to forcespawnvegto usefork): https://build.opensuse.org/package/view_file/shells/ksh/ksh93-disable-vfork.dif?expand=1I was reading through this issue’s content again. In it, you referenced a Tue, 14 Jun 2011 13:33:15 -0700 email from David Korn that stated:
Is it possible to create a shim program that performs these two lines of code then executes the original program? If so, does that resolve the issue, or would that be too messy and be slower than the fork solution? The shim program might even be able to be a builtin if possible. Perhaps an environment flag passed along to the subshell that tells ksh to fixup the terminal.
vfork() is a pre-VM, or, at least pre-COW, semi-kludge and I thought it would have been retired by now… IMHO…
Chris (chris.bertin@gmail.com or chris.bertin@protonmail.com)
On Sat, Jul 18, 2020 at 9:00 AM Martijn Dekker notifications@github.com wrote:
@JohnoKing: Thank you for your review of the vfork caveats. Would it be possible to introduce the vfork implementation so one could selectively compile using vfork, posix_spawn, or fork [something like
SHOPT_SPAWN]? I think having the ability to alter the way ksh creates a child process may assist in confirming its use and/or troubleshooting in the future. It also might be helpful to have some feature flag identifying it.The
lib_poll_fd_2feature test doesn’t need to be removed to fix theposix_spawnrelated bugs. That change was likely made to fix something specific to QEMU.Some notes regarding ksh usage of
vfork:The libast spawnveg function uses the shared memory of
vforkto its advantage. There is aniffefeature test for detecting this part ofvforkbehavior: https://github.com/ksh93/ksh/blob/37a9c34515c8c966447f75b10a5205934ca51d7d/src/lib/libast/features/lib#L228-L241The
sigcriticalfunction has a comment aboutvfork, which is related to the first caveat in thevforkman page.: https://github.com/ksh93/ksh/blob/37a9c34515c8c966447f75b10a5205934ca51d7d/src/lib/libast/misc/sigcrit.c#L167-L173The multi-threading caveat in the
vforkman page isn’t relevant because ksh is a single-threaded program. There are multi-threaded ksh plugins, but those are incompatible with vmalloc, which ksh uses instead of standard malloc (see https://github.com/att/ast/issues/1449#issuecomment-581157955).If
posix_spawnis removed it shouldn’t be done by removing theSHOPT_SPAWNifdef; OpenSUSE’s patch should be instead. Results from performance testing on my Linux machine (ksh is compiled withCCFLAGS=-O2):Relevant
straceoutput fromksh -c '/usr/bin/true; /usr/bin/true':The OpenSUSE patch causes
vforkto be used instead ofposix_spawnwhich appears to increase performance slightly while still fixing the exec format error. RemovingSHOPT_SPAWNcauses the OS implementation offorkto be used instead ofvforkandposix_spawn, which is slower (on Linuxforkuses clone, but that doesn’t make up for the performance deficit).Pointing out, tho, that if you follow those 2020 bugs all the way through, they lead you back to the performance degradation that was an large part of the original ksh2020 kerfuffle.
On macOS, the compiler flags used by the Apple version (which I’ve incorporated into
src/cmd/INIT/cc.darwin) disableSHOPT_SPAWN. I can confirm that, if I change that flag to-DSHOPT_SPAWN=1and recompile from scratch, I can reproduce this bug on the Mac.I can also reproduce this bug on FreeBSD, on which
SHOPT_SPAWNis not disabled by default.Neither of these two systems use GNU libc, so it seems likely that this is a bug in ksh and not in any C library. Does anyone here have an idea how to fix it?
Alternatively, I wonder if there would be any disadvantage to simply removing the use of
posix_spawnaltogether.I can confirm that disabling
posix_spawncauses this bug to go away. OpenSUSE disables it in their build of ksh with a patch titled ksh-qemu.patch, which implies there is another bug related toposix_spawnthat affects QEMU.