ksh: Intermittent coprocess hang on Debian/Ubuntu and Solaris

In roughly 1 in 100 regression test runs, the following failure occurs:

test coprocess begins at 2020-09-11+19:51:40
	coprocess.sh[235]: /bin/cat coprocess 2 hung
test coprocess failed at 2020-09-11+19:51:48 with exit code 1 [ 34 tests 1 error ]

The test in question is: https://github.com/ksh93/ksh/blob/9f2066f146ae0d3ea733d9cccf5031273fb28e21/src/cmd/ksh93/tests/coprocess.sh#L227-L237

This has shown up on my Mac just once or twice, but it shows up more often on the GitHub CI test runners.

About this issue

State: closed
Created 4 years ago
Comments: 21

Commits related to this issue

tests/coprocess.sh: temp disable known intermittent fail. Export DEBUG_COPROCESS=y to include it in the tests. See: https://github.com/ksh93/ksh/issues/132 — committed to ksh93/ksh by McDutchie 4 years ago
tests/coprocess.sh: activate known intermittent fail as warning https://github.com/ksh93/ksh/issues/132#issuecomment-997432781 — committed to ksh93/ksh by McDutchie 2 years ago
ksh93u+: prevent hanging iffe tests on Solaris 11. This probably applies on Ubuntu, Debian, NetBSD-8 and other systems whose standard interpreter doesn't get along well with the scripts included in the... — committed to NetBSD/pkgsrc-wip by deleted user 2 years ago
Fix race condition in coprocess test with external 'cat'. The race is between '$cat |&' and 'kill $pid'. In between, there are only a variable assignment and two buffered writes, so there is nothing t... — committed to ksh93/ksh by McDutchie 2 years ago

Most upvoted comments

I would not enable it by default, as it would cause the GitHub CI test runners to fail.

Yes, hence my intention to make the failure a warning, i.e., it will not be counted as a failure, but still show up in the output.
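
The difference between a counted failure and a warning can be sketched as follows. This is only an illustration of the idea, not the actual ksh test harness: the function bodies, the counters, and the check_intermittent helper are all hypothetical.

```shell
# Hypothetical helpers: a hard failure is counted, a warning is only reported.
# These names and bodies are assumptions for illustration, not the real harness.
errors=0
warnings=0

err_exit() {
	printf '\tFAIL: %s\n' "$*" >&2
	errors=$((errors + 1))
}

warning() {
	printf '\twarning: %s\n' "$*" >&2
	warnings=$((warnings + 1))
}

# Run a command whose failure is a known intermittent problem:
# it still shows up in the output, but does not add to the error count.
check_intermittent() {
	"$@" || warning "known intermittent failure: $*"
}

check_intermittent false   # reported as a warning
check_intermittent true    # passes silently
```

With this arrangement, a CI run would only fail when the error count is nonzero, while the warning remains visible in the log.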

McDutchie on Dec 19, 2021

So, I recommend adding some sleeps as follows, to accommodate whatever it is on Debian systems that needs more time to deal with the coprocess and/or its pipes.

That would defeat the purpose of the regression test, as it would fail to expose the bug. So we might as well not bother running it at all then.

Given that it’s only reproducible on a few specific systems (which unfortunately include Debian and its derivatives, which are popular), I suspect this is a race condition in the respective kernels, not in ksh.

McDutchie on Oct 4, 2020

On my Ubuntu-based systems (Intel and ARM), inserting a sleep of .01 seconds between commands lets the coprocess script complete without error, even on my slowest box. To validate each case, I ran the following snippet in an interactive session with head instead of cat, varying the number of lines given to head and the placement of the sleeps.

Passing head -n #:

2: All tests pass and the head coprocess terminates by itself. [good]
3: head coprocess has to be terminated manually as it is still waiting on more input [expected, good]
0 with no sleeps: head process failed to terminate [bad, head -n 0 is still running and has not had a chance to exit]
0 and only having a sleep before 1st print: both prints fail [good]
1 with only sleep before 1st print: both prints pass, expected 2nd print to fail [bad]
1 with sleeps before each print: second print fails [good]
2 with sleeps before each print: head process fails to terminate [bad]
2 with sleeps before each print and the jobs command: all pass [good]

head -n 2 |&
pid=$!
sleep .01
print foo >&p 2> /dev/null || echo 'first write of foo to head coprocess failed'
sleep .01
print foo >&p 2> /dev/null || echo 'second write of foo to head coprocess failed'
sleep .01
jobs $pid 2> /dev/null && { echo 'head coprocess failed to terminate'; kill $pid ;}
wait $pid 2> /dev/null

So, I recommend adding some sleeps as follows, to accommodate whatever it is on Debian systems that needs more time to deal with the coprocess and/or its pipes. It may even be something related to job control; I just do not know. The third sleep is not needed, as the wait command gives the system time to catch up.

$cat |&
pid=$!
sleep .01
print foo >&p 2> /dev/null || err_exit "first write of foo to $cat coprocess failed"
sleep .01
print foo >&p 2> /dev/null || err_exit "second write of foo to coprocess failed"
kill $pid
wait $pid 2> /dev/null

hyenias on Oct 3, 2020

~~Some good news.~~ I recently got myself a CentOS 8 virtual machine so I can test Red Hat’s /bin/ksh93 with all their patches. The hang is not reproducible there. Iteration 80000 and counting…
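
A repeat-until-failure loop of the kind implied by the iteration count might look like the sketch below. The function name run_until_failure is hypothetical, and the stand-in command true is only a placeholder for whatever invocation runs the coprocess test on the system in question.

```shell
# run_until_failure MAX CMD [ARGS...]
# Hypothetical helper: runs CMD up to MAX times and stops at the first
# non-zero exit status, reporting the iteration on which it happened.
run_until_failure() {
	max=$1
	shift
	i=0
	while [ "$i" -lt "$max" ]; do
		i=$((i + 1))
		if ! "$@"; then
			echo "failed on iteration $i"
			return 1
		fi
	done
	echo "no failure in $i iterations"
	return 0
}

# Stand-in command; replace with the actual test invocation.
run_until_failure 5 true
```

Because the failure is intermittent, a clean run of this loop only lowers the likelihood that the bug is present; it cannot prove its absence.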

Now that @kdudka has made sure I have access to the relevant non-public Red Hat bugs, I can simply continue my work of porting all those Red Hat patches to 93u+m, devising regression tests for them, etc., and by the time that work is complete, this bug should have gone away.

McDutchie on Sep 26, 2020