ksh: Intermittent coprocess hang on Debian/Ubuntu and Solaris
On roughly 1 in 100 regression test runs, the following failure occurs:
test coprocess begins at 2020-09-11+19:51:40
coprocess.sh[235]: /bin/cat coprocess 2 hung
test coprocess failed at 2020-09-11+19:51:48 with exit code 1 [ 34 tests 1 error ]
The test in question is: https://github.com/ksh93/ksh/blob/9f2066f146ae0d3ea733d9cccf5031273fb28e21/src/cmd/ksh93/tests/coprocess.sh#L227-L237
This has shown up on my Mac only once or twice, but it appears more often on the GitHub CI test runners.
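A failure this rare usually has to be reproduced by running the test in a loop. The following is a minimal sketch in plain POSIX sh, not anything taken from the ksh test suite: `first_failure` is a hypothetical helper, and `flaky` is a stand-in that fails on its third call so the loop has something to find. Substitute the real regression test command for `flaky`.

```shell
#!/bin/sh
# Hypothetical helper (not part of ksh or its test suite):
# run a command up to $1 times and stop at the first failure.
# Prints the number of the failing iteration, or 0 if every run passed.
first_failure() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@" >/dev/null 2>&1; then
            echo "$i"
            return 1
        fi
        i=$((i + 1))
    done
    echo 0
    return 0
}

# Stand-in for the real test: fails on its third invocation.
# The call count lives in a temp file because the loop runs in a subshell.
counter_file=$(mktemp)
echo 0 > "$counter_file"
flaky() {
    n=$(cat "$counter_file")
    n=$((n + 1))
    echo "$n" > "$counter_file"
    [ "$n" -ne 3 ]
}

# '|| :' keeps the script going when a failure is found.
result=$(first_failure 100 flaky) || :
echo "first failing iteration: $result"
rm -f "$counter_file"
```

With a genuinely intermittent test, the iteration count needed to see a failure will vary from run to run and from system to system.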
About this issue
- State: closed
- Created 4 years ago
- Comments: 21
Commits related to this issue
- tests/coprocess.sh: temp disable known intermittent fail Export DEBUG_COPROCESS=y to include it in the tests. See: https://github.com/ksh93/ksh/issues/132 — committed to ksh93/ksh by McDutchie 4 years ago
- tests/coprocess.sh: activate known intermittent fail as warning https://github.com/ksh93/ksh/issues/132#issuecomment-997432781 — committed to ksh93/ksh by McDutchie 2 years ago
- ksh93u+: prevent hanging iffe tests on Solaris 11. This probably applies on Ubuntu, Debian, NetBSD-8 and other systems whose standard interpreter doesn't get along well with the scripts included in the... — committed to NetBSD/pkgsrc-wip by deleted user 2 years ago
- Fix race condition in coprocess test with external 'cat' The race is between '$cat |&' and 'kill $pid'. In between, there are only a variable assignment and two buffered writes, so there is nothing t... — committed to ksh93/ksh by McDutchie 2 years ago
Yes, hence my intention to make the failure a warning, i.e., it will not be counted as a failure, but still show up in the output.
That would defeat the purpose of the regression test, as it would fail to expose the bug. So we might as well not bother running it at all then.
Given that it’s only reproducible on a few specific systems (which unfortunately includes Debian and derivatives, which are popular), I suspect this is a race condition in the respective kernels, not in ksh.
On my Ubuntu-based systems (Intel and ARM), introducing a sleep of 0.01 seconds between commands allows the coprocess script to complete without error, even on my slowest box. In an interactive session, I used head instead of cat in the following code snippet, validating each test by varying the number of lines given to head and the sleep durations.
Passing head -n #:
So I recommend adding some sleeps, as follows, to accommodate Debian systems that need more time to deal with the coprocess and/or its pipes. It may even be related to job control; I just do not know. A third sleep is not needed, as the wait command provides a pause for the system to catch up.
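The snippet referred to above is not reproduced here. As a rough stand-in (not the commenter's code), the start/pause/kill/wait shape can be sketched in portable shell. Assumptions: ksh's `cmd |&` coprocess syntax is not portable, so a background `cat` reading from a FIFO plays the coprocess; the placement of the two sleeps is a guess based on the description; and fractional `sleep` arguments are supported by the local `sleep` (true of GNU and BSD, not guaranteed by POSIX).

```shell
#!/bin/sh
# Sketch only: a background 'cat' on a FIFO stands in for a ksh coprocess.
tmpdir=$(mktemp -d)
fifo=$tmpdir/in
mkfifo "$fifo"

cat "$fifo" >/dev/null &   # stand-in for: $cat |&
pid=$!
sleep 0.01                 # assumed first pause: give the child time to start

exec 3>"$fifo"             # open the write end; blocks until cat opens the read end
echo foo >&3               # stand-in for the writes to the coprocess
sleep 0.01                 # assumed second pause: before killing the child

kill "$pid"
# No third sleep: wait itself blocks until the child is gone.
status=0
wait "$pid" 2>/dev/null || status=$?

exec 3>&-                  # close the write end
rm -rf "$tmpdir"
echo "cat exited with status $status"
```

A status above 128 indicates that `cat` was terminated by the signal rather than exiting on its own; the exact number depends on the shell.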
Some good news. I recently got me a CentOS 8 virtual machine so I can test Red Hat’s /bin/ksh93 with all their patches. And the hang is not reproducible there. Iteration 80000 and counting… Now that @kdudka has made sure I have access to the relevant non-public Red Hat bugs, I can simply continue my work of porting all those Red Hat patches to 93u+m, devising regression tests for them, etc., and by the time that work is complete, this bug should have gone away.