runtime: Inconsistent "Error while reaping child" when running multiple instances of Tesseract concurrently in Ubuntu via Process
I have been wrestling with this issue for a week and a half now. I have a .NET Core 2.1 service (updating to 3.1 did not change the behavior) that runs on an Ubuntu 18.04 VM and calls Tesseract as an external process. This service worked in the past, and I have a working version in production. However, something I have done has apparently caused it to stop working consistently, and I am unable to determine what. The service splits document files and calls Tesseract to OCR the pages individually so that multiple pages can be processed concurrently, then pieces them back together at the end. The service has a configurable thread limit. If I set it to 1, the process takes roughly twice as long but it works. If I set it to 8 (our usual value), I randomly get the error below, which I cannot catch because it is raised from a Process event (the runtime calls Environment.FailFast):
dotnet[16899]: Error while reaping child. errno = 10
dotnet[16899]: at System.Environment.FailFast(System.String, System.Exception)
dotnet[16899]: at System.Environment.FailFast(System.String)
dotnet[16899]: at System.Diagnostics.ProcessWaitState.TryReapChild()
dotnet[16899]: at System.Diagnostics.ProcessWaitState.CheckChildren(Boolean)
dotnet[16899]: at System.Diagnostics.Process.OnSigChild(Boolean)
It could happen on the first or last page of a document. It could happen on a document with 3 blank pages, or a faxed document with 133 pages. As far as I can tell, there is no pattern to this error. I was convinced that this was an error in my code, since it is not happening in production, but the more I test, the more it looks like an issue with the way .NET Core handles processes on Unix-based OSes. I welcome any suggestions for changes to my code or server settings (admittedly, I don’t know much about server configuration), because that would be a lot easier to fix than the alternative.
Here is what the code looks like.
public string AddTextLayerToDocumentFile(string inputFilePath, string outputFilePath, object instanceId, string outputLogPath = null)
{
// pages are formatted and written to individual files so they can be fed to Tesseract ...
// no issues with that part of the code as far as I can tell
var exceptionCounter = 0L;
var exceptionalPageNumber = 0;
Parallel.ForEach(pagePaths, parallelOptions, (pagePath) =>
{
var pageNumber = Convert.ToInt32(Path.GetFileNameWithoutExtension(pagePath).Split("_")[1]);
_pdfProcessor.AddTextLayerToPage(
pagePath,
instanceId,
ref exceptionCounter,
ref exceptionalPageNumber);
});
if (exceptionCounter > 0L)
throw new Exception($"[Thread {instanceId}] Unable to process [{exceptionalPageNumber}] of document [{Path.GetDirectoryName(inputFilePath)}].");
// ... get the list of processed pages and join them back into one, then return the path to that file
}
public bool AddTextLayerToPage(string pagePath, object instanceId, ref long exceptionCounter, ref int exceptionalPageNumber)
{
// ... variable initialization and filename parsing stuff
result = ProcessOcrPage(pagePath, pageCounter, instanceId);
// ... some formatting of the post-OCR'd page
}
private bool ProcessOcrPage(string inputPageImagePath, int pageNumber, object instanceId)
{
StringBuilder output = new StringBuilder();
StringBuilder error = new StringBuilder();
int exitCode;
string outputPageFilePathWithoutExt = Path.Combine(_fileOps.GetThreadOutputDirectory(instanceId),
$"pg_{pageNumber.ToString().PadLeft(3, '0')}");
var cmdArgs = $"\"{inputPageImagePath}\" \"{outputPageFilePathWithoutExt}\" --oem 1 -l eng pdf";
_logger.LogStatement($"[Thread {instanceId}.{pageNumber}] Executing the following command:{Environment.NewLine}tesseract {cmdArgs}", LogLevel.Debug);
try
{
using (var process = new Process())
{
process.StartInfo = new ProcessStartInfo
{
WindowStyle = ProcessWindowStyle.Hidden,
FileName = "tesseract",
RedirectStandardOutput = true,
RedirectStandardError = true,
UseShellExecute = false,
CreateNoWindow = true,
Arguments = cmdArgs
};
// Use the indexer rather than Add: Add throws if the variable is already set in the inherited environment.
process.StartInfo.EnvironmentVariables["TESSDATA_PREFIX"] = _config.TesseractTrainingDataPath;
if (_config.TesseractThreadLimit > 0)
process.StartInfo.EnvironmentVariables["OMP_THREAD_LIMIT"] = _config.TesseractThreadLimit.ToString();
process.OutputDataReceived += (sender, e) =>
{
if (e.Data != null)
output.AppendLine(e.Data);
};
process.ErrorDataReceived += (sender, e) =>
{
if (e.Data != null)
error.AppendLine(e.Data);
};
process.Start();
process.BeginOutputReadLine();
process.BeginErrorReadLine();
process.WaitForExit();
exitCode = process.ExitCode;
_logger.LogStatement($"[Thread {instanceId}.{pageNumber}] Tesseract exited with code [{exitCode}]", LogLevel.Trace, nameof(ProcessOcrPage));
var standardOut = output.ToString();
var standardErr = error.ToString();
if (!string.IsNullOrEmpty(standardOut))
_logger.LogStatement($"[Thread {instanceId}.{pageNumber}] Tesseract stdOut:\n{standardOut}", LogLevel.Debug, nameof(ProcessOcrPage));
if (!string.IsNullOrEmpty(standardErr))
_logger.LogStatement($"[Thread {instanceId}.{pageNumber}] Tesseract stdErr:\n{standardErr}", LogLevel.Debug, nameof(ProcessOcrPage));
process.Close();
return exitCode == 0;
}
}
catch (Exception e)
{
_logger.LogException(e, instanceId);
return false;
}
}
}
About this issue
- State: closed
- Created 4 years ago
- Reactions: 2
- Comments: 19 (6 by maintainers)
@tmds the above helped us get unblocked also (we were also using waitpid to reap processes started by .NET). Thanks!
@tmds thanks again, the input not only helped us solve the real root cause for our case but also ended as a PR to another OSS project: https://github.com/WiringPi/WiringPi/pull/101.
Yes, the native library reaps a child process that was started by .NET, and .NET doesn’t like that.
The native library can wait for the specific child it started using `waitpid` (instead of `wait`). .NET won’t reap the child processes started by the native library.
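The suggestion above (wait for a specific child with `waitpid` rather than calling `wait`) can be sketched in C roughly as follows. This is a minimal illustration, not code from any of the libraries mentioned in this thread: `spawn_child`, `reap_own_child` and `demo_targeted_reap` are hypothetical names, and the two forked children merely stand in for "a process started by .NET" and "a process started by the native library".

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits at once with `code`.
 * Returns the child's pid in the parent, or -1 if fork failed. */
pid_t spawn_child(int code)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);            /* child: terminate immediately */
    return pid;
}

/* Reap exactly one child, identified by pid, retrying on EINTR.
 * Returns its exit code, or -1 if it could not be reaped or did
 * not exit normally. Other children are left untouched. */
int reap_own_child(pid_t pid)
{
    int status;
    pid_t r;
    do {
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);
    if (r != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Hypothetical demo: two children with different "owners". */
int demo_targeted_reap(void)
{
    pid_t runtime_child = spawn_child(7);  /* stands in for a .NET-started process */
    pid_t library_child = spawn_child(3);  /* stands in for the library's own child */
    if (runtime_child < 0 || library_child < 0)
        return -1;

    /* The library reaps only its own child. A bare wait(NULL) here
     * could have consumed runtime_child instead. */
    if (reap_own_child(library_child) != 3)
        return -1;

    /* The other child is still there for its owner to reap. */
    return reap_own_child(runtime_child);
}
```

Calling `demo_targeted_reap()` from a test `main` should return 7, the exit code of the child that the "library" left alone. Replacing the first `reap_own_child` call with `wait(NULL)` would make the outcome depend on which child the kernel hands back first, which matches the intermittent nature of the failure reported here.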
@freddyrios can you trace your program as follows, until it terminates with the `Error while reaping child. errno = 10` message?
If we look at something simple like:
The output of strace is this:
You can see the child process being started (`execve`) with the pid 48698. When it exits there is a `SIGCHLD`, and .NET goes looking for any child that terminated using `waitid`. When a child is found (`waitid` returns `0`, child pid in `si_pid`), then it reaps the process using `wait4`. The final call to `waitid` is .NET checking if there are other children that terminated. There are none (return value is `ECHILD`).
The error message `Error while reaping child. errno = 10` probably means the `wait4` call returns `ECHILD`. We should see that in your trace.
@amsoedal if it helps, you can check the linked PR above for how I worked around the issue in our case. Specifically changes to DULutil.
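To make the `waitid` / `wait4` explanation above concrete, here is a small C sketch of the same two-step pattern and of how it fails when something else reaps the child in between. It is an illustration under assumptions, not the .NET runtime's actual code: `demo_double_reap` is a hypothetical name, the first step blocks (instead of reacting to `SIGCHLD`) to keep the demo deterministic, and `waitpid` stands in for `wait4`.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo. Returns the errno produced by the owner's reap
 * attempt after a third party has already reaped the child, or -1
 * if the demo itself did not run as expected. */
int demo_double_reap(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);               /* child: terminate immediately */

    /* Step 1: find a terminated child WITHOUT reaping it (WNOWAIT).
     * On success waitid returns 0 and the pid is in si_pid. */
    siginfo_t info;
    memset(&info, 0, sizeof info);
    if (waitid(P_ALL, 0, &info, WEXITED | WNOWAIT) == -1)
        return -1;
    if (info.si_pid != pid)
        return -1;

    /* Interference: other code in the same process (for example a
     * native library calling wait) reaps the child first. */
    if (wait(NULL) != pid)
        return -1;

    /* Step 2: the owner now tries to reap the pid it found in step 1.
     * The zombie is gone, so the call fails with ECHILD. */
    errno = 0;
    if (waitpid(pid, NULL, 0) != -1)
        return -1;
    return errno;
}
```

On Linux `ECHILD` has the value 10, so `demo_double_reap()` returning `ECHILD` corresponds to the `errno = 10` in the error message.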
@36PopTarts did you find any workarounds for this?
I ran into something similar when running unrelated simple commands from different threads on a Pi. I will likely run a single child process at a time and move on, but it would be great not to have such limitations.
I also think this looks like a bug in how processes are handled, but one that might be caused by limitations down the stack. Here is a python bug where they talk a lot about an issue that looks very similar: https://bugs.python.org/issue1731717