tesseract: recent setlocale change in baseapi.cpp causes Python-loaded tesseract library to fail
On Ubuntu 16.04 the default locale is “en_US.UTF-8”. I invoke the tesseract library via cffi, and it now fails with the following
error:
!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 192
It worked fine before the baseapi.cpp locale assertion was introduced in commit 3292484f67af8bdda23aa5e510918d0115785291 on 06/07/18.
Any suggestion to get around this issue? Thx.
A C or C++ program seems to start with the default locale “C”; however, that is not the case for Python, where the default is “en_US.UTF-8”.
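One way to see what the assertion observes from Python is to query the process locale without changing it. The sketch below uses only the standard locale module; the helper name current_locales is hypothetical. Python 3 typically sets LC_CTYPE from the environment at startup, so the LC_ALL query may return a composite string rather than plain "C".

```python
import locale


def current_locales():
    """Return the current setting of a few locale categories without changing them."""
    names = ["LC_ALL", "LC_CTYPE", "LC_NUMERIC"]
    # Calling setlocale() with only a category queries the current value.
    return {name: locale.setlocale(getattr(locale, name)) for name in names}


if __name__ == "__main__":
    for name, value in current_locales().items():
        print(name, "=", value)
```

If the LC_ALL value printed here is anything other than "C", the strcmp check quoted in the error message would be expected to fail.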
About this issue
- State: closed
- Created 6 years ago
- Comments: 39 (17 by maintainers)
Commits related to this issue
- Updates to latest step tag This includes a fix for segfault on init problem mentioned by these two issues: https://github.com/tesseract-ocr/tesseract/issues/1670 https://github.com/tesseract-ocr/tes... — committed to rhardih/bad by rhardih 5 years ago
- Add new default Tesseract OCR backend This new backend uses a command call to avoid Tesseract bug 1670 (https://github.com/tesseract-ocr/tesseract/issues/1670). Signed-off-by: Roberto Rosario <rober... — committed to mayan-edms/Mayan-EDMS by siloraptor 5 years ago
- screenshots: remove locale workaround for tesseract It should not be needed since tesseract 4.1, see https://github.com/tesseract-ocr/tesseract/issues/1670 — committed to WeblateOrg/weblate by nijel 8 months ago
This is going to cause huge problems for people who are running Tesseract as a library. Setting locale=“C” will probably cause various unwanted side-effects throughout the application.
Setting/resetting locale for the duration of Tesseract API calls is also problematic in multithreaded applications, for example.
Instead of requiring locale=“C”, I suggest changing Tesseract to use something other than sscanf() so that strings are parsed in a locale-independent way.
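To illustrate why the scanf-style functions are the sticking point: the C library's number parsing follows LC_NUMERIC. The sketch below calls libc's strtod through ctypes while a given locale is active. The helper name strtod_under and the locale name de_DE.UTF-8 are assumptions for illustration only; if that locale is not installed, the helper returns None.

```python
import ctypes
import ctypes.util
import locale


def strtod_under(locale_name, text):
    """Parse bytes `text` with the C library's strtod while LC_NUMERIC is `locale_name`.

    Returns None if the requested locale is not available on this system.
    """
    # find_library may return None; CDLL(None) then falls back to the
    # symbols already loaded into the process (works on Linux).
    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    libc.strtod.restype = ctypes.c_double
    libc.strtod.argtypes = [ctypes.c_char_p, ctypes.c_void_p]

    saved = locale.setlocale(locale.LC_NUMERIC)  # query only
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
    except locale.Error:
        return None
    try:
        return libc.strtod(text, None)
    finally:
        # Always restore the previous numeric locale.
        locale.setlocale(locale.LC_NUMERIC, saved)


if __name__ == "__main__":
    print("C:", strtod_under("C", b"1.5"))
    print("de_DE.UTF-8:", strtod_under("de_DE.UTF-8", b"1.5"))
```

On a glibc system with de_DE.UTF-8 installed, one would expect the second parse to stop at the "." because that locale uses "," as the decimal separator. That kind of silent misparse is the wrong result without any notice that the assertion was added to guard against.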
Even for C/C++ I usually call
as the first thing in main, which sets the locale to the value specified in the environment.
Depending on the "C" locale seems quite bad to me.
Pull request #2420 replaces strtof and strtod, which fixes more dependencies on the locale settings. The critical sscanf calls were already replaced by earlier commits.
I think we can now consider removing the assertion as soon as we have tested that the issues #1250 and #1532 are still fixed.
My current workaround for this looks like this:
Here is a (potentially incomplete) list of function calls which have to be replaced to get a Tesseract library which does not depend on the locale:
atoi, isspace, strtod, strtof, strtol, sscanf.
2018-10-08: printf, fprintf and other *printf need fixes for formatting of float and double values.
https://github.com/tesseract-ocr/tesseract/wiki#tesseract-4-packages-with-lstm-engine-and-related-traineddata
buster:
deb https://notesalexp.org/tesseract-ocr/buster/ buster main
cosmic:
deb https://notesalexp.org/tesseract-ocr/cosmic/ cosmic main
Fetch and install the GnuPG key
Workaround for python users
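A minimal sketch of one such workaround, assuming the goal is only to satisfy the "C" locale assertion while Tesseract is being initialised and called. The context manager name c_locale is hypothetical and not part of any Tesseract binding. As pointed out earlier in the thread, switching the locale affects the whole process and is not safe if other threads depend on it.

```python
import contextlib
import locale


@contextlib.contextmanager
def c_locale():
    """Temporarily switch the whole process to the "C" locale."""
    saved = locale.setlocale(locale.LC_ALL)  # query only, no change
    locale.setlocale(locale.LC_ALL, "C")
    try:
        yield
    finally:
        # Restore the previous setting even if the wrapped call raised.
        locale.setlocale(locale.LC_ALL, saved)


if __name__ == "__main__":
    with c_locale():
        # Calls into the Tesseract library (via cffi or ctypes) would go here.
        print("inside:", locale.setlocale(locale.LC_ALL))
    print("after:", locale.setlocale(locale.LC_ALL))
```

Wrapping only the Tesseract calls keeps the rest of the application on its usual locale, at the cost of a process-wide switch for the duration of each call.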
There was no recent activity and I think everything was answered, so I am closing it now.
@laurikari, I agree. As soon as all *scanf code is replaced by code which does not depend on the locale, the assertions can be removed. We just had to make sure now that people don’t get wrong results without any notice.