tesseract: recent setlocale assertion in baseapi.cpp causes the Python-loaded tesseract library to fail

Ubuntu 16.04, default locale is “en_US.UTF-8”. I invoke the tesseract library via cffi. It now fails with the following error: !strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 192

It worked fine before the locale assertion in baseapi.cpp was introduced in commit 3292484f67af8bdda23aa5e510918d0115785291 on 06/07/18.

Any suggestions for working around this issue? Thx.

A C or C++ program seems to start with the default locale “C”; however, that is not the case for Python, where the default is “en_US.UTF-8”.
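To see which locale the Python process is actually running with before the library is loaded, the active setting can be queried per category. This is only a small sketch: current_locales is a hypothetical helper name, and it relies on locale.setlocale returning the current setting when it is called without a new locale. As far as I know, Python 3 switches only LC_CTYPE to the user's locale at startup and leaves the other categories at "C", which would explain why the workarounds in this thread only touch LC_CTYPE.

```python
import locale


def current_locales():
    """Return the active locale string for a few categories without changing them."""
    categories = {
        "LC_CTYPE": locale.LC_CTYPE,
        "LC_NUMERIC": locale.LC_NUMERIC,
        "LC_ALL": locale.LC_ALL,
    }
    # Calling setlocale() with only a category queries the current setting.
    return {name: locale.setlocale(category) for name, category in categories.items()}


if __name__ == "__main__":
    for name, value in current_locales().items():
        print(name, "=", value)
```

If any of the printed values is something other than "C" at the time the library is initialised, the assertion quoted above is the likely result.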

About this issue

State: closed
Created 6 years ago
Comments: 39 (17 by maintainers)

Commits related to this issue

Updates to latest step tag This includes a fix for segfault on init problem mentioned by these two issues: https://github.com/tesseract-ocr/tesseract/issues/1670 https://github.com/tesseract-ocr/tes... — committed to rhardih/bad by rhardih 5 years ago
Add new default Tesseract OCR backend This new backend uses a command call to avoid Tesseract bug 1670 (https://github.com/tesseract-ocr/tesseract/issues/1670). Signed-off-by: Roberto Rosario <rober... — committed to mayan-edms/Mayan-EDMS by siloraptor 5 years ago
screenshots: remove locale workaround for tesseract It should not be needed since tesseract 4.1, see https://github.com/tesseract-ocr/tesseract/issues/1670 — committed to WeblateOrg/weblate by nijel 8 months ago

Most upvoted comments

This is going to cause huge problems for people who are running Tesseract as a library. Setting locale=“C” will probably cause various unwanted side-effects throughout the application.

Setting/resetting locale for the duration of Tesseract API calls is also problematic in multithreaded applications, for example.

Instead of requiring locale=“C”, I suggest changing Tesseract to use something other than sscanf() so that strings are parsed in a locale-independent way.
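To illustrate why locale-dependent parsing is a problem, here is a rough sketch that compares the C library's strtod (which honours LC_NUMERIC) with Python's built-in float (which does not). It assumes a POSIX system where the C library can be loaded through ctypes; c_strtod and parse_both are hypothetical helper names, and de_DE.UTF-8 is just an example of a locale with a decimal comma that may not be installed.

```python
import ctypes
import ctypes.util
import locale

# Assumption: POSIX system; find_library may return None, in which case
# CDLL(None) still exposes the C library symbols of the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strtod.restype = ctypes.c_double
libc.strtod.argtypes = [ctypes.c_char_p, ctypes.c_void_p]


def c_strtod(text):
    """Parse text with the C library's strtod, which honours LC_NUMERIC."""
    return libc.strtod(text.encode("ascii"), None)


def parse_both(text, numeric_locale):
    """Return (C strtod result, Python float result) under the given LC_NUMERIC."""
    previous = locale.setlocale(locale.LC_NUMERIC)
    locale.setlocale(locale.LC_NUMERIC, numeric_locale)
    try:
        return c_strtod(text), float(text)
    finally:
        # Always restore the numeric locale that was active before.
        locale.setlocale(locale.LC_NUMERIC, previous)


print(parse_both("1.5", "C"))
try:
    print(parse_both("1.5", "de_DE.UTF-8"))
except locale.Error:
    print("de_DE.UTF-8 is not installed; skipping")
```

Under a locale with a decimal comma, strtod is expected to stop at the "." and return a different value than float, which is the kind of silent wrong result a locale-independent parser would avoid.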

laurikari on Jun 18, 2018

Even for C/C++ I usually call

setlocale(LC_CTYPE, "");

as the first thing in main, which sets the locale to the value specified in the environment.

Depending on "C" locale seems quite bad to me.

troplin on Jun 22, 2018

Pull request #2420 replaces strtof and strtod, which removes further dependencies on the locale settings. The critical sscanf calls were already replaced by earlier commits.

I think we can now consider removing the assertion as soon as we have tested that the issues #1250 and #1532 are still fixed.

stweil on May 2, 2019

My current workaround for this looks like this:

import locale
from contextlib import contextmanager

@contextmanager
def c_locale(reset_to="C.UTF-8"):
    # Switch LC_CTYPE to "C" so that the locale assertion in Tesseract passes.
    locale.setlocale(locale.LC_CTYPE, "C")
    try:
        yield
    finally:
        # Switch back even if the OCR code raises.
        locale.setlocale(locale.LC_CTYPE, reset_to)

with c_locale():
    from tesserocr import PyTessBaseAPI
    with PyTessBaseAPI() as api:
        api.Init(lang="deu")
        api.SetImage(box_image)  # box_image is supplied by the caller
        ocr_result = api.GetUTF8Text()
        print(ocr_result)
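A possible variant, in case hard-coding the locale to switch back to is not wanted: remember whatever was active before and restore exactly that. This is only a sketch; temporary_locale is a hypothetical name, and like the workaround above it changes the locale for the whole process, so it does not help with the multithreading concern raised earlier.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_locale(category=locale.LC_CTYPE, name="C"):
    """Switch one locale category to `name` and restore the old value afterwards."""
    previous = locale.setlocale(category)  # query only, nothing changes here
    locale.setlocale(category, name)
    try:
        yield
    finally:
        # Restore the saved setting even if the body raised an exception.
        locale.setlocale(category, previous)


if __name__ == "__main__":
    with temporary_locale():
        print("inside:", locale.setlocale(locale.LC_CTYPE))
    print("after:", locale.setlocale(locale.LC_CTYPE))
```

The tesserocr import and the API calls would go inside the with block, in the same way as in the workaround above.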

ephes on Oct 20, 2018

Here is a (potentially incomplete) list of function calls which have to be replaced to get a Tesseract library which does not depend on the locale: atoi, isspace, strtod, strtof, strtol, sscanf.

2018-10-08: printf, fprintf and the other *printf functions need fixes for the formatting of float and double values.

stweil on Oct 8, 2018

https://github.com/tesseract-ocr/tesseract/wiki#tesseract-4-packages-with-lstm-engine-and-related-traineddata

amitdo on Dec 29, 2019

buster: deb https://notesalexp.org/tesseract-ocr/buster/ buster main
cosmic: deb https://notesalexp.org/tesseract-ocr/cosmic/ cosmic main

Fetch and install the GnuPG key

sudo apt-get update -oAcquire::AllowInsecureRepositories=true
sudo apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true
sudo apt-get update

AlexanderP on Dec 29, 2019

Workaround for Python users:

import locale
locale.setlocale(locale.LC_CTYPE, 'C')  # set locale to C
import tesserocr
locale.setlocale(locale.LC_CTYPE, '')  # set locale back

wd on Dec 28, 2019

There was no recent activity and I think everything was answered, so I am closing it now.

stweil on Dec 5, 2019

@laurikari, I agree. As soon as all *scanf code is replaced by code which does not depend on the locale, the assertions can be removed. For now, we just had to make sure that people don’t get wrong results without any notice.

stweil on Jun 18, 2018