tesseract: recent setlocale change in baseapi.cpp causes Python-loaded Tesseract library to fail

Ubuntu 16.04, default locale is “en_US.UTF-8”. I invoke the Tesseract library via cffi. It now fails with the following error: !strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 192

It worked fine before the locale assertion in baseapi.cpp was introduced in commit 3292484f67af8bdda23aa5e510918d0115785291 on 06/07/18.

Any suggestion to get around this issue? Thx.

A C or C++ program seems to start with the default locale “C”; however, that is not the case for Python, where the default is “en_US.UTF-8”.
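
To check which locale a Python process is actually running under, the current setting can be queried without changing it. The following is a minimal sketch using only the standard `locale` module; the helper name `current_locales` is made up for illustration. On a system like the one described above, LC_CTYPE would be expected to report the environment's locale rather than “C”.

```python
import locale


def current_locales():
    """Return the active LC_CTYPE and LC_NUMERIC settings without modifying them."""
    # Calling setlocale() without a second argument only queries the setting.
    return {
        "LC_CTYPE": locale.setlocale(locale.LC_CTYPE),
        "LC_NUMERIC": locale.setlocale(locale.LC_NUMERIC),
    }


if __name__ == "__main__":
    print(current_locales())
```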

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 39 (17 by maintainers)

Most upvoted comments

This is going to cause huge problems for people who are running Tesseract as a library. Setting locale=“C” will probably cause various unwanted side-effects throughout the application.

Setting/resetting locale for the duration of Tesseract API calls is also problematic in multithreaded applications, for example.

Instead of requiring locale=“C”, I suggest changing Tesseract to use something other than sscanf(), so that strings are parsed in a locale-independent way.

Even for C/C++ I usually call

setlocale(LC_CTYPE, "");

as the first thing in main, which sets the locale to the value specified in the environment.

Depending on "C" locale seems quite bad to me.

Pull request #2420 replaces strtof and strtod, which removes more dependencies on the locale settings. The critical sscanf calls were already replaced by earlier commits.

I think we can now consider removing the assertion as soon as we have tested that the issues #1250 and #1532 are still fixed.

My current workaround for this looks like this:

import locale
from contextlib import contextmanager

@contextmanager
def c_locale(reset_to="C.UTF-8"):
    # Switch LC_CTYPE to "C" for the duration of the block.
    locale.setlocale(locale.LC_CTYPE, "C")
    try:
        yield
    finally:
        # Restore the locale even if the OCR code raises.
        locale.setlocale(locale.LC_CTYPE, reset_to)

with c_locale():
    from tesserocr import PyTessBaseAPI
    with PyTessBaseAPI() as api:
        api.Init(lang="deu")
        api.SetImage(box_image)  # box_image is defined elsewhere
        ocr_result = api.GetUTF8Text()
        print(ocr_result)

Here is a (potentially incomplete) list of function calls which have to be replaced to get a Tesseract library which does not depend on the locale: atoi, isspace, strtod, strtof, strtol, sscanf.

2018-10-08: printf, fprintf and other *printf functions need fixes for the formatting of float and double values.
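
The locale dependence of a function such as strtod can be observed from Python by calling the C library directly. This is only a sketch under some assumptions: it expects a POSIX system where `ctypes` can load the C library, the helper names `c_strtod` and `strtod_under` are made up for illustration, and `de_DE.UTF-8` is just an example of a locale with a comma as decimal separator that may not be installed.

```python
import ctypes
import ctypes.util
import locale

# Load the C library. If find_library() returns None, CDLL(None) falls back to
# the symbols of the running process, which works on typical POSIX systems.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strtod.restype = ctypes.c_double
libc.strtod.argtypes = [ctypes.c_char_p, ctypes.POINTER(ctypes.c_char_p)]


def c_strtod(text):
    """Parse text with the C library's strtod under the current LC_NUMERIC."""
    return libc.strtod(text.encode("ascii"), None)


def strtod_under(locale_name, text):
    """Parse text with LC_NUMERIC temporarily set to locale_name."""
    previous = locale.setlocale(locale.LC_NUMERIC)  # query only
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        return c_strtod(text)
    finally:
        locale.setlocale(locale.LC_NUMERIC, previous)


if __name__ == "__main__":
    print("C:", strtod_under("C", "1.5"))
    try:
        print("de_DE.UTF-8:", strtod_under("de_DE.UTF-8", "1.5"))
    except locale.Error:
        print("de_DE.UTF-8 is not available on this system")
```

Where the second locale is available, strtod would be expected to stop parsing at the dot, which is the kind of silent misreading the list above is about.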

buster: deb https://notesalexp.org/tesseract-ocr/buster/ buster main
cosmic: deb https://notesalexp.org/tesseract-ocr/cosmic/ cosmic main

Fetch and install the GnuPG key

sudo apt-get update -oAcquire::AllowInsecureRepositories=true
sudo apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true
sudo apt-get update

Workaround for Python users

import locale
locale.setlocale(locale.LC_CTYPE, 'C')  # set locale to C
import tesserocr
locale.setlocale(locale.LC_CTYPE, '')  # set locale back
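
A variant of this workaround could remember whatever locale was active and restore exactly that value afterwards, instead of assuming a fixed one. The sketch below is not from the thread: the name `temporary_c_locale` and the module-level lock are illustrative assumptions. Because setlocale() changes state for the whole process, the lock only serializes code that goes through this helper; it does not protect other threads that depend on the locale, which is the multithreading concern raised earlier.

```python
import locale
import threading
from contextlib import contextmanager

# One lock for all users of the helper, since the locale is process-wide.
_locale_lock = threading.Lock()


@contextmanager
def temporary_c_locale(category=locale.LC_CTYPE):
    """Switch `category` to "C" and restore the previous value on exit."""
    with _locale_lock:
        previous = locale.setlocale(category)  # query only, no change
        locale.setlocale(category, "C")
        try:
            yield previous
        finally:
            # Restore the saved value even if the wrapped code raises.
            locale.setlocale(category, previous)
```

It would be used as `with temporary_c_locale(): import tesserocr`, or wrapped around the API calls themselves.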

There was no recent activity and I think everything was answered, so I am closing it now.

@laurikari, I agree. As soon as all *scanf code is replaced by code which does not depend on the locale, the assertions can be removed. We just had to make sure now that people don’t get wrong results without any notice.