tesseract: Segmentation fault when initializing with null language

Basic Information

tesseract 5.2.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.3) : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.12 : libwebp 1.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1

Operating System

No response

Other Operating System

Fedora Linux 37

But this was originally reported to me from a user on a Mac M1 (presumably macOS 13 Ventura).

uname -a

Linux fedora-desktop 6.1.14-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Feb 26 00:13:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Compiler

gcc version 12.2.1 20221121 (Red Hat 12.2.1-4) (GCC)

Virtualization / Containers

No response

CPU

13th Gen Intel® Core™ i7-13700K

Current Behavior

When TessBaseAPIInit3(cube, NULL, NULL) is called, the language is not set to a sensible default. The subsequent call to TessBaseAPIRecognize then crashes with a segmentation fault.

Expected Behavior

Given that the documentation says:

The language is (usually) an ISO 639-3 string or nullptr will default to eng.

I would expect NULL to behave the same way as "eng", rather than cause a segmentation fault at the Recognize step.

Suggested Fix

Null pointer defaults to “eng”.
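Until that is settled, a caller can apply the default itself before calling Init. The following is a minimal sketch of such a caller-side workaround, not a patch to Tesseract; lang_or_default is a hypothetical helper name and is not part of the Tesseract C API.

```c
#include <stddef.h>

/* Hypothetical caller-side helper (not part of the Tesseract C API).
 * Returns "eng" when no language was supplied, otherwise returns the
 * caller's string unchanged. */
const char *lang_or_default(const char *lang) {
    if (lang == NULL) {
        return "eng";
    }
    return lang;
}
```

With this helper, the call in a program like the test case below would become TessBaseAPIInit3(cube, NULL, lang_or_default(lang)), where lang is whatever language string the caller received. This keeps the decision in the caller and does not change the library.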

Other Information

Test case program

#include <stdio.h>
#include <tesseract/capi.h>
#include <leptonica/allheaders.h>

int main(int argc, char *argv[]) {
    TessBaseAPI *cube = TessBaseAPICreate();
    TessBaseAPIInit3(cube, NULL, NULL); // change this 2nd `NULL` to "eng" for success

    PIX *image = pixRead("img.png");
    TessBaseAPISetImage2(cube, image);
    TessBaseAPIRecognize(cube, NULL); // segmentation fault happens here when the language is NULL
    char *text = TessBaseAPIGetUTF8Text(cube);
    printf("%s\n", text);
    TessDeleteText(text);
    pixDestroy(&image); // frees the whole PIX; pixFreeData() only releases the pixel buffer
    TessBaseAPIDelete(cube);
    return 0;
}

Build and run with: gcc $(pkg-config --cflags --libs tesseract) $(pkg-config --cflags --libs lept) test.c && ./a.out

This was originally reported against a Rust wrapper: https://github.com/antimatter15/tesseract-rs/issues/34

About this issue

  • State: open
  • Created a year ago
  • Comments: 17 (11 by maintainers)

Most upvoted comments

CMake and Autotools should behave similarly, otherwise you make supporting the software more difficult.

Currently they use different templates for lept.pc which results in different compiler flags for the include path. See template for CMake and template for Autotools.

The lept.pc template for CMake should be fixed to fit the template for Autotools.

I have always used variant 1, both in the library and for the 300 or so programs in the prog/ directory.

Never considered variant 2, which wouldn’t work with any of my code because I’m using specific local builds (not installed software) when developing and testing.

#include <leptonica/allheaders.h>

The right form is #include <allheaders.h>.
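For code that has to build against either lept.pc layout, one possible workaround (not something proposed in this thread) is to probe for the header at compile time. This is a minimal sketch, assuming a compiler that provides __has_include (recent GCC and Clang do, and it is part of C23). The LEPT_HEADER_* macros and lept_header_form() are hypothetical names used only for illustration, and their numeric values are unrelated to the "variant 1" and "variant 2" mentioned above.

```c
/* Hypothetical markers recording which include form was found. */
#define LEPT_HEADER_NONE     0  /* header not found, or __has_include unavailable */
#define LEPT_HEADER_PLAIN    1  /* <allheaders.h> */
#define LEPT_HEADER_PREFIXED 2  /* <leptonica/allheaders.h> */

/* Try the plain form first, then fall back to the prefixed form. */
#if defined(__has_include)
#  if __has_include(<allheaders.h>)
#    include <allheaders.h>
#    define LEPT_HEADER_FORM LEPT_HEADER_PLAIN
#  elif __has_include(<leptonica/allheaders.h>)
#    include <leptonica/allheaders.h>
#    define LEPT_HEADER_FORM LEPT_HEADER_PREFIXED
#  endif
#endif

#ifndef LEPT_HEADER_FORM
#  define LEPT_HEADER_FORM LEPT_HEADER_NONE
#endif

/* Reports which include form the preprocessor selected. */
int lept_header_form(void) {
    return LEPT_HEADER_FORM;
}
```

Which branch is taken depends on the -I flags that pkg-config emits, so the same source compiles with either template. A real program would probably turn the LEPT_HEADER_NONE case into an #error instead of continuing without the header.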

The API was changed by my commit f5d22d0bc (“Don’t set a default language in TessBaseAPI::Init”). The reason for that commit was that Tesseract required (and loaded) eng.traineddata even for tasks which did not require a model file.

So the documentation should be updated, and of course the code should not crash.