tesseract: unicharset_extractor segfault

Current Behavior

I built the current main with debug symbols (configure --disable-openmp --enable-debug --disable-shared CXXFLAGS="-g -O0 -fsanitize=address,undefined -fstack-protector-strong -ftrapv"). Running tesstrain then segfaults immediately at the unicharset_extractor step (all-gt is 313k, norm_mode=2, nothing unusual):

    #0 0x7faa6352b17e in std::filesystem::__cxx11::path::compare(std::filesystem::__cxx11::path const&) const (/lib/x86_64-linux-gnu/libstdc++.so.6+0x19017e)
    #1 0x562c491ddc50 in std::filesystem::__cxx11::operator==(std::filesystem::__cxx11::path const&, std::filesystem::__cxx11::path const&) (/data/ocr-d/ocrd_all/venv38/bin/unicharset_extractor+0x2556c50)
    #2 0x562c491dc60d in Main /data/ocr-d/ocrd_all/tesseract/src/training/unicharset_extractor.cpp:74
    #3 0x562c491dd09d in main /data/ocr-d/ocrd_all/tesseract/src/training/unicharset_extractor.cpp:120
    #4 0x7faa625df6c9  (/lib/x86_64-linux-gnu/libc.so.6+0x276c9)
    #5 0x7faa625df784 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x27784)
    #6 0x562c491db8c0 in _start (/data/ocr-d/ocrd_all/venv38/bin/unicharset_extractor+0x25548c0)

I compiled with g++ 8.3.0.

Judging by the stack trace, there is some incompatibility with the C++ std::filesystem path library here…
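To help narrow this down, here is a minimal sketch that exercises the same operation as frames #0 and #1 of the trace (a path equality test that ends up in path::compare) without any tesseract code. It is not taken from unicharset_extractor.cpp; the helper names same_path and needs_stdcxxfs are hypothetical and exist only for illustration. It rests on an unconfirmed assumption: that the crash is related to how std::filesystem is provided by the toolchain. With GCC older than 9, the std::filesystem implementation lives in a separate static archive and needs -lstdc++fs at link time, whereas from GCC 9 on it is part of libstdc++ itself.

```cpp
// Minimal sketch, not taken from the tesseract sources.
#include <filesystem>
#include <string>

// Hypothetical helper: compares two paths the same way the failing
// frames do. In the trace above, operator== calls path::compare().
bool same_path(const std::string &a, const std::string &b) {
  return std::filesystem::path(a) == std::filesystem::path(b);
}

// Hypothetical helper: true when this translation unit is built with a
// GCC older than 9, where std::filesystem requires linking -lstdc++fs.
// Clang also defines __GNUC__, so it is excluded explicitly.
constexpr bool needs_stdcxxfs() {
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ < 9
  return true;
#else
  return false;
#endif
}
```

Calling same_path on two identical strings from a small main, built with the same compiler and flags as above (for example g++ -std=c++17 repro.cpp -lstdc++fs on GCC 8), would show whether the path comparison alone already crashes. If it does, the problem is likely in the toolchain setup rather than in tesseract.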

Expected Behavior

unicharset_extractor should exit normally and produce output.

Suggested Fix

No response

tesseract -v

tesseract 5.3.4
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX512BW
 Found AVX512F
 Found AVX512VNNI
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.3.3 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6 liblz4/1.8.3 libzstd/1.3.8
 Found libcurl/7.64.0 NSS/3.42.1 zlib/1.2.11 libidn2/2.0.5 libpsl/0.20.2 (+libidn2/2.0.5) libssh2/1.11.0 nghttp2/1.59.0 librtmp/2.3

Operating System

Debian 11 Bullseye

Other Operating System

No response

uname -a

GNU/Linux x86_64

Compiler

g++ 8.3.0

CPU

Intel Xeon Gold

Virtualization / Containers

VMWare

Other Information

No response

About this issue

  • State: closed
  • Created 4 months ago
  • Comments: 31 (20 by maintainers)

Most upvoted comments

Personally, I don’t think we should care about GCC 8 anymore.

The Linux distros that have GCC 8.x as their default compiler:

  • Debian 10 ‘buster’ (oldoldstable). Note that the Debian project no longer supports buster; a third-party organization provides extended security support for it.
  • RHEL 8 (and its clones). GCC 13 is also available (gcc-toolset in AppStream).