tesseract: Tesseract seemingly stuck

Environment

  • Tesseract Version: Latest master
  • Commit Number: 23ed59bd7bca777e4e104c4ee540843373aa9869
  • Platform: Linux gentoo-x13 5.11.7-gentoo-dist #1 SMP Wed Mar 17 21:03:41 -00 2021 x86_64 AMD Ryzen 7 PRO 4750U with Radeon Graphics AuthenticAMD GNU/Linux

Current Behavior:

Tesseract hangs and seemingly never finishes

Expected Behavior:

Tesseract doesn’t hang and produces output normally

GDB backtrace (interrupted after more than 5 minutes):

merlijn@gentoo-x13 ~/archive/tesseract-src/tesseract $ time TESSDATA_PREFIX=/usr/share/tessdata LD_LIBRARY_PATH=`pwd` LD_LIBRARY_PATH=$LD_LIBARY_PATH:`pwd`/.libs gdb --args ./.libs/tesseract /tmp/sim_new-york-times_1900-01-11_49_15-603_0008.ppm - hocr
GNU gdb (Gentoo 10.1 vanilla) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./.libs/tesseract...
(gdb) r
Starting program: /home/merlijn/archive/tesseract-src/tesseract/.libs/tesseract /tmp/sim_new-york-times_1900-01-11_49_15-603_0008.ppm - hocr
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 5.0.0-alpha-20201231-545-g23ed5' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
Estimating resolution as 246
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7ec49b6 in tesseract::CLIST_ITERATOR::forward (this=this@entry=0x7fffffffbec0)
    at src/ccutil/clst.cpp:265
265	  return current->data;
(gdb) bt
#0  0x00007ffff7ec49b6 in tesseract::CLIST_ITERATOR::forward (this=this@entry=0x7fffffffbec0)
    at src/ccutil/clst.cpp:265
#1  0x00007ffff7ec4b8d in tesseract::CLIST::add_sorted (this=<optimized out>,
    comparator=comparator@entry=
    0x7ffff7d86b90 <tesseract::SortByBoxLeft<tesseract::ColPartition>(void const*, void const*)>,
    unique=unique@entry=true, new_data=<optimized out>, new_data@entry=0x5555bf4f66b0)
    at src/ccutil/clst.cpp:176
#2  0x00007ffff7e2cdf7 in tesseract::BBGrid<tesseract::ColPartition, tesseract::ColPartition_CLIST, tesseract::ColPartition_C_IT>::InsertBBox (this=this@entry=0x55555b031130, h_spread=h_spread@entry=true,
    v_spread=v_spread@entry=true, bbox=0x5555bf4f66b0) at src/textord/bbgrid.h:551
#3  0x00007ffff7e3f664 in tesseract::ColPartitionGrid::ComputeTotalOverlap (
    this=this@entry=0x5555555aec68, overlap_grid=overlap_grid@entry=0x7fffffffc158)
    at src/textord/colpartitiongrid.cpp:329
#4  0x00007ffff7e71620 in tesseract::StrokeWidth::DetectAndRemoveNoise (this=0x55555558c420,
    pre_overlap=95268, grid_box=..., block=0x55555558bf20, part_grid=0x5555555aec68,
    diacritic_blobs=0x7fffffffc688) at src/textord/strokewidth.cpp:1350
#5  0x00007ffff7e729da in tesseract::StrokeWidth::FindInitialPartitions (
    this=this@entry=0x55555558c420, pageseg_mode=pageseg_mode@entry=tesseract::PSM_AUTO,
    rerotation=..., find_problems=find_problems@entry=true, block=block@entry=0x55555558bf20,
    diacritic_blobs=diacritic_blobs@entry=0x7fffffffc688, part_grid=0x5555555aec68,
    big_parts=0x5555555aec98, skew_angle=0x7fffffffc340) at src/textord/strokewidth.cpp:1310
#6  0x00007ffff7e72c08 in tesseract::StrokeWidth::GradeBlobsIntoPartitions (this=0x55555558c420,
    pageseg_mode=pageseg_mode@entry=tesseract::PSM_AUTO, rerotation=...,
    block=block@entry=0x55555558bf20, nontext_pix=..., denorm=<optimized out>, cjk_script=false,
    projection=0x5555555aecc0, diacritic_blobs=0x7fffffffc688, part_grid=0x5555555aec68,
    big_parts=0x5555555aec98) at src/textord/strokewidth.cpp:379
#7  0x00007ffff7e2be71 in tesseract::ColumnFinder::FindBlocks (this=this@entry=0x5555555aeb30,
    pageseg_mode=pageseg_mode@entry=tesseract::PSM_AUTO, scaled_color=...,
    scaled_factor=<optimized out>, input_block=input_block@entry=0x55555558bf20, photo_mask_pix=...,
    thresholds_pix=..., grey_pix=..., pixa_debug=0x7ffff7c6abd0, blocks=0x7fffffffc5e8,
    diacritic_blobs=0x7fffffffc688, to_blocks=0x7fffffffc690) at src/textord/colfind.cpp:296
#8  0x00007ffff7d5509b in tesseract::Tesseract::AutoPageSeg (this=0x7ffff7c47010,
    pageseg_mode=tesseract::PSM_AUTO, blocks=0x5555555b0c90, to_blocks=0x7fffffffc690,
    diacritic_blobs=0x7fffffffc688, osd_tess=<optimized out>, osr=0x7fffffffca40)
    at src/ccmain/pagesegmain.cpp:226
#9  0x00007ffff7d5555d in tesseract::Tesseract::SegmentPage (this=0x7ffff7c47010,
    input_file=<optimized out>, blocks=0x5555555b0c90, osd_tess=osd_tess@entry=0x0,
    osr=osr@entry=0x7fffffffca40) at src/ccmain/pagesegmain.cpp:140
#10 0x00007ffff7d227bf in tesseract::TessBaseAPI::FindLines (this=0x7fffffffd780)
    at /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9/bits/basic_string.h:2300
#11 0x00007ffff7d24f64 in tesseract::TessBaseAPI::Recognize (this=0x7fffffffd780, monitor=0x0)
    at src/api/baseapi.cpp:838
#12 0x00007ffff7d2552a in tesseract::TessBaseAPI::ProcessPage (this=this@entry=0x7fffffffd780,
    pix=0x5555555b1c50, page_index=page_index@entry=0,
    filename=filename@entry=0x7fffffffdf54 "/tmp/sim_new-york-times_1900-01-11_49_15-603_0008.ppm",
    retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=
    0x5555555a2810) at src/api/baseapi.cpp:1259
#13 0x00007ffff7d26172 in tesseract::TessBaseAPI::ProcessPagesInternal (this=0x7fffffffd780,
    filename=<optimized out>, retry_config=0x0, timeout_millisec=0, renderer=0x5555555a2810)
    at src/api/baseapi.cpp:1218
#14 0x00007ffff7d2673f in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x7fffffffd780,
    filename=filename@entry=0x7fffffffdf54 "/tmp/sim_new-york-times_1900-01-11_49_15-603_0008.ppm",
    retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0,
    renderer=<optimized out>) at src/api/baseapi.cpp:1071
#15 0x0000555555558295 in main (argc=<optimized out>, argv=<optimized out>)
    at src/api/tesseractmain.cpp:783
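
Frames #0 to #3 show the interrupt landing inside a sorted insertion into a linked list (CLIST::add_sorted, reached from BBGrid::InsertBBox). One possible reading, which is an assumption and not a confirmed diagnosis, is that the run is very slow rather than deadlocked: inserting items one at a time into a sorted linked list walks the list on every insert, so the total work can grow roughly quadratically with the number of items. The Python sketch below is a toy model of that cost, not Tesseract's code; the function name sorted_insert_steps and the input sizes are made up for illustration.

```python
import random


def sorted_insert_steps(values):
    """Insert values one at a time into a sorted sequence by linear scan,
    the way a sorted linked list would, and count the scan steps taken."""
    items = []
    steps = 0
    for v in values:
        pos = 0
        # Walk forward until the first element that is not smaller than v.
        while pos < len(items) and items[pos] < v:
            pos += 1
            steps += 1
        items.insert(pos, v)
    return items, steps


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs are comparable
    for n in (500, 1000, 2000):  # hypothetical sizes, chosen to run quickly
        data = [rng.random() for _ in range(n)]
        _, steps = sorted_insert_steps(data)
        print(n, steps)
```

For random input, doubling n should roughly quadruple the step count. If the real grid lists behave similarly on a very large page, a run could look stuck while still making progress.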

Image: https://archive.org/~merlijn/tesseract-images/sim_new-york-times_1900-01-11_49_15-603_0008.ppm
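
To tell "slow" from "never finishes" when reproducing, it can help to run the command under a hard time limit instead of interrupting by hand. The sketch below uses only the Python standard library; run_with_timeout is a hypothetical helper, and the commented usage assumes a tesseract binary on PATH (the report ran ./.libs/tesseract from the build tree with TESSDATA_PREFIX set).

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, returncode).

    finished is False when the process was still running after
    timeout_s seconds; subprocess.run kills it in that case.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, None
    return True, proc.returncode


# Example usage (assumes `tesseract` is on PATH and the image is in /tmp;
# the 300 s limit mirrors the "more than 5 minutes" wait in the report):
#
#   image = "/tmp/sim_new-york-times_1900-01-11_49_15-603_0008.ppm"
#   finished, code = run_with_timeout(["tesseract", image, "-", "hocr"], 300)
#   print("finished" if finished else "timed out", code)
```

Raising the limit step by step shows whether the run eventually completes, which is what the later comments report for some binarization settings.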

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 28 (21 by maintainers)

Most upvoted comments

With the code from #3418, processing finishes after 4:30 minutes when Sauvola binarization is used, and the output looks good.

Note that the image size is equivalent to 7 A4 pages, so the processing time is roughly 38 seconds per page.
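
The per-page figure follows from the numbers in the comment above: 4:30 minutes is 270 seconds, spread over 7 A4-equivalent pages. A small check of that arithmetic (seconds_per_page is a name made up for this sketch):

```python
def seconds_per_page(minutes, seconds, pages):
    """Average wall-clock seconds per page for a run lasting minutes:seconds."""
    total_s = minutes * 60 + seconds
    return total_s / pages


per_page = seconds_per_page(4, 30, 7)  # 270 s over 7 A4-equivalent pages
print(f"{per_page:.1f} s per page")    # prints "38.6 s per page"
```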

With adaptive Otsu I get ‘Empty page!’ after 36 seconds.