tesseract: Tesseract seemingly stuck
Environment
- Tesseract Version: Latest master
- Commit Number: 23ed59bd7bca777e4e104c4ee540843373aa9869
- Platform: Linux gentoo-x13 5.11.7-gentoo-dist #1 SMP Wed Mar 17 21:03:41 -00 2021 x86_64 AMD Ryzen 7 PRO 4750U with Radeon Graphics AuthenticAMD GNU/Linux
Current Behavior:
Tesseract hangs and seemingly never finishes.
Expected Behavior:
Tesseract doesn’t hang and produces output normally
GDB backtrace (interrupted after more than 5 minutes):
merlijn@gentoo-x13 ~/archive/tesseract-src/tesseract $ time TESSDATA_PREFIX=/usr/share/tessdata LD_LIBRARY_PATH=`pwd` LD_LIBRARY_PATH=$LD_LIBARY_PATH:`pwd`/.libs gdb --args ./.libs/tesseract /tmp/sim_new-york-times_1900-01-11_49_15-603_0008.ppm - hocr
GNU gdb (Gentoo 10.1 vanilla) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./.libs/tesseract...
(gdb) r
Starting program: /home/merlijn/archive/tesseract-src/tesseract/.libs/tesseract /tmp/sim_new-york-times_1900-01-11_49_15-603_0008.ppm - hocr
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name='ocr-system' content='tesseract 5.0.0-alpha-20201231-545-g23ed5' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
Estimating resolution as 246
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7ec49b6 in tesseract::CLIST_ITERATOR::forward (this=this@entry=0x7fffffffbec0)
at src/ccutil/clst.cpp:265
265 return current->data;
(gdb) bt
#0 0x00007ffff7ec49b6 in tesseract::CLIST_ITERATOR::forward (this=this@entry=0x7fffffffbec0)
at src/ccutil/clst.cpp:265
#1 0x00007ffff7ec4b8d in tesseract::CLIST::add_sorted (this=<optimized out>,
comparator=comparator@entry=
0x7ffff7d86b90 <tesseract::SortByBoxLeft<tesseract::ColPartition>(void const*, void const*)>,
unique=unique@entry=true, new_data=<optimized out>, new_data@entry=0x5555bf4f66b0)
at src/ccutil/clst.cpp:176
#2 0x00007ffff7e2cdf7 in tesseract::BBGrid<tesseract::ColPartition, tesseract::ColPartition_CLIST, tesseract::ColPartition_C_IT>::InsertBBox (this=this@entry=0x55555b031130, h_spread=h_spread@entry=true,
v_spread=v_spread@entry=true, bbox=0x5555bf4f66b0) at src/textord/bbgrid.h:551
#3 0x00007ffff7e3f664 in tesseract::ColPartitionGrid::ComputeTotalOverlap (
this=this@entry=0x5555555aec68, overlap_grid=overlap_grid@entry=0x7fffffffc158)
at src/textord/colpartitiongrid.cpp:329
#4 0x00007ffff7e71620 in tesseract::StrokeWidth::DetectAndRemoveNoise (this=0x55555558c420,
pre_overlap=95268, grid_box=..., block=0x55555558bf20, part_grid=0x5555555aec68,
diacritic_blobs=0x7fffffffc688) at src/textord/strokewidth.cpp:1350
#5 0x00007ffff7e729da in tesseract::StrokeWidth::FindInitialPartitions (
this=this@entry=0x55555558c420, pageseg_mode=pageseg_mode@entry=tesseract::PSM_AUTO,
rerotation=..., find_problems=find_problems@entry=true, block=block@entry=0x55555558bf20,
diacritic_blobs=diacritic_blobs@entry=0x7fffffffc688, part_grid=0x5555555aec68,
big_parts=0x5555555aec98, skew_angle=0x7fffffffc340) at src/textord/strokewidth.cpp:1310
#6 0x00007ffff7e72c08 in tesseract::StrokeWidth::GradeBlobsIntoPartitions (this=0x55555558c420,
pageseg_mode=pageseg_mode@entry=tesseract::PSM_AUTO, rerotation=...,
block=block@entry=0x55555558bf20, nontext_pix=..., denorm=<optimized out>, cjk_script=false,
projection=0x5555555aecc0, diacritic_blobs=0x7fffffffc688, part_grid=0x5555555aec68,
big_parts=0x5555555aec98) at src/textord/strokewidth.cpp:379
#7 0x00007ffff7e2be71 in tesseract::ColumnFinder::FindBlocks (this=this@entry=0x5555555aeb30,
pageseg_mode=pageseg_mode@entry=tesseract::PSM_AUTO, scaled_color=...,
scaled_factor=<optimized out>, input_block=input_block@entry=0x55555558bf20, photo_mask_pix=...,
thresholds_pix=..., grey_pix=..., pixa_debug=0x7ffff7c6abd0, blocks=0x7fffffffc5e8,
diacritic_blobs=0x7fffffffc688, to_blocks=0x7fffffffc690) at src/textord/colfind.cpp:296
#8 0x00007ffff7d5509b in tesseract::Tesseract::AutoPageSeg (this=0x7ffff7c47010,
pageseg_mode=tesseract::PSM_AUTO, blocks=0x5555555b0c90, to_blocks=0x7fffffffc690,
diacritic_blobs=0x7fffffffc688, osd_tess=<optimized out>, osr=0x7fffffffca40)
at src/ccmain/pagesegmain.cpp:226
#9 0x00007ffff7d5555d in tesseract::Tesseract::SegmentPage (this=0x7ffff7c47010,
input_file=<optimized out>, blocks=0x5555555b0c90, osd_tess=osd_tess@entry=0x0,
osr=osr@entry=0x7fffffffca40) at src/ccmain/pagesegmain.cpp:140
#10 0x00007ffff7d227bf in tesseract::TessBaseAPI::FindLines (this=0x7fffffffd780)
at /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9/bits/basic_string.h:2300
#11 0x00007ffff7d24f64 in tesseract::TessBaseAPI::Recognize (this=0x7fffffffd780, monitor=0x0)
at src/api/baseapi.cpp:838
#12 0x00007ffff7d2552a in tesseract::TessBaseAPI::ProcessPage (this=this@entry=0x7fffffffd780,
pix=0x5555555b1c50, page_index=page_index@entry=0,
filename=filename@entry=0x7fffffffdf54 "/tmp/sim_new-york-times_1900-01-11_49_15-603_0008.ppm",
retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=
0x5555555a2810) at src/api/baseapi.cpp:1259
#13 0x00007ffff7d26172 in tesseract::TessBaseAPI::ProcessPagesInternal (this=0x7fffffffd780,
filename=<optimized out>, retry_config=0x0, timeout_millisec=0, renderer=0x5555555a2810)
at src/api/baseapi.cpp:1218
#14 0x00007ffff7d2673f in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x7fffffffd780,
filename=filename@entry=0x7fffffffdf54 "/tmp/sim_new-york-times_1900-01-11_49_15-603_0008.ppm",
retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0,
renderer=<optimized out>) at src/api/baseapi.cpp:1071
#15 0x0000555555558295 in main (argc=<optimized out>, argv=<optimized out>)
at src/api/tesseractmain.cpp:783
Image: https://archive.org/~merlijn/tesseract-images/sim_new-york-times_1900-01-11_49_15-603_0008.ppm
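One possible reading of frames #0 to #2 in the backtrace above is that the program is not stuck in an endless loop but is doing a very large number of sorted insertions into linked lists, each of which walks the list to find its position. That is an assumption, not a diagnosis confirmed here. The sketch below is a toy Python model of that cost pattern; `Node`, `SortedList`, and `steps_for` are hypothetical names and this is not Tesseract's `CLIST` code.

```python
class Node:
    """Singly linked list node (toy model, not Tesseract's CLIST_LINK)."""
    __slots__ = ("data", "next")

    def __init__(self, data, next=None):
        self.data = data
        self.next = next


class SortedList:
    """Keeps values in ascending order by walking from the head on every insert."""

    def __init__(self):
        self.head = None
        self.steps = 0  # total nodes visited across all inserts

    def add_sorted(self, value):
        prev, cur = None, self.head
        # Walk forward until the first node that is not smaller than value.
        while cur is not None and cur.data < value:
            self.steps += 1
            prev, cur = cur, cur.next
        node = Node(value, cur)
        if prev is None:
            self.head = node
        else:
            prev.next = node


def steps_for(n):
    """Insert 0..n-1 in ascending order (worst case) and return nodes visited."""
    lst = SortedList()
    for v in range(n):
        lst.add_sorted(v)
    return lst.steps


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        print(n, steps_for(n))
```

In this worst case the number of visited nodes is n(n-1)/2, so doubling the number of elements roughly quadruples the work. If something similar happens with the partitions inserted into a single grid cell, a very large page could plausibly take minutes rather than hang forever.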
About this issue
- State: open
- Created 3 years ago
- Comments: 28 (21 by maintainers)
With the code from #3418 and Sauvola binarization, processing finishes after 4:30 minutes. The output looks good.
Note that the image size is equivalent to 7 A4 pages, so the processing time is about 38 seconds per page.
With adaptive Otsu I get ‘Empty page!’ after 36 seconds.
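The per-page figure quoted for the Sauvola run can be checked with a few lines of arithmetic. This is only a sketch of that calculation, using the 4:30 total and the 7-page estimate from the comment; the variable names are illustrative.

```python
# Total runtime reported for the Sauvola run: 4 minutes 30 seconds.
total_seconds = 4 * 60 + 30

# The image is described as equivalent to 7 A4 pages.
pages = 7

per_page = total_seconds / pages
print(f"{per_page:.1f} s per page")
```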