tesseract: text2image fails to generate a box file when --find_fonts is enabled; multilingual text is not supported

Environment

  • Tesseract Version: tesseract 4.00.00alpha leptonica-1.74.4 libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8

Found AVX2 Found AVX Found SSE

  • Commit Number:
  • Platform: ubuntu 16.04 Linux <my_container_id> 4.4.0-97-generic # 120-Ubuntu SMP Tue Sep 19 17:28:18 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

I first tried

training/text2image --font='Microsoft Sans Serif' --text /data/dataset/tha+eng/txt/0.txt --outputbase /data/dataset/tha+eng/0 --max_pages=1 --box_padding=2 --ptsize=10  --underline_continuation_prob=0.1  --fonts_dir=/usr/share/fonts  --min_coverage 0.9

It gives both tif and box files.

Because my text file mixes two languages, I enabled --find_fonts to find the fonts that can render it. So I tried

training/text2image --find_fonts  --text /data/dataset/tha+eng/txt/0.txt --outputbase /data/dataset/tha+eng/0 --min_coverage 0.9 --max_pages=1 --box_padding=2 --ptsize=10  --underline_continuation_prob=0.1  --fonts_dir=/usr/share/fonts

This produces only one output, 0.Microsoft Sans Serif.tif: it renders with a single font and writes no box file. I also tried English-only text with the same --find_fonts flag, and the problem is the same.

Another problem: when I tried to render a file containing three languages with --find_fonts, it returns

Stripped 1 unrenderable words
Microsoft Sans Serif : 276 hits = 96.50%, raw = 85 = 89.47%
Rendered page 0 to file /data/dataset/tha+eng+zho/0.Microsoft_Sans_Serif.tif

When I tried with --font='Microsoft Sans Serif', it returns:

Stripped 1 unrenderable words
Rendered page 0 to file /data/dataset/tha+eng+zho/0.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!

Expected Behavior:

Produce multiple tif/box pairs, one pair per font.

Suggested Fix:

It is hard to find fonts that support three languages from different writing systems. Is there any chance of adding a function that renders each language's text with whatever fonts are available for that language?

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 15

Most upvoted comments

I use the following command to get a list of fonts. Change the directory and file names to match your setup.

nice text2image --find_fonts \
--fonts_dir ./.fonts \
--text ./langdata/ara/ara.diacritics.training_text \
--min_coverage 0.99975 \
--render_per_font=false \
--outputbase ./langdata/ara/ara \
|& grep raw \
 | sed -e 's/ :.*/@ \\/g' \
 | sed -e "s/^/  '/" \
 | sed -e "s/@/'/g" > ./langdata/ara/ara.diacritics.fontslist.txt
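
To see what the grep and sed stages above do, here is a minimal sketch that runs them on the sample --find_fonts output quoted in the issue. The function name fonts_from_log is hypothetical, and text2image itself is not invoked; the sample log lines are simply fed in with printf.

```shell
# Hypothetical helper: same grep/sed stages as the command above,
# reading text2image --find_fonts log lines from stdin.
fonts_from_log() {
  grep raw \
    | sed -e 's/ :.*/@ \\/g' \
    | sed -e "s/^/  '/" \
    | sed -e "s/@/'/g"
}

# Sample log lines copied from the issue text; only the line
# containing "raw" (the per-font coverage line) survives grep.
printf '%s\n' \
  'Stripped 1 unrenderable words' \
  'Microsoft Sans Serif : 276 hits = 96.50%, raw = 85 = 89.47%' \
  'Rendered page 0 to file /data/dataset/tha+eng+zho/0.Microsoft_Sans_Serif.tif' \
  | fonts_from_log
```

Each surviving line becomes a single-quoted font name, indented by two spaces and followed by a line-continuation backslash, so the resulting list can be pasted into a multi-line shell command or script.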