tesseract: text2image fails to generate a box file when --find_fonts is enabled; not supported on multilingual text
Environment
- Tesseract Version: tesseract 4.00.00alpha leptonica-1.74.4 libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
Found AVX2 Found AVX Found SSE
- Commit Number:
- Platform: ubuntu 16.04 Linux <my_container_id> 4.4.0-97-generic # 120-Ubuntu SMP Tue Sep 19 17:28:18 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Current Behavior:
I first tried
training/text2image --font='Microsoft Sans Serif' --text /data/dataset/tha+eng/txt/0.txt --outputbase /data/dataset/tha+eng/0 --max_pages=1 --box_padding=2 --ptsize=10 --underline_continuation_prob=0.1 --fonts_dir=/usr/share/fonts --min_coverage 0.9
It gives both the tif and box files.
Because my text file is a mixture of 2 different languages, I enabled --find_fonts to find the fonts that are able to render it. So I tried
training/text2image --find_fonts --text /data/dataset/tha+eng/txt/0.txt --outputbase /data/dataset/tha+eng/0 --min_coverage 0.9 --max_pages=1 --box_padding=2 --ptsize=10 --underline_continuation_prob=0.1 --fonts_dir=/usr/share/fonts
This time it gives only one output, 0.Microsoft Sans Serif.tif, which means it rendered with only 1 font and wrote no box file. I also tried English-only text with the same --find_fonts flag, and it has the same problem.
Another problem: when I tried to render a file with 3 languages using --find_fonts, it returns
Stripped 1 unrenderable words
Microsoft Sans Serif : 276 hits = 96.50%, raw = 85 = 89.47%
Rendered page 0 to file /data/dataset/tha+eng+zho/0.Microsoft_Sans_Serif.tif
When I tried with --font='Microsoft Sans Serif' instead, it returns:
Stripped 1 unrenderable words
Rendered page 0 to file /data/dataset/tha+eng+zho/0.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!
Expected Behavior:
Produce multiple tif/box pairs, one per font.
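As a possible workaround until --find_fonts writes box files, the first command above (which did produce both files) could be run once per font. The following is only a sketch under assumptions: the helper name render_per_font, the T2I variable, and the font list are hypothetical, and it is not confirmed that every font yields a usable box file for multilingual text. The flags are the ones already used in the commands above.

```sh
#!/bin/sh
# Sketch: one text2image run per font, using --font instead of --find_fonts.
# T2I is the command to run (assumption: the same binary path as above).
# Set T2I="echo training/text2image" to print the commands without running them.
T2I="${T2I:-training/text2image}"

# render_per_font TEXT_FILE OUTPUT_PREFIX FONT [FONT...]
# (hypothetical helper, not part of Tesseract)
render_per_font() {
  rpf_text="$1"
  rpf_prefix="$2"
  shift 2
  for rpf_font in "$@"; do
    # Replace spaces so each font gets its own output base,
    # e.g. 0.Microsoft_Sans_Serif
    rpf_safe=$(printf '%s' "$rpf_font" | tr ' ' '_')
    $T2I --font="$rpf_font" \
      --text "$rpf_text" \
      --outputbase "${rpf_prefix}.${rpf_safe}" \
      --fonts_dir=/usr/share/fonts \
      --max_pages=1 --box_padding=2 --ptsize=10
  done
}
```

It could be called as, for example, render_per_font /data/dataset/tha+eng/txt/0.txt /data/dataset/tha+eng/0 'Microsoft Sans Serif' with any further font names appended. Fonts that cannot render the text would still need to be filtered out beforehand.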
Suggested Fix:
It is hard to find fonts that support 3 languages from different writing systems. Is there any chance of adding a function that renders each language's text with whatever fonts are available for it?
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 15
I use the following command to get a list of fonts. Change the directory and file names.