tesseract: tesstrain.sh script exits with error
Short description
I am trying to train Tesseract for the Akkadian language. The language-specific.sh script was modified accordingly. When converting the training text to TIFF images, the text2image program crashes.
Environment
- Tesseract Version: 3.04.01
- Commit Number: the standard package in Ubuntu, package version 3.04.01-4, commit unknown
- Platform: Linux ubuntu-xenial 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
The environment was created using Vagrant. The commands are started on command line without GUI environment.
Running tesseract -v produces the following output:
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
Current Behavior:
When running tesstrain.sh with this command
./tesstrain.sh --lang akk --training_text corpus-12pt.txt --tessdata_dir /usr/share/tesseract-ocr/tessdata --langdata_dir ../langdata --fonts_dir /usr/share/fonts --fontlist "CuneiformNAOutline Medium" "CuneiformOB" --output_dir .
text2image crashes on every font with this message:
cluster_text.size() == start_byte_to_box.size():Error:Assert failed:in file stringrenderer.cpp, line 541
As a result, no box files are generated, so tesstrain.sh exits with these messages:
ERROR: /tmp/tmp.XSb02nt10d/akk/akk.CuneiformOB.exp0.box does not exist or is not readable
ERROR: /tmp/tmp.XSb02nt10d/akk/akk.CuneiformNAOutline_Medium.exp0.box does not exist or is not readable
Expected Behavior:
tesstrain.sh should create the box files and proceed with training.
Attachments:
I attached all files used: akktrain.zip.
The fonts are hosted here, but for the sake of completeness the .ttf files are included in the archive; they should be moved to /usr/share/fonts.
About this issue
- Original URL
- State: open
- Created 6 years ago
- Reactions: 1
- Comments: 25 (6 by maintainers)
Dear @Shreeshrii ,
thank you very much for your help! The training worked beautifully after rewrapping the corpus (I wrote a short script in Python 3, since it handles UTF-8 documents well; you can find it here).
I had to rewrap the corpus to 35 characters per line. Widths of 10, 20, 30, 40, 50, 60 and 70 characters per line did not work. But that is another issue, I think.
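For reference, a minimal sketch of such a rewrapping script, assuming the 35-character width mentioned above (the actual script linked in this comment may differ):

```python
#!/usr/bin/env python3
"""Rewrap a UTF-8 training corpus to a fixed line width.

A minimal sketch, not the author's actual script; the width of 35
characters per line is taken from the discussion above.
"""
import sys
import textwrap

WIDTH = 35  # characters per line that worked for the Akkadian fonts


def rewrap(text, width=WIDTH):
    # Flatten the corpus into a single word stream, then wrap it so
    # that no line exceeds the target width.
    return "\n".join(textwrap.wrap(" ".join(text.split()), width=width))


if __name__ == "__main__":
    infile, outfile = sys.argv[1], sys.argv[2]
    with open(infile, encoding="utf-8") as f:
        wrapped = rewrap(f.read())
    with open(outfile, "w", encoding="utf-8") as f:
        f.write(wrapped + "\n")
```

The rewrapped file can then be passed to tesstrain.sh via --training_text as before.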
To be honest, I both followed and disregarded your advice to use the training scripts for Tesseract4 only. As a matter of fact, I need to use Tesseract3, so I tested the old training scripts first. Nevertheless, I created a Docker container with Ubuntu Bionic and ran the training script for and with Tesseract4. It worked as well as with Tesseract3.
Hence we can regard the rewrapping of a corpus file as an official workaround for this issue. Shall I edit the wiki pages about training Tesseract3 and Tesseract4?
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --log-file=./valgrind-out.txt /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.opjpN7f94T --fonts_dir=…/.fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=-1 --outputbase=/tmp/tmp.tH84CXo6fq/akk/akk.CuneiformOB.exp-1 --max_pages=0 --font=CuneiformOB --text=./langdata/akk/corpus-12pt.txt
valgrind-out.txt
@zdenop The first post of this issue has the info you want.
Attachments:
@wincentbalin had to work around the problem too…
@stweil : can you provide font and text that failed for you?
Issue #765 is a duplicate.
I looked at the LSTM training results that I have. They have CER of less than 10% and WER of 15%.
See the note above the code section that causes the crash:
Is this with tesseract 3.05? Training for legacy engine?
I had done a training run for LSTM but didn’t test it. I will share it in a day or two, I am traveling now.
If you can make your test images and ground truth available some place, I can check accuracy too.
On Wed, 19 Sep 2018, 20:53 Wincent Balin, notifications@github.com wrote:
Currently the WER is around 10 per cent, but sometimes I got it lower. I think it requires some tinkering.
The program I use takes random words from the wordlist, creates texts from them and saves each text into an image using text2image. Then tesseract is used to recognize the text back, and the WER of the result compared to the original text is calculated.
https://en.wikipedia.org/wiki/Cuneiform_script
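The WER computation in that loop can be sketched as a word-level edit distance; this is a standard approach and an assumption about the commenter's program, not their actual code:

```python
"""Word error rate (WER) between a reference text and an OCR result.

A rough sketch of the evaluation step described above; the surrounding
text2image rendering and tesseract recognition steps are not shown.
"""


def wer(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Averaging this over many rendered samples gives the percentages quoted in this thread.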
It will be hard to get good accuracy on material that is more than 1900 years old.
OK
I think the 35 characters per line depends on the size of the Akkadian characters in the fonts that were used. I don't think that will be the case globally for all languages.
Sure. My suggestion was geared more towards using tesseract4.
What kind of accuracy are you getting with the akk traineddata with tesseract3?
On Thu, Aug 2, 2018 at 2:01 AM Wincent Balin notifications@github.com wrote:
The issue #765 seems to be (roughly) related to this one.
@stweil This assert seems to be related to the length of a training_text line. Is it possible to have a more descriptive error message, if that is indeed the case?
I used Notepad++ on Windows. Yes, it displays correctly after changing the encoding to UTF-8.
I think the problem is caused by extra-long lines. I saved the file as UTF-8 and split the long lines into smaller ones, and it seems to be working OK.
Make sure you are using the new version of tesstrain.sh (uninstall the version from tesseract 3).
file corpus-12pt.txt recognizes UTF-8 encoding. Which software do you use to look at the text? If you choose the right encoding, this image should appear:
Hello @Shreeshrii,
I ran tesstrain.sh with the same options under Ubuntu Bionic (in a Docker container) and got the same results, as well as the attached coredump. The version information is
P.S.: the language-specific.sh script is attached too.
Please try with the latest version of tesseract.