Tesseract 4.0: bugs on Mac OS X and a step-by-step guide for reference

This is the step-by-step procedure I used to install tesseract 4.0 on Mac OS X, together with the fixes and workarounds I needed to make it work. I’m sharing this “guide” to help other people who may run into the same problems I had.

Special thanks to Shree, who helped me on the Google group.

Project and more details: https://github.com/tesseract-ocr/tesseract

Where to get help?

Google group: https://groups.google.com/forum/#!forum/tesseract-ocr
GitHub issues: https://github.com/tesseract-ocr/tesseract/issues

Platform: Mac OS X 10.13.3
Tesseract: 4.0.0-beta.1-69-g10f4
leptonica-1.75.3
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE

Compiling Tesseract - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/Compiling#macos

Warning: Don’t install tesseract using brew, since you can’t generate the ScrollView.jar from it! (At least I wasn’t able to generate it)

Steps

1 - Install these libs

brew install automake autoconf autoconf-archive libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
brew install gcc

2 - Point /usr/local/opt/icu4c at icu4c 60.2

ln -hfs /usr/local/Cellar/icu4c/60.2 /usr/local/opt/icu4c

Note: text2image is set to use icu4c/60.2, but the current version is icu4c/61.1
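Since the icu4c version in the Cellar changes whenever brew updates it, it may help to look at what is actually installed before forcing the link. This is only a small sketch and not part of the original steps: the function name list_icu_versions is made up for illustration, and the default path is simply the one used in the command above.

```shell
# List the icu4c versions present in a Homebrew Cellar directory.
# The default path is the one from step 2; pass another path as $1 if yours differs.
list_icu_versions() {
  cellar=${1:-/usr/local/Cellar/icu4c}
  [ -d "$cellar" ] || { echo "no such directory: $cellar" >&2; return 1; }
  for d in "$cellar"/*/; do
    [ -d "$d" ] || continue
    basename "$d"
  done
}

# Example: list_icu_versions
# Example: readlink /usr/local/opt/icu4c   # shows where the link points right now
```

If the version printed is not 60.2, the path in the ln command probably needs to be adjusted to match.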

3 - Clone tesseract repo

git clone https://github.com/tesseract-ocr/tesseract/

4 - Enter the folder

cd tesseract

5 - Run the script

./autogen.sh

6 - Run this command and copy the CPPFLAGS and LDFLAGS values it prints

brew info icu4c

7 - Update CPPFLAGS and LDFLAGS with those values and run configure

./configure \
  CPPFLAGS=-I/usr/local/opt/icu4c/include \
  LDFLAGS=-L/usr/local/opt/icu4c/lib
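If configure does not find the training dependencies, "make training" fails later on, so it can be worth checking for them first. The sketch below assumes pkg-config (installed in step 1) is on the PATH. The function name missing_modules is hypothetical, and the module names in the example (icu-uc, icu-i18n, pango, cairo) are my assumption about what configure looks for, not something taken from this guide.

```shell
# Print the names of pkg-config modules that cannot be found.
# Module names are passed as arguments; nothing is printed if all are present.
missing_modules() {
  for m in "$@"; do
    pkg-config --exists "$m" || echo "$m"
  done
}

# Example (module names are an assumption):
# missing_modules icu-uc icu-i18n pango cairo
```

Anything it prints is a candidate for a brew install before running configure again. Because brew does not link icu4c into /usr/local, pkg-config may also need to be pointed at the icu4c keg through PKG_CONFIG_PATH before it can see the icu modules.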

8 - Build

make -j

9 - Install

sudo make install

10 - Update the dynamic linker shared cache

sudo update_dyld_shared_cache

Note: this is the Mac OS X equivalent of sudo ldconfig

11 - Build the training tools

make training
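The rest of this guide calls several training tools directly from the tesseract/training folder, so a quick check that the build produced them can save time. This is a minimal sketch: the function name missing_training_tools is made up, and the tool list and default path are just the ones that appear in the commands further down.

```shell
# Report which of the training tools used later in this guide are missing
# (or not executable) in a build directory.
missing_training_tools() {
  dir=${1:-$HOME/projects/tesseract/training}
  for tool in text2image lstmtraining lstmeval combine_tessdata; do
    [ -x "$dir/$tool" ] || echo "$tool"
  done
}

# Example: missing_training_tools
```

No output means all four tools were found.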

Creating ScrollView.jar - tesseract 4.0

Reference:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line
https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging

Important: Use JDK 8 to build, or else the build returns an error
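To see which JDK is active before building, the version string from "java -version" can be reduced to its major number. The helper below is only a sketch with a made-up name (java_major). It handles both the old "1.8.0_161" style and the newer "11.0.2" style of version strings.

```shell
# Extract the major version from a Java version string.
# Old scheme: 1.8.0_161 gives 8. New scheme: 11.0.2 gives 11.
java_major() {
  case $1 in
    1.*) v=${1#1.}; echo "${v%%.*}" ;;
    *)   echo "${1%%.*}" ;;
  esac
}

# Example: feed it the quoted version from the first line of "java -version"
# ver=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
# [ "$(java_major "$ver")" = "8" ] || echo "not JDK 8: $ver"
```

On Mac OS X, /usr/libexec/java_home -v 1.8 prints the path of an installed JDK 8, which can be exported as JAVA_HOME if a newer JDK is the default.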

Steps

1 - Download the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar

http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-core/3.0/piccolo2d-core-3.0.jar
http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-extras/3.0/piccolo2d-extras-3.0.jar

2 - Move the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar to tesseract/java
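Steps 1 and 2 can also be done from the terminal. The following is a sketch using curl. The function names piccolo_url and fetch_piccolo are hypothetical, the URLs are the two listed in step 1, and the default destination assumes the repo was cloned to ~/projects/tesseract.

```shell
# Build the download URL for a piccolo2d 3.0 jar ("core" or "extras").
piccolo_url() {
  printf 'http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-%s/3.0/piccolo2d-%s-3.0.jar\n' "$1" "$1"
}

# Download both jars straight into the java folder (default path is an assumption).
fetch_piccolo() {
  dest=${1:-$HOME/projects/tesseract/java}
  for part in core extras; do
    curl -fL -o "$dest/piccolo2d-$part-3.0.jar" "$(piccolo_url "$part")" || return 1
  done
}

# Example: fetch_piccolo ~/projects/tesseract/java
```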

3 - Enter the tesseract/java folder

cd java

4 - Set the var SCROLLVIEW_PATH to your tesseract/java folder and run the code

SCROLLVIEW_PATH=~/projects/tesseract/java make ScrollView.jar

Training Font - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#user-content-using-tesstrain

Steps

1 - Clone the langdata repo from git (the commands below expect it at ~/projects/langdata)

git clone https://github.com/tesseract-ocr/langdata

2 - Enter the tesseract folder

cd ..

3 - Run this command and select one font from the list (I recommend “Verdana”)

text2image --list_available_fonts --fonts_dir=/Library/Fonts

Font directories on Mac can be:
~/Library/Fonts
/Library/Fonts/
/Network/Library/Fonts/
/System/Library/Fonts/
/System Folder/Fonts/

More details here: https://support.apple.com/en-us/HT201722

4 - Replace line 195 of the file tesseract/training/tesstrain_utils.sh as follows

- export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
+ export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)

Note: this fixes the following error:

mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
       mktemp [-d] [-q] [-u] -t prefix
/Users/username/projects/tesseract/training/tesstrain_utils.sh: line 197: /sample_text.txt: Permission denied
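The error comes from the --tmpdir option, which the mktemp shipped with Mac OS X does not accept. The "Permission denied" line is presumably a side effect: with mktemp failing, FONT_CONFIG_CACHE ends up empty and the script tries to write to /sample_text.txt. As an alternative to the -t form, giving mktemp a full template path avoids both options. This is a sketch with a made-up function name (make_font_cache_dir), not what tesstrain_utils.sh itself does.

```shell
# Create a temporary directory without --tmpdir or -t, by passing the full
# template path. TMPDIR is used when set, /tmp otherwise.
make_font_cache_dir() {
  mktemp -d "${TMPDIR:-/tmp}/font_tmp.XXXXXXXXXX"
}

# Example: export FONT_CONFIG_CACHE=$(make_font_cache_dir)
```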

5 - Clone the tessdata repo from git (I recommend “tessdata_best” since it is the most accurate; “tessdata_fast” is just faster)

git clone https://github.com/tesseract-ocr/tessdata_best

or

git clone https://github.com/tesseract-ocr/tessdata_fast

6 - Copy tessdata_best/eng.traineddata (for English training) from the tessdata repo you just cloned and paste it into tesseract/tessdata/

7 - Create the training data

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0"    \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Verdana" \
  --output_dir ~/tesstutorial/engtrain

Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX

8 - Create evaluation data using another font, for comparison

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0"    \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Times New Roman," \
  --output_dir ~/tesstutorial/engeval

Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX
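Steps 7 and 8 differ only in the font and the output directory, so they can be wrapped in one small function. This is just a sketch: the name make_linedata and the DRY_RUN and TESSTRAIN variables are made up for illustration, and the paths are the same ~/projects ones used in the commands above.

```shell
# Run tesstrain.sh with the options from steps 7 and 8 for one font.
# $1 = font name, $2 = output directory.
# With DRY_RUN set, the arguments are printed one per line instead of being run.
make_linedata() {
  font=$1
  out=$2
  set -- "${TESSTRAIN:-$HOME/projects/tesseract/training/tesstrain.sh}" \
    --fonts_dir /Library/Fonts \
    --lang eng \
    --linedata_only \
    --noextract_font_properties \
    --exposures "0" \
    --langdata_dir "$HOME/projects/langdata" \
    --tessdata_dir "$HOME/projects/tesseract/tessdata" \
    --fontlist "$font" \
    --output_dir "$out"
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$@"
  else
    # PANGOCAIRO_BACKEND=fc is the Mac OS X prefix described above
    PANGOCAIRO_BACKEND=fc "$@"
  fi
}

# Example:
# make_linedata "Verdana" ~/tesstutorial/engtrain
# make_linedata "Times New Roman," ~/tesstutorial/engeval
```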

9 - Create the needed folder

mkdir -p ~/tesstutorial/engoutput

10 - Start the training

SCROLLVIEW_PATH=~/projects/tesseract/java \
~/projects/tesseract/training/lstmtraining \
--debug_interval 100 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base \
--learning_rate 20e-4 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log

If you failed to build ScrollView.jar, set debug_interval to -1: --debug_interval -1

11 - Monitor the log on another console

tail -f ~/tesstutorial/engoutput/basetrain.log

12 - Test accuracy with the other font

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

13 - Test accuracy with the best traineddata

~/projects/tesseract/training/lstmeval \
  --model ~/projects/tessdata_best/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

14 - Test accuracy with the current traineddata (in this case the same as step 13)

~/projects/tesseract/training/lstmeval \
  --model ~/projects/tesseract/tessdata/eng.traineddata \
  --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt

Fine tuning - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

Steps

1 - Create the necessary folder

mkdir -p ~/tesstutorial/verdana_from_small

2 - Start fine tuning

~/projects/tesseract/training/lstmtraining \
  --model_output ~/tesstutorial/verdana_from_small/verdana \
  --continue_from ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 1200

3 - Validate the progress

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/verdana_from_small/verdana_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

4 - Create the necessary folder

mkdir -p ~/tesstutorial/verdana_from_full

5 - Extract the LSTM model from the trained data

~/projects/tesseract/training/combine_tessdata \
  -e ~/projects/tesseract/tessdata/eng.traineddata \
  ~/tesstutorial/verdana_from_full/eng.lstm
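With -e, combine_tessdata extracts a component from the traineddata file, here the LSTM model that the next step continues from. Before starting a training run, it can be worth confirming that the file was really written. A minimal sketch, with a made-up helper name (require_nonempty):

```shell
# Succeed only if the given file exists and is not empty.
require_nonempty() {
  [ -s "$1" ] || { echo "missing or empty: $1" >&2; return 1; }
}

# Example: require_nonempty ~/tesstutorial/verdana_from_full/eng.lstm
```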

6 - Fine tune starting from the extracted model

~/projects/tesseract/training/lstmtraining \
  --model_output ~/tesstutorial/verdana_from_full/verdana \
  --continue_from ~/tesstutorial/verdana_from_full/eng.lstm \
  --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 400

7 - Validate the results on the main training file

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
  --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

8 - Validate the results on our training file

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
  --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
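Steps 7 and 8 run the same checkpoint against two list files, which can be written as a loop. Again only a sketch: eval_checkpoint, DRY_RUN and LSTMEVAL are hypothetical names, and the default lstmeval path is the one used throughout this guide.

```shell
# Evaluate one model against several list files.
# $1 = model or checkpoint, $2 = traineddata, remaining arguments = list files.
# With DRY_RUN set, the commands are printed instead of being run.
eval_checkpoint() {
  model=$1
  traineddata=$2
  shift 2
  for list in "$@"; do
    if [ -n "${DRY_RUN:-}" ]; then
      echo "lstmeval --model $model --traineddata $traineddata --eval_listfile $list"
    else
      "${LSTMEVAL:-$HOME/projects/tesseract/training/lstmeval}" \
        --model "$model" \
        --traineddata "$traineddata" \
        --eval_listfile "$list"
    fi
  done
}

# Example:
# eval_checkpoint ~/tesstutorial/verdana_from_full/verdana_checkpoint \
#   ~/projects/tesseract/tessdata/eng.traineddata \
#   ~/tesstutorial/engeval/eng.training_files.txt \
#   ~/tesstutorial/engtrain/eng.training_files.txt
```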

Fine tuning add ± character - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters

Steps

1 - Modify langdata/eng/eng.training_text and include these lines:

alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)
VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert
Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS firm
Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY
Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL
Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF
Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)
Oberflachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €
netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED
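After editing the file, a quick count of the lines containing the new character confirms that the text was saved with the ± intact (for example, not mangled by the editor's encoding). A small sketch with a made-up function name (count_lines_with). The path in the example assumes langdata was cloned to ~/projects/langdata.

```shell
# Count the lines of a file that contain a given literal string.
count_lines_with() {
  grep -c -F -- "$1" "$2"
}

# Example: count_lines_with '±' ~/projects/langdata/eng/eng.training_text
```

If only the lines listed above contain the character, the count should be 13.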

2 - Generate the training file

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Times New Roman," \
              "Times New Roman, Bold" \
              "Times New Roman, Bold Italic" \
              "Times New Roman, Italic" \
              "Courier New" \
              "Courier New Bold" \
              "Courier New Bold Italic" \
              "Courier New Italic" \
  --output_dir ~/tesstutorial/trainplusminus

3 - Generate the eval data

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Verdana" \
  --output_dir ~/tesstutorial/evalplusminus

4 - Extract the LSTM model from the trained data

~/projects/tesseract/training/combine_tessdata \
  -e ~/projects/tesseract/tessdata/eng.traineddata \
  ~/tesstutorial/trainplusminus/eng.lstm

5 - Fine tuning

~/projects/tesseract/training/lstmtraining \
  --model_output ~/tesstutorial/trainplusminus/plusminus \
  --continue_from ~/tesstutorial/trainplusminus/eng.lstm \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --old_traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \
  --max_iterations 3600

6 - Test the result on other fonts

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt

7 - Test the result on the main font

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 13
  • Comments: 57 (13 by maintainers)

Most upvoted comments

@FernandoGOT Thank you very much for such a detailed explanation but I can’t make it work. When I say “make training” it gives me “Need to reconfigure project, so there are no errors” error. Also, I couldn’t create ScrollView.jar. Is it possible to update this post? Thank you.

Please check your output after running this code:

./configure \
  CPPFLAGS=-I/usr/local/opt/icu4c/include \
  LDFLAGS=-L/usr/local/opt/icu4c/lib

I came across the same error and the log showed me an issue with icu4c and also asked to install pango.

Once done, run the above code again and hopefully your error will be solved.

@nnnikolay, I am sorry, that was my fault. It is now fixed with commit 421ebf0418f415c2ca270521243d4edc36dd44bf.

@khalajink Yes, see my answer in that SO thread https://stackoverflow.com/a/57968945/1021819

@ysnnzlcn I’m short on time these days (working too much), but when I get some free time I’m going to write a better step-by-step on how to use tesseract and send a merge request to the docs