tesseract: 4.0 bugs on MAC OS X and a step by step for reference
This is the step-by-step process I used to install tesseract 4.0 on MAC OS X, along with the fixes and workarounds I needed to make it work. I’m sharing this “guide” to help other people who may run into the same problems I had.
Special thanks to Shree, who helped me on the google group.
Project and more details: https://github.com/tesseract-ocr/tesseract
where to get help?
google group: https://groups.google.com/forum/#!forum/tesseract-ocr
git: https://github.com/tesseract-ocr/tesseract/issues
Platform: MAC OS X 10.13.3
Tesseract: 4.0.0-beta.1-69-g10f4
leptonica-1.75.3
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE
Compiling Tesseract - tesseract 4.0
Reference: https://github.com/tesseract-ocr/tesseract/wiki/Compiling#macos
Warning: Don’t install tesseract using brew, since you can’t generate the ScrollView.jar from it! (At least I wasn’t able to generate it)
Steps
1 - Install these libs
brew install automake autoconf autoconf-archive libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
brew install gcc
2 - Run the code
ln -hfs /usr/local/Cellar/icu4c/60.2 /usr/local/opt/icu4c
Obs.: text2image is set to use icu4c/60.2 but the actual version is icu4c/61.1
3 - Clone tesseract repo
git clone https://github.com/tesseract-ocr/tesseract/
4 - Enter the folder
cd tesseract
5 - Run the script
./autogen.sh
6 - Run the code and copy the CPPFLAGS and LDFLAGS values from its output
brew info icu4c
7 - Update the CPPFLAGS and LDFLAGS and execute the code
./configure \
CPPFLAGS=-I/usr/local/opt/icu4c/include \
LDFLAGS=-L/usr/local/opt/icu4c/lib
8 - Run the code
make -j
9 - Run the code
sudo make install
10 - Run the code
sudo update_dyld_shared_cache
Obs.: this is the MAC OS X counterpart of sudo ldconfig
11 - Run the code
make training
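Steps 2, 6, 7 and 8 above hard-code an icu4c version, copy the flags by hand, and run make -j with no job limit (GNU make then starts as many jobs as it can). Below is a minimal sketch of helpers that derive those values instead. The function names latest_keg, icu_configure_flags and cpu_count are hypothetical, and /usr/local/Cellar/icu4c and /usr/local/opt/icu4c are assumed to be the Homebrew paths used in this guide. It only prints what it would use; it does not change anything.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from tesseract or Homebrew.

# Print the highest version directory inside a Cellar formula folder,
# e.g. "61.1" when both 60.2 and 61.1 are present.
latest_keg() {
  ls "$1" | sort -t. -k1,1n -k2,2n | tail -n 1
}

# Print the configure flags for a given icu4c prefix.
icu_configure_flags() {
  printf 'CPPFLAGS=-I%s/include LDFLAGS=-L%s/lib\n' "$1" "$1"
}

# Print the number of CPUs (macOS sysctl first, then getconf), falling back to 1.
cpu_count() {
  n=$(sysctl -n hw.ncpu 2>/dev/null) || n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
  case "$n" in
    ''|*[!0-9]*) n=1 ;;
  esac
  echo "$n"
}

# Dry run: print the values without running configure or make.
ICU_CELLAR=/usr/local/Cellar/icu4c   # assumed path, taken from step 2
if [ -d "$ICU_CELLAR" ]; then
  echo "newest icu4c keg: $(latest_keg "$ICU_CELLAR")"
fi
echo "./configure $(icu_configure_flags /usr/local/opt/icu4c)"
echo "make -j$(cpu_count)"
```

If the printed values look right, the real calls would be ./configure $(icu_configure_flags /usr/local/opt/icu4c) followed by make -j"$(cpu_count)".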
Creating ScrollView.jar - tesseract 4.0
Reference:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line
https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging
Important: Use JDK 8 for the build, otherwise it fails with an error
Steps
1 - Download the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar
http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-core/3.0/piccolo2d-core-3.0.jar
http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-extras/3.0/piccolo2d-extras-3.0.jar
2 - Move the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar to tesseract/java
3 - Enter the tesseract/java folder
cd java
4 - Set the var SCROLLVIEW_PATH to your tesseract/java folder and run the code
SCROLLVIEW_PATH=~/projects/tesseract/java make ScrollView.jar
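The ScrollView build depends on the two piccolo2d jars being in place and on JDK 8, so a quick pre-check before running make can save a confusing failure. This is only a sketch: missing_piccolo_jars is a hypothetical helper name, and the default folder is the ~/projects/tesseract/java path assumed throughout this guide.

```shell
#!/bin/sh
# Print the name of each required piccolo2d jar that is NOT in the given folder.
# missing_piccolo_jars is a hypothetical helper, not part of tesseract.
missing_piccolo_jars() {
  java_dir="$1"
  for jar in piccolo2d-core-3.0.jar piccolo2d-extras-3.0.jar; do
    [ -f "$java_dir/$jar" ] || echo "$jar"
  done
}

# Folder assumed from step 4; override by exporting SCROLLVIEW_PATH.
JAVA_DIR="${SCROLLVIEW_PATH:-$HOME/projects/tesseract/java}"
missing=$(missing_piccolo_jars "$JAVA_DIR")
if [ -n "$missing" ]; then
  echo "missing in $JAVA_DIR:"
  echo "$missing"
fi

# On macOS, java_home exits with an error when no matching JDK is installed.
if [ -x /usr/libexec/java_home ]; then
  /usr/libexec/java_home -v 1.8 >/dev/null 2>&1 || echo "no JDK 8 found"
fi
```

If nothing is printed, both jars are present and (on MAC OS X) a JDK 8 was found, so step 4 should have what it needs.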
Training Font - tesseract 4.0
Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#user-content-using-tesstrain
Steps
1 - Clone the langdata dir from git
git clone https://github.com/tesseract-ocr/langdata
2 - Enter the tesseract folder
cd ..
3 - Execute this code and select one font from the list (I recommend “Verdana”)
text2image --list_available_fonts --fonts_dir=/Library/Fonts
Font dir for MAC can be:
~/Library/Fonts
/Library/Fonts/
/Network/Library/Fonts/
/System/Library/Fonts/
/System Folder/Fonts/
More details here: https://support.apple.com/en-us/HT201722
4 - Replace line 195 of the file tesseract/training/tesstrain_utils.sh as follows
- export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
+ export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)
Obs.: this is a fix for the error:
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
mktemp [-d] [-q] [-u] -t prefix
/Users/username/projects/tesseract/training/tesstrain_utils.sh: line 197: /sample_text.txt: Permission denied
5 - Clone a tessdata repo from git (I recommend “tessdata_best” since it is the most accurate; “tessdata_fast” is just faster)
git clone https://github.com/tesseract-ocr/tessdata_best
or
git clone https://github.com/tesseract-ocr/tessdata_fast
6 - Copy tessdata_best/eng.traineddata (for English training) from the tessdata repo you just cloned and paste it into tesseract/tessdata/
7 - Create the training data
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Verdana" \
--output_dir ~/tesstutorial/engtrain
Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX
8 - Create a second data set using another font, for comparison
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Times New Roman," \
--output_dir ~/tesstutorial/engeval
Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX
9 - Create the needed folder
mkdir -p ~/tesstutorial/engoutput
10 - Start the training
SCROLLVIEW_PATH=~/projects/tesseract/java \
~/projects/tesseract/training/lstmtraining \
--debug_interval 100 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base \
--learning_rate 20e-4 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
If you failed to build ScrollView.jar, set debug_interval to -1: --debug_interval -1
11 - Monitor the log on another console
tail -f ~/tesstutorial/engoutput/basetrain.log
12 - Test accuracy with the other font
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/engoutput/base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
13 - Test accuracy with the best traineddata
~/projects/tesseract/training/lstmeval \
--model ~/projects/tessdata_best/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
14 - Test accuracy with the current traineddata (in this case the same as step 13)
~/projects/tesseract/training/lstmeval \
--model ~/projects/tesseract/tessdata/eng.traineddata \
--eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
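Two of the steps above are platform workarounds: step 4 swaps the GNU-only --tmpdir option of mktemp for -t, and steps 7 and 8 add PANGOCAIRO_BACKEND=fc on MAC OS X. As a rough sketch of how the same script could run on both systems, the helpers below pass mktemp a full-path template (accepted by both the GNU and BSD versions) and pick the prefix from uname. The names make_font_cache_dir and pango_env_prefix are hypothetical, and this is not necessarily how the upstream fix was written.

```shell
#!/bin/sh
# Create a temporary directory using a full-path template,
# which both GNU mktemp and BSD (macOS) mktemp accept.
# make_font_cache_dir is a hypothetical name, not part of tesstrain_utils.sh.
make_font_cache_dir() {
  mktemp -d "${TMPDIR:-/tmp}/font_tmp.XXXXXXXXXX"
}

# Print the environment prefix for tesstrain.sh.
# It is only printed on macOS, where uname -s reports "Darwin".
pango_env_prefix() {
  if [ "$(uname -s)" = "Darwin" ]; then
    echo "PANGOCAIRO_BACKEND=fc"
  fi
}

FONT_CONFIG_CACHE=$(make_font_cache_dir)
export FONT_CONFIG_CACHE
echo "font cache dir: $FONT_CONFIG_CACHE"
echo "env prefix: $(pango_env_prefix)"
```

With these, a call such as env $(pango_env_prefix) ~/projects/tesseract/training/tesstrain.sh ... would carry the prefix on MAC OS X and nothing extra elsewhere.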
Fine tuning - tesseract 4.0
Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
Steps
1 - Create the necessary folder
mkdir -p ~/tesstutorial/verdana_from_small
2 - Start the fine tuning
~/projects/tesseract/training/lstmtraining \
--model_output ~/tesstutorial/verdana_from_small/verdana \
--continue_from ~/tesstutorial/engoutput/base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 1200
3 - Validate the progress
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/verdana_from_small/verdana_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
4 - Create the necessary folder
mkdir -p ~/tesstutorial/verdana_from_full
5 - Combine the trained data
~/projects/tesseract/training/combine_tessdata \
-e ~/projects/tesseract/tessdata/eng.traineddata \
~/tesstutorial/verdana_from_full/eng.lstm
6 - Train merged data
~/projects/tesseract/training/lstmtraining \
--model_output ~/tesstutorial/verdana_from_full/verdana \
--continue_from ~/tesstutorial/verdana_from_full/eng.lstm \
--traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 400
7 - Validate the results on the main training file
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
--traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
8 - Validate the results on our training file
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
--traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
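In the steps above, every lstmeval call reads the checkpoint that lstmtraining wrote as the --model_output prefix plus _checkpoint (base becomes base_checkpoint, verdana becomes verdana_checkpoint). A small sketch that builds the evaluation command from that prefix may cut down on copy-paste mistakes. lstmeval_cmd and TESS_TRAINING are hypothetical names, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Print an lstmeval command line for a given model_output prefix.
# lstmeval_cmd and TESS_TRAINING are hypothetical names for this sketch.
# Arguments: model_output prefix, traineddata file, eval list file.
lstmeval_cmd() {
  # Folder assumed from this guide; override by setting TESS_TRAINING.
  training_dir="${TESS_TRAINING:-$HOME/projects/tesseract/training}"
  printf '%s/lstmeval --model %s_checkpoint --traineddata %s --eval_listfile %s\n' \
    "$training_dir" "$1" "$2" "$3"
}

# Example: the command from step 3, rebuilt from the prefix used in step 2.
lstmeval_cmd \
  "$HOME/tesstutorial/verdana_from_small/verdana" \
  "$HOME/tesstutorial/engtrain/eng/eng.traineddata" \
  "$HOME/tesstutorial/engeval/eng.training_files.txt"
```

The printed line can be checked by eye and then pasted into the console. The sketch assumes paths without spaces.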
Fine tuning add ± character - tesseract 4.0
Steps
1 - Modify langdata/eng/eng.training_text and include these lines:
alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)
VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert
Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS firm
Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY
Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL
Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF
Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)
Oberflachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €
netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED
2 - Generate the training file
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Times New Roman," \
"Times New Roman, Bold" \
"Times New Roman, Bold Italic" \
"Times New Roman, Italic" \
"Courier New" \
"Courier New Bold" \
"Courier New Bold Italic" \
"Courier New Italic" \
--output_dir ~/tesstutorial/trainplusminus
3 - Generate the eval data
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Verdana" \
--output_dir ~/tesstutorial/evalplusminus
4 - Combine trained data files
~/projects/tesseract/training/combine_tessdata \
-e ~/projects/tesseract/tessdata/eng.traineddata \
~/tesstutorial/trainplusminus/eng.lstm
5 - Fine tuning
~/projects/tesseract/training/lstmtraining \
--model_output ~/tesstutorial/trainplusminus/plusminus \
--continue_from ~/tesstutorial/trainplusminus/eng.lstm \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
--old_traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \
--max_iterations 3600
6 - Test the result on other fonts
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt
7 - Test the result on the main font
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt
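Since the whole point of this section is the new ± character, it can be worth confirming that the lines from step 1 really landed in the training text before spending time on steps 2 to 5. A minimal sketch, assuming langdata was cloned to ~/projects/langdata as in the commands above; count_lines_with is a hypothetical helper name.

```shell
#!/bin/sh
# Count the lines of a file that contain a given literal string.
# count_lines_with is a hypothetical helper name.
count_lines_with() {
  grep -c -F -- "$1" "$2"
}

# Path assumed from the --langdata_dir used in this guide.
TRAINING_TEXT="$HOME/projects/langdata/eng/eng.training_text"
if [ -f "$TRAINING_TEXT" ]; then
  echo "lines containing ±: $(count_lines_with '±' "$TRAINING_TEXT")"
fi
```

The sample in step 1 has 13 lines and each one contains ±, so after pasting all of them the count should be at least 13.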
About this issue
- State: closed
- Created 6 years ago
- Reactions: 13
- Comments: 57 (13 by maintainers)
Commits related to this issue
- fix "mktemp -d --tmpdir" on Mac OS; see #1453 — committed to tesseract-ocr/tesseract by zdenop 6 years ago
- Merge branch 'master' of https://github.com/tesseract-ocr/tesseract * 'master' of https://github.com/tesseract-ocr/tesseract: (27 commits) Rework check for readable input file fix "mktemp -d --tm... — committed to tesseract-ocr/tesseract by zdenop 6 years ago
- Fix installation of training tools for flat training build Builds which were configured with --enable-shared did install the wrong files. Using libtool fixes that. Add also other flags which are use... — committed to tesseract-ocr/tesseract by stweil 4 years ago
Please check your output after running this code:
./configure \
CPPFLAGS=-I/usr/local/opt/icu4c/include \
LDFLAGS=-L/usr/local/opt/icu4c/lib
I came across the same error and the log showed me an issue with icu4c and also asked to install pango.
Once done, run the above code again and hopefully your error will be solved.
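One way to see this kind of problem before reading through the configure log is to ask pkg-config directly. The sketch below is only an illustration: missing_pkgs is a hypothetical helper, and the module names icu-uc, icu-i18n, pango and cairo are my assumption of what the training tools look for, so adjust them to whatever your configure output mentions.

```shell
#!/bin/sh
# Print each pkg-config module from the argument list that cannot be found.
# missing_pkgs is a hypothetical helper name.
# If pkg-config itself is not installed, every module is reported as missing.
missing_pkgs() {
  for mod in "$@"; do
    pkg-config --exists "$mod" 2>/dev/null || echo "$mod"
  done
}

# Module names are an assumption; see the note above.
missing_pkgs icu-uc icu-i18n pango cairo
```

Homebrew installs icu4c as keg-only, so its modules can show up as missing even when it is installed. In that case, exporting PKG_CONFIG_PATH=/usr/local/opt/icu4c/lib/pkgconfig (a path assumed from the prefix used in this guide) before running configure may help.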
@FernandoGOT Thank you very much for such a detailed explanation, but I can’t make it work. When I run “make training” it gives me the error “Need to reconfigure project, so there are no errors”. Also, I couldn’t create ScrollView.jar. Is it possible to update this post? Thank you.
@nnnikolay, I am sorry, that was my fault. It is now fixed with commit 421ebf0418f415c2ca270521243d4edc36dd44bf.
@khalajink Yes, see my answer in that SO thread https://stackoverflow.com/a/57968945/1021819
@ysnnzlcn I’m short on time these days (working too much), but when I get some free time I’m going to write a better step-by-step on how to use tesseract and send a merge request to the docs