tesseract: Can't encode transcription

I am unable to fine-tune the Arabic model for the font ‘Andalus’; I get this error:

Encoding of string failed! Failure bytes: 26 26
Can't encode transcription: 'و ىدتنملا ىدتنم الإ دق عيضاوملا ؟؟ عيقوتلا ليجستلا &&' in language ''
Encoding of string failed! Failure bytes: 3d 3d 20 ffffffd9 ffffff89 ffffffd9 ffffff81 20 ffffffd9 ffffff88 ffffffd8 ffffffa3 20 ffffffd9 ffffff84 ffffffd8 ffffffa8 ffffffd9 ffffff82 20 ffffffd9 ffffff89 ffffffd8 ffffffaf ffffffd8 ffffffaa ffffffd9 ffffff86 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff86 ffffffd9 ffffff85 20 ffffffd9 ffffff86 ffffffd9 ffffff88 ffffffd9 ffffff83 ffffffd8 ffffffaa 20 ffffffd8 ffffffa9 ffffffd8 ffffffad ffffffd9 ffffff81 ffffffd8 ffffffb5 ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd8 ffffffa9 ffffffd9 ffffff83 ffffffd8 ffffffb1 ffffffd8 ffffffa7 ffffffd8 ffffffb4 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7

Note that the line causing the error is the second-to-last line of the ara.training_text file, which contains: && التسجيل التوقيع ؟؟ المواضيع قد إلا منتدى المنتدى و
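The "Failure bytes" lists can be decoded to see which characters could not be encoded. Below is a minimal sketch, assuming bash; `decode_failure_bytes` is a hypothetical helper name, not part of Tesseract. Bytes at or above 0x80 are printed sign-extended in the log (for example ffffffd9), so the sketch strips that prefix before converting.

```shell
#!/bin/bash
# Hypothetical helper (not a Tesseract tool): turn the hex bytes from an
# "Encoding of string failed! Failure bytes: ..." message back into text.
decode_failure_bytes() {
  local b
  for b in "$@"; do
    b=${b#ffffff}          # drop the sign-extension prefix, if present
    printf '%b' "\\x$b"    # emit the raw byte for this hex value
  done
  printf '\n'
}

decode_failure_bytes 26 26   # prints: &&
decode_failure_bytes 3d 3d   # prints: ==
```

The first message decodes to `&&` and the second begins with `==`, which suggests (as an assumption, not something confirmed by the log) that these symbols are missing from the unicharset of the traineddata passed via --traineddata.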

I’m using langdata_lstm to generate my training data, and ara.traineddata as the model to continue from.

generating data:

../tesseract/src/training/tesstrain.sh --fonts_dir fonts/win7df \
	     --fontlist 'Andalus' \
	     --lang ara \
	     --linedata_only \
	     --langdata_dir ../langdata_lstm \
	     --tessdata_dir ../tesseract/tessdata \
	     --save_box_tiff \
	     --maxpages 10 \
	     --output_dir train

extracting old lstm: combine_tessdata -e ../tesseract/tessdata/ara.traineddata ara.lstm

fine-tuning:

rm -rf output/*
OMP_THREAD_LIMIT=8 lstmtraining \
	--continue_from ara.lstm \
	--model_output output/araNewModel \
	--traineddata ../tesseract/tessdata/ara.traineddata \
	--train_listfile train/ara.training_files.txt \
	--max_iterations 400

I checked the generated training data and everything seems fine: the TIFF files include all the training_text lines, including the line causing the error. I also tried generating training data and fine-tuning for other fonts such as ‘Arial’ and ‘Tahoma’, but I still get the same error.

I was thinking about removing the offending line from the training_text file, but I don’t know whether that is safe. Besides, 80 lines seems very little for training an Arabic model, doesn’t it? If I decide to train on more lines of data, what should I do, and which files would be affected?

Regards

About this issue

State: closed
Created 5 years ago
Comments: 24 (7 by maintainers)

Most upvoted comments

@peterbence3 The unicharset is extracted from the training_text file. You don’t need so many steps.

I used the following script:

#!/bin/bash

time ~/tesseract/src/training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --lang ara --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata \
  --tessdata_dir ~/tessdata \
  --fontlist "Andalus" \
  --training_text ~/langdata/ara/ara.training_text \
  --workspace_dir ~/tmp/ \
  --save_box_tiff \
  --output_dir ~/tesstutorial/araeval
  
echo -e "\n ****** Finetune one of the fully-trained existing models: ***********"

mkdir -p ~/tesstutorial/ara_from_full

combine_tessdata -e ~/tessdata_best/ara.traineddata \
  ~/tesstutorial/ara_from_full/ara.lstm
  
lstmtraining \
  --model_output ~/tesstutorial/ara_from_full/PLUS \
   --continue_from ~/tesstutorial/ara_from_full/ara.lstm \
   --traineddata ~/tesstutorial/araeval/ara/ara.traineddata \
   --old_traineddata ~/tessdata_best/ara.traineddata \
   --train_listfile ~/tesstutorial/araeval/ara.training_files.txt \
   --debug_interval -1 \
   --max_iterations 3600
  
echo -e "\n****************************  ******\n"

lstmeval \
  --model ~/tessdata_best/ara.traineddata \
  --eval_listfile ~/tesstutorial/araeval/ara.training_files.txt
  
echo -e "\n****************************  ******\n"

lstmeval \
  --model ~/tesstutorial/ara_from_full/PLUS_checkpoint \
   --traineddata ~/tesstutorial/araeval/ara/ara.traineddata \
  --eval_listfile ~/tesstutorial/araeval/ara.training_files.txt

echo -e "\n****************************  ******\n"

time lstmtraining \
  --stop_training \
  --continue_from ~/tesstutorial/ara_from_full/PLUS_checkpoint \
  --traineddata ~/tessdata_best/ara.traineddata \
  --model_output ~/tesstutorial/ara_from_full/ara.Andalus.PLUS.traineddata

Shreeshrii on Oct 8, 2019

@peterbence3 To extend the training_text for fine-tuning, you can try to reverse-engineer the files from the traineddata.

# unpack best traineddata file
combine_tessdata  -u ~/tessdata_best/ara.traineddata  ara.

# create wordlist from dawg file - use files extracted from best traineddata
dawg2wordlist ara.lstm-unicharset ara.lstm-word-dawg  ara.lstm-wordlist

# copy wordlist and word-bigrams from langdata/ara
cp ~/langdata/ara/ara.wordlist ./
cp ~/langdata/ara/ara.word.bigrams ./

# concatenate various wordlists, shuffle and convert to text lines
cat ara.wordlist ara.word.bigrams ara.lstm-wordlist | sort | uniq > ara.lstm-wordlist-sorted
shuf  ara.lstm-wordlist-sorted > ara.lstm-wordlist-shuffled
par 150l < ara.lstm-wordlist-shuffled > ara.lstm-wordlist-lines

# concatenate with existing training text
cat ~/langdata/ara/ara.training_text ara.lstm-wordlist-lines > ara.extended.training_text
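
If `par` is not installed, the wrapping step can be approximated with coreutils `fmt`. This is a sketch under assumptions, not the method used above: `wordlist_to_lines` is a hypothetical helper name, and GNU `fmt` may count bytes rather than characters, so Arabic lines can come out shorter than the requested width.

```shell
#!/bin/bash
# Hypothetical helper: join a one-word-per-line list into wrapped text lines.
# Stands in for `par 150l`; the default width of 150 mirrors that command.
wordlist_to_lines() {
  local wordlist=$1
  local width=${2:-150}
  fmt -w "$width" "$wordlist"
}

# Usage, following the file names above:
#   wordlist_to_lines ara.lstm-wordlist-shuffled > ara.lstm-wordlist-lines
```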

Shreeshrii on Oct 14, 2019

@Shreeshrii Thanks a lot, it’s working fine now. The steps I followed:

1. Clone tesseract and follow the installation instructions to build and install it from source.

2. Clone langdata_lstm.

3. Get tessdata_best/ara.traineddata and place it in tesseract/tessdata inside the project cloned in step 1.

4. Edit langdata_lstm/ara/ara.training_text, adding a few more lines.

5. Download the set of Arabic fonts you want to fine-tune for (place them in any folder).

6. Generate the training data files as follows:

tesstrain.sh --fonts_dir fonts/win7df \
	     --fontlist 'Courier New' 'Segoe UI' 'Tahoma' 'Times New Roman' 'Arial' 'Andalus' 'Microsoft Sans Serif' 'Adobe Arabic' \
	     --lang ara \
	     --linedata_only \
	     --langdata_dir ../langdata_lstm \
	     --tessdata_dir ../tesseract/tessdata \
	     --save_box_tiff \
	     --maxpages 10 \
	     --output_dir train

This generates new training data, including ara.training_files.txt and a folder named ‘ara’ containing a starter ara.traineddata (not yet trained) that holds the unicharset automatically generated by tesstrain.sh from your custom ara.training_text.

7. Extract ara.lstm from the original ara.traineddata file, to continue training from it later: combine_tessdata -e ../tesseract/tessdata/ara.traineddata ara.lstm

8. Now everything is ready; run the fine-tuning like this:

OMP_THREAD_LIMIT=8 lstmtraining \
	--continue_from starter/old_extracted/ara.lstm \
	--model_output output/araNewModel \
	--old_traineddata starter/old_traineddata/ara.traineddata \
	--traineddata train/ara/ara.traineddata \
	--train_listfile train/ara.training_files.txt \
	--max_iterations 1000

9. enjoy with no encoding errors
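
Before step 8, it may help to confirm that the starter traineddata from step 6 really covers the characters that failed earlier. The sketch below assumes the unicharset layout of a count line followed by one entry per line whose first space-separated field is the character; `missing_from_unicharset` and the `unpacked/` directory are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print each given string that has no entry in a
# unicharset file. The first line (the entry count) is skipped.
missing_from_unicharset() {
  local unicharset=$1
  shift
  local s
  for s in "$@"; do
    # Compare against the first field of every entry, as a fixed whole-line match.
    if ! tail -n +2 "$unicharset" | cut -d' ' -f1 | grep -qxF -- "$s"; then
      printf '%s\n' "$s"
    fi
  done
}

# Usage (unpacked/ is an assumed scratch directory):
#   mkdir -p unpacked
#   combine_tessdata -u train/ara/ara.traineddata unpacked/ara.
#   missing_from_unicharset unpacked/ara.lstm-unicharset '&' '='
```

If the function prints nothing, every queried character has an entry; any character it prints would presumably still trigger "Can't encode transcription" for lines that contain it.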

Thanks all

peterbence3 on Oct 8, 2019

It is generated by tesstrain.sh - see https://github.com/tesseract-ocr/tesseract/blob/master/src/training/tesstrain_utils.sh#L346

Shreeshrii on Oct 8, 2019