tesseract: LSTM: Training - Error msg - Encoding of string failed!

$   training/lstmtraining --model_output ~/tesstutorial/sanskrit2003_from_full/sanskrit2003 \
>   --continue_from ~/tesstutorial/sanskrit2003_from_full/san.lstm \
>   --train_listfile ~/tesstutorial/santrain/san.training_files.txt \
>   --target_error_rate 0.01
Loaded file /home/shree/tesstutorial/sanskrit2003_from_full/sanskrit2003_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/sanskrit2003_from_full/sanskrit2003_checkpoint
Loaded 1746/1746 pages (0-1746) of document /home/shree/tesstutorial/santrain/san.Chandas.exp0.lstmf
Loaded 345/1760 pages (1415-1760) of document /home/shree/tesstutorial/santrain/san.Uttara.exp0.lstmf
Loaded 1814/1814 pages (0-1814) of document /home/shree/tesstutorial/santrain/san.Gargi.exp0.lstmf
Found AVX
Found SSE
At iteration 1808/17200/17229, Mean rms=0.336%, delta=0.129%, char train=0.41%, word train=1.751%, skip ratio=0.2%,  New worst char error = 0.41 wrote checkpoint.

Encoding of string failed! Failure bytes: ffffffc2 ffffffa3 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5
Can't encode transcription: व्यतर्कि १४. भवति ३७॥ £ सर्व्व
At iteration 1818/17300/17330, Mean rms=0.334%, delta=0.13%, char train=0.404%, word train=1.632%, skip ratio=0.3%,  wrote checkpoint.
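
The "Failure bytes" dump can be turned back into text to see where the problem starts. The following is a minimal Python sketch, not part of Tesseract; the function name decode_failure_bytes is made up for illustration. It assumes that each token in the dump is one byte of UTF-8 printed as a sign-extended hexadecimal integer, which is what the ffffff prefixes suggest.

```python
def decode_failure_bytes(dump: str) -> str:
    """Turn a 'Failure bytes:' hex dump into the text it encodes."""
    # Assumption: each token is one byte printed as a sign-extended int
    # (e.g. ffffffc2), so only the low 8 bits of each value are kept.
    raw = bytes(int(tok, 16) & 0xFF for tok in dump.split())
    # errors="replace" tolerates a dump that was cut off mid-character.
    return raw.decode("utf-8", errors="replace")


# The byte dump from the log above, with the wrapped token rejoined.
dump = (
    "ffffffc2 ffffffa3 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 "
    "ffffffb0 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5 "
    "ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5"
)
print(decode_failure_bytes(dump))
```

For this log the decoded text is the tail of the transcription beginning at the £ sign, so £ is a likely candidate for the unit that could not be encoded.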


About this issue

  • State: open
  • Created 8 years ago
  • Comments: 37 (10 by maintainers)

Most upvoted comments

@harinath141 If you are getting a lot of these errors during fine-tuning, try replace-top-layer training instead. You can reuse the box/tiff pairs generated for fine-tuning. The commands will be similar to the following:

mkdir -p ~/tesstutorial/tellayer_from_tel 

combine_tessdata -e ../tessdata/tel.traineddata \
  ~/tesstutorial/tellayer_from_tel/tel.lstm
  
lstmtraining -U ~/tesstutorial/tel/tel.unicharset \
  --script_dir ../langdata  --debug_interval 0 \
  --continue_from ~/tesstutorial/tellayer_from_tel/tel.lstm \
  --append_index 5 --net_spec '[Lfx256 O1c105]' \
  --model_output ~/tesstutorial/tellayer_from_tel/tellayer \
  --train_listfile ~/tesstutorial/tel/tel.training_files.txt \
  --target_error_rate 0.01

Same problem as I had mentioned in one of my earlier comments -

While each individual Unicode character (स ा ँ ) is present in the Devanagari unicharset, the combined aksharas (साँ, छँ) are not.

No answer from @theraysmith yet… He has also marked this as a closed issue.

As per @theraysmith

  • There is an un-represented Indic grapheme/aksara in the text. In any case it will result in that training image being ignored by the trainer. If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.
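
One way to check whether particular characters or aksharas have their own entry in a unicharset is sketched below in Python. This is an illustrative sketch only: the helper names load_unichars and missing_units are hypothetical, and it assumes the unicharset is a text file whose first line holds the entry count and whose remaining lines each begin with the unichar followed by a space. It tests for exact entries only and does not reproduce how Tesseract itself segments and encodes a transcription.

```python
def load_unichars(lines):
    """Collect the unichar strings from the lines of a unicharset file."""
    entries = set()
    for line in lines[1:]:  # assumed: the first line holds only the entry count
        line = line.rstrip("\n")
        if line:
            # assumed: the unichar is the first space-delimited field
            entries.add(line.split(" ", 1)[0])
    return entries


def missing_units(units, unichars):
    """Return the units that have no entry of their own, in input order."""
    return [u for u in units if u not in unichars]


# Hypothetical, heavily shortened unicharset content, for illustration only.
# Real files carry more property fields after the unichar on each line.
sample_lines = [
    "4",
    "NULL 0",
    "स 0",
    "ा 0",
    "ँ 0",
]

unichars = load_unichars(sample_lines)
print(missing_units(["स", "ा", "ँ", "साँ", "£"], unichars))
```

With a real file, the lines could be read with open(path, encoding="utf-8").read().splitlines(), where path points at the unicharset used for training.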

@zc813

tesstrain.sh has a limit of max_pages 3; you should change that so that the complete training_text is used.

You can review the training_text to check that it is a correct representation of bod (Tibetan).

Also test OCR with the ‘Tibetan’ script traineddata from both the ‘tessdata_best’ and ‘tessdata_fast’ repos.

An authoritative answer can only be provided by @theraysmith.