tesstrain: training fails for the Persian language with a new font

Dear All,

I am trying to train Tesseract with a new font ("B Nazanin", attached to this issue). My steps are listed below. I am using the langdata_lstm Git repository, and my tessdata comes from tessdata_best. For fas.config I used the attached file, which is the same as the Arabic one: Arabic and Persian share the same structure and have similar, though not identical, letters and words.

However, the fas.traineddata from tessdata_best is not valid for me, so I am using the apt-installed file from my /usr/share/tesseract-ocr/5/tessdata directory instead. That file works fine.

With the fas.training_text from the langdata_lstm repository, running tesstrain.py gives this error:

[22:09:35] INFO - Log file location: /tmp/fas-2022-01-011bwkauqw/tesstrain.log
[22:09:35] INFO - === Starting training for language fas
[22:09:35] INFO - Testing font: B Nazanin
[22:09:37] INFO - === Phase I: Generating training images ===
  0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s][22:09:37] INFO - Rendering using B Nazanin
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.36s/it]
[22:09:48] INFO - === Phase UP: Generating unicharset and unichar properties files ===
[22:09:48] INFO - === Phase E: Generating lstmf files ===
[22:09:48] INFO - Using fas.config
[22:09:48] INFO - Using TESSDATA_PREFIX=tesseract/tessdata
  0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s][22:09:49] ERROR - Page 1
Failed to read boxes from /tmp/fas-2022-01-011bwkauqw/fas.B_Nazanin.exp0.tif
Error during processing.

[22:09:49] CRITICAL - Program /usr/bin/tesseract failed with return code 1. Abort.
  0%|                                                                                                                                                                                 | 0/1 [00:01<?, ?it/s]
Temporary files retained at: /tmp/fas-2022-01-011bwkauqw
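
The "Failed to read boxes" message is typically printed when Tesseract cannot find usable box data for the rendered image. As a minimal sketch, assuming that every .tif in the retained temporary directory should have a sibling .box file with the same stem and that a missing or empty box file is the thing to look for, the directory could be checked like this. The function name find_missing_boxes is hypothetical and not part of tesstrain.

```python
import sys
from pathlib import Path


def find_missing_boxes(work_dir):
    """Return names of .tif files whose sibling .box file is missing or empty."""
    problems = []
    for tif in sorted(Path(work_dir).glob("*.tif")):
        # fas.B_Nazanin.exp0.tif -> fas.B_Nazanin.exp0.box
        box = tif.with_suffix(".box")
        # A box file that is absent or has zero bytes cannot supply any boxes.
        if not box.is_file() or box.stat().st_size == 0:
            problems.append(tif.name)
    return problems


# Only runs when a directory is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    for name in find_missing_boxes(sys.argv[1]):
        print("no usable box file for", name)
```

Running it against the directory named in the "Temporary files retained at" line would show whether the rendering phase produced box data at all. It does not explain why a box file is empty.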

If I replace fas.training_text with the attached file, the first step passes. In the evaluation (second step), however, I get the errors "Can't encode transcription:" and "Encoding of string failed! Failure bytes:" for almost all texts:

fas.lstm is not a recognition model, trying training checkpoint...
Loaded 406/406 lines (1-406) of document train/fas.B_Nazanin.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Encoding of string failed! Failure bytes: d9 81 d9 82 d9 88 d8 aa d9 85 20 d8 a7 d8 b1 20 d8 b3 d9 84 d8 b7 d8 a7 20 d8 b3 d9 88 d9 86 d8 a7 db 8c d9 82 d8 a7 20 d8 b2 d8 a7 d8 b1 d9 81 20 d8 b1 d8 a8 20 d8 af d9 88 d8 ae 20 db 8c d8 a7 d9 87 d8 b2 d8 a7 d9 88 d8 b1 d9 be 20 da a9 db 8c d8 aa d9 86 d8 a7 d9 84 d8 aa d8 a2 20 d9 86 db 8c d8 ac d8 b1 db 8c d9 88 20 d9 88 20 d8 b2 db 8c d9 88 d8 b1 db 8c d8 a7 20 d8 b4 db 8c d8 aa db 8c d8 b1 d8 a8 20 d8 8c d8 b3 d9 86 d8 a7 d8 b1 d9 81 d8 b1 db 8c d8 a7 20 d8 af d9 86 d9 86 d8 a7 d9 85 20 db 8c d9 84 d9 84 d9 85 d9 84 d8 a7 20 d9 86 db 8c d8 a8 20 db 8c db 8c d8 a7 d9 85 db 8c d9 be d8 a7 d9 88 d9 87 20 db 8c d8 a7 d9 87 d8 aa da a9 d8 b1 d8 b4
Can't encode transcription: 'فقوتم ار سلطا سونایقا زارف رب دوخ یاهزاورپ کیتنالتآ نیجریو و زیوریا شیتیرب ،سنارفریا دننام یللملا نیب ییامیپاوه یاهتکرش' in language ''
Encoding of string failed! Failure bytes: d8 b2 d8 a7 20 db 8c d8 b1 d8 a7 db 8c d8 b3 d8 a8 20 d9 88 20 d8 aa d8 b3 d8 a7 20 d8 af d9 88 d8 ac d9 88 d9 85 20 d8 b9 d8 b6 d9 88 20 d8 b1 d8 a8 d8 a7 d8 b1 d8 a8 20 d9 88 d8 af 20 d8 b1 d9 88 d8 b4 da a9 20 d8 b1 d8 af 20 db 8c d8 aa d8 a7 db 8c d9 84 d8 a7 d9 85 20 d8 aa db 8c d9 81 d8 b1 d8 b8 20 d9 87 da a9 20 d8 af db 8c d9 88 da af 20 db 8c d9 85 20 db 8c d9 86 db 8c d8 a8 d9 85 d9 85 20 db 8c d8 a7 d9 82 d8 a2 2e d8 af d9 86 da a9 20 d9 85 da a9 20 d8 aa d9 84 d9 88 d8 af 20 db 8c d9 85 d9 88 d9 85 d8 b9 20 d9 87 d8 ac d8 af d9 88 d8 a8 20 d8 b1 d8 af 20 d8 a7 d8 b1
Can't encode transcription: 'لغاشم زا یرایسب و تسا دوجوم عضو ربارب ود روشک رد یتایلام تیفرظ هک دیوگ یم ینیبمم یاقآ.دنک مک تلود یمومع هجدوب رد ار' in language ''
Encoding of string failed! Failure bytes: 2e d8 af db 8c d8 b3 d8 b1 20 d8 af d9 87 d8 a7 d9 88 d8 ae 20 d8 a7 da a9 db 8c d8 b1 d9 85 d8 a2 20 d8 b1 da af db 8c d8 af 20 d8 aa d9 84 d8 a7 db 8c d8 a7 20 d9 87 d8 af d8 b2 d8 a7 d9 88 d8 af 20 d9 87 d8 a8 20 d8 8c 20 d9 87 d8 af d9 86 db 8c d8 a2 20 d8 aa d8 b9 d8 a7 d8 b3 20 db b3 db b6 20 d8 a7 d8 aa 20 db b2 db b4 20 d9 81 d8 b1 d8 b8 20 db 8c d8 af d9 86 d8 b3 20 d9 86 d8 a7 d9 81 d9 88 d8 aa 20 d8 8c d9 86 d8 a7 d8 b3 d8 a7 d9 86 d8 b4 d8 b1 d8 a7 da a9 20 db 8c d9 86 db 8c d8 a8 20 d8 b4 db 8c d9 be 20 d8 b3 d8 a7 d8 b3 d8 a7 d8 b1 d8 a8 2e d8 af db 8c d8 b3 d8 b1
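
The "Failure bytes" dumps are UTF-8 in hexadecimal, so they can be decoded to see exactly which characters are involved. The following is a small sketch using only the Python standard library. The helper names decode_failure_bytes and describe_chars are made up for illustration, and the sample is the first word of the first failing line above.

```python
import unicodedata


def decode_failure_bytes(hex_dump):
    """Turn a space-separated hex dump (as printed in the log) into text."""
    return bytes.fromhex(hex_dump).decode("utf-8")


def describe_chars(text):
    """Return (code point, Unicode name) pairs for each distinct character."""
    seen = []
    for ch in text:
        if ch not in seen:
            seen.append(ch)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in seen]


sample = "d9 81 d9 82 d9 88 d8 aa d9 85"  # first word of the first failing line
for code, name in describe_chars(decode_failure_bytes(sample)):
    print(code, name)
```

Listing the distinct code points this way makes it easier to compare them with the characters the model is supposed to know. If a dump were cut off in the middle of a character, decoding would raise an error rather than return partial text.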

My first step:

rm -rf train/*
../tesstrain/src/training/tesstrain.py --fonts_dir fonts \
        --fontlist 'B Nazanin' \
        --ptsize 20 \
        --lang fas \
        --linedata_only \
        --langdata_dir langdata_lstm \
        --tessdata_dir tesseract/tessdata \
        --save_box_tiff \
        --maxpages 10 \
        --output_dir train

I also tried different font sizes with the script above.

Second step:

lstmeval --model fas.lstm \
        --traineddata tesseract/tessdata/fas.traineddata \
        --eval_listfile train/fas.training_files.txt

For this step, the LSTM model first has to be extracted from the best traineddata file:

combine_tessdata -e tesseract/tessdata/fas.traineddata fas.lstm

As described above, extracting the LSTM model fails with the traineddata from the tessdata_best repository, so I used the installed version instead.

Returned result:

Extracting tessdata components from tesseract/tessdata/fas.traineddata
Wrote fas.lstm
Version:5.0.0
17:lstm:size=2965531, offset=192
21:lstm-unicharset:size=1978, offset=2965723
22:lstm-recoder:size=301, offset=2967701
23:version:size=5, offset=2968002
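
The listing shows that the traineddata carries its own lstm-unicharset. "Can't encode transcription" is generally reported when the ground-truth text cannot be expressed with that unicharset, so one possible check is to compare the characters of the training text against it. The sketch below assumes the component has been unpacked to a text file (for example with combine_tessdata -u) and assumes the usual layout: an entry count on the first line, one entry per line starting with the unichar, and the space character stored as the token NULL. The function names and the sample lines are invented for illustration.

```python
def parse_unicharset(lines):
    """Collect unichar strings from the lines of a unicharset file.

    Assumed layout: the first line is the entry count, every later line
    starts with the unichar followed by whitespace-separated properties,
    and the space character is written as the literal token NULL.
    """
    unichars = set()
    for line in list(lines)[1:]:
        fields = line.split()
        if not fields:
            continue
        unichars.add(" " if fields[0] == "NULL" else fields[0])
    return unichars


def uncovered_chars(text, unichars):
    """Return sorted characters of text that match no unicharset entry exactly."""
    return sorted({ch for ch in text if ch != "\n" and ch not in unichars})


# Invented sample lines, not the contents of the real fas unicharset.
demo = ["3", "NULL 0", "\u0627 1", "\u0628 1"]
known = parse_unicharset(demo)
print([f"U+{ord(c):04X}" for c in uncovered_chars("\u0627\u0628 \u067e", known)])
```

This is a per-character comparison only. Entries that consist of several code points are not matched against the text, so a reported character is a hint for further inspection rather than proof of the cause.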

Next, I tried to fine-tune the model, but this step also returned the "Can't encode transcription" and "Encoding of string failed! Failure bytes" errors for all texts:

rm -rf output/*
OMP_THREAD_LIMIT=16 lstmtraining \
        --continue_from fas.lstm \
        --model_output output/moh \
        --traineddata tesseract/tessdata/fas.traineddata \
        --train_listfile train/fas.training_files.txt \
        --max_iterations 1000

Attached files:
1- TTF font file
2- fas.config
3- fas.training_text (a sample that works with the script; the training_text from langdata_lstm returned the error in the first step)

Is there a solution?

IssueAttachments.zip

About this issue

State: open
Created 2 years ago
Reactions: 1
Comments: 22

Most upvoted comments

Still waiting for a response.

mohsenomidi on May 22, 2023