tesseract: Can't encode transcription
I'm unable to fine-tune the Arabic model for the font ‘Andalus’. I get this error:
Encoding of string failed! Failure bytes: 26 26
Can't encode transcription: 'و ىدتنملا ىدتنم الإ دق عيضاوملا ؟؟ عيقوتلا ليجستلا &&' in language ''
Encoding of string failed! Failure bytes: 3d 3d 20 ffffffd9 ffffff89 ffffffd9 ffffff81 20 ffffffd9 ffffff88 ffffffd8 ffffffa3 20 ffffffd9 ffffff84 ffffffd8 ffffffa8 ffffffd9 ffffff82 20 ffffffd9 ffffff89 ffffffd8 ffffffaf ffffffd8 ffffffaa ffffffd9 ffffff86 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff86 ffffffd9 ffffff85 20 ffffffd9 ffffff86 ffffffd9 ffffff88 ffffffd9 ffffff83 ffffffd8 ffffffaa 20 ffffffd8 ffffffa9 ffffffd8 ffffffad ffffffd9 ffffff81 ffffffd8 ffffffb5 ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd8 ffffffa9 ffffffd9 ffffff83 ffffffd8 ffffffb1 ffffffd8 ffffffa7 ffffffd8 ffffffb4 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7
Please note that the line causing the error is the second-to-last line of the ara.training_text file, which contains:
&& التسجيل التوقيع ؟؟ المواضيع قد إلا منتدى المنتدى و
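The "Failure bytes" in the log can be turned back into text to see where encoding stopped. The sketch below is only an illustration: `decode_failure_bytes` is a hypothetical helper, not part of Tesseract, and it assumes the `ffffff` prefix is merely sign extension of bytes at or above 0x80. Decoding the start of each list suggests that `&` and `=` are the characters the model cannot encode.

```shell
# Decode a "Failure bytes" list from the training log back into text.
# decode_failure_bytes is a hypothetical helper name, not a Tesseract tool.
decode_failure_bytes() {
  for b in "$@"; do
    # Bytes >= 0x80 appear sign-extended (e.g. ffffffd9); keep only the byte.
    hex=${b#ffffff}
    # Convert the hex byte to an octal escape, which POSIX printf understands.
    printf "\\$(printf '%03o' "0x$hex")"
  done
  printf '\n'
}

decode_failure_bytes 26 26   # prints: &&
decode_failure_bytes 3d 3d   # prints: ==
```

Passing the full byte list from the second error line should reproduce the remainder of that training line, starting at the `==`.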
I’m using langdata_lstm for generating my training data and the ara.traineddata to continue from.
generating data:
../tesseract/src/training/tesstrain.sh --fonts_dir fonts/win7df \
--fontlist 'Andalus' \
--lang ara \
--linedata_only \
--langdata_dir ../langdata_lstm \
--tessdata_dir ../tesseract/tessdata \
--save_box_tiff \
--maxpages 10 \
--output_dir train
extracting old lstm:
combine_tessdata -e ../tesseract/tessdata/ara.traineddata ara.lstm
fine-tuning:
rm -rf output/*
OMP_THREAD_LIMIT=8 lstmtraining \
--continue_from ara.lstm \
--model_output output/araNewModel \
--traineddata ../tesseract/tessdata/ara.traineddata \
--train_listfile train/ara.training_files.txt \
--max_iterations 400
I checked the generated training data and everything seems fine: the tiff files include all the training_text lines, including the line causing the error. I also tried generating training data and fine-tuning for other fonts such as ‘Arial’ and ‘Tahoma’, but I still get the same error.
I was thinking about removing the offending line from the training_text file, but I don’t know whether that is safe. Besides, 80 lines seems very little for training an Arabic model, doesn’t it? If I decide to train on more lines of data, what should I do, and which files would be affected?
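Before deleting lines, it may help to list which characters of the training text have no entry in the model's unicharset. The following is a minimal sketch, not an official tool: `missing_chars` is a hypothetical helper, it assumes the unicharset has already been unpacked from the traineddata into a plain file (for example with `combine_tessdata -u`), and it assumes a UTF-8 locale so that `grep -o` splits the text into characters rather than bytes. It only compares single characters, so entries made of several code points are not accounted for.

```shell
# Print each distinct non-space character of a text file that has no entry
# of its own in a unicharset file. missing_chars is a hypothetical helper.
# $1: training text
# $2: unicharset (first line is the entry count; each later line starts
#     with the unichar, followed by its properties)
missing_chars() {
  grep -o '[^[:space:]]' "$1" | sort -u |
  while IFS= read -r ch; do
    # Compare against the first field of every entry, skipping the count line.
    if ! tail -n +2 "$2" | cut -d ' ' -f 1 | grep -qxF -e "$ch"; then
      printf '%s\n' "$ch"
    fi
  done
}

# Example call (both file names are placeholders):
# missing_chars ara.training_text ara.lstm-unicharset
```

If the only characters reported are ones such as `&` or `=`, the choice is between dropping them from the text and training with a unicharset that contains them.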
Regards
About this issue
- State: closed
- Created 5 years ago
- Comments: 24 (7 by maintainers)
@peterbence3 The unicharset is extracted from the training_text file. You don’t need so many steps.
I used the following script:
@peterbence3 To extend the training_text for fine-tuning, you can try to reverse-engineer the files from the traineddata.
@Shreeshrii thanks a lot, it is now working fine. The steps I followed:
1. Clone tesseract and follow the installation instructions to build and install it from source.
2. Clone langdata_lstm.
3. Get tessdata_best/ara.traineddata and place it in tesseract/tessdata inside the project cloned in step 1.
4. Edit langdata_lstm/ara/ara.training_text by adding a few more lines.
5. Download the set of Arabic fonts that I need to fine-tune for (place them in any folder).
6. Generate the training data as follows:
This generates new training data, including ara.training_files.txt plus a folder named ‘ara’ that contains a starter ara.traineddata (not actually trained yet) containing the unicharset file that tesstrain.sh generated automatically for the custom ara.training_text.
7. Extract ara.lstm from the ara.traineddata file to continue training from it later:
combine_tessdata -e ../tesseract/tessdata/ara.traineddata ara.lstm
8. Now everything is ready; run the fine-tuning like this:
9. Enjoy, with no encoding errors.
Thanks all
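The exact command used in step 8 is not shown. Presumably the key difference from the original attempt is that `--traineddata` points at the starter traineddata produced in step 6, while `--old_traineddata` names the original model so that lstmtraining can map between the old and new character sets. A minimal sketch under that assumption follows; the paths are taken from the commands earlier in this thread, `train/ara/ara.traineddata` is assumed from the description of step 6, and `finetune_cmd` is a hypothetical helper that only prints the command so it can be reviewed first.

```shell
# Print a fine-tuning command that pairs the starter traineddata from
# tesstrain.sh (new unicharset) with the original model (old unicharset).
# finetune_cmd is a hypothetical helper; all paths are assumptions based on
# the layout used earlier in this thread.
finetune_cmd() {
  printf '%s ' \
    lstmtraining \
    --continue_from ara.lstm \
    --old_traineddata ../tesseract/tessdata/ara.traineddata \
    --traineddata train/ara/ara.traineddata \
    --train_listfile train/ara.training_files.txt \
    --model_output output/araNewModel \
    --max_iterations 400
  printf '\n'
}

finetune_cmd                  # review the command
# eval "$(finetune_cmd)"      # run it once lstmtraining is installed
```

Keeping the two traineddata files separate is what lets the added characters be encoded without retraining from scratch.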
It is generated by tesstrain.sh - see https://github.com/tesseract-ocr/tesseract/blob/master/src/training/tesstrain_utils.sh#L346