tesseract: Can't encode transcription -- Arabic numbers

I’m trying to fine-tune Tesseract and get this error when the sentence includes Arabic numbers:

train-ocr | Encoding of string failed! Failure bytes: d9 a4 d9 a4 d9 a0 20 d9 86 d8 a7 d8 b6 d9 85 d8 b1 20 d9 a0 d9 a2 20 d8 a9 d9 8a d9 84 d9 8a d9 82 d9 84 d9 82 20 d9 a2 d9 a1 d9 a1 d9 a0 d9 a5 d9 a8 d9 a2 20 d9 a0 d9 a0 d9 a1 d9 a4 20 d9 85 20 d8 aa d9 a3 d9 a4 20 d9 a0 d9 a9 20 d9 89 d9 87 d8 aa d9 86 d8 a7 20 d8 a7 d9 8a d8 a8 d9 8a d9 84 20 d9 8a d9 81 20 d9 8a d9 84 d8 a7 d8 ad d9 84 d8 a7 20 d8 b9 d8 b6 d9 88 d9 84 d8 a7 20 d9 86 d9 8a d8 aa d9 88 d8 a8 20 d8 b1 d9 8a d9 85 d9 8a d8 af d8 a7 d9 84 d9 81 20 d9 8a d8 b3 d9 88 d8 b1 d9 84 d8 a7 20 d8 b3 d9 8a d8 a6 d8 b1 d9 84 d8 a7 20 d8 b9 d9 85 20 d9 8a d9 81 d8 aa d8 a7 d9 87 20 d9 84 d8 a7 d8 b5 d8 aa d8 a7 20 d9 8a d9 81 20 d8 b3 d9 85 d8 a7 20 d9 84 d9 83 d8 b1 d9 8a d9 85 20 d8 a7 d9 84 d9 8a d8 ac d9 86 d8 a7 20 d8 a9 d9 8a d9 86 d8 a7 d9 85 d9 84 d8 a7 d9 84 d8 a7 20 d8 a9 d8 b1 d8 a7 d8 b4 d8 aa d8 b3 d9 85 d9 84 d8 a7 20 d8 a9 d8 b4 d9 82 d8 a7 d9 86 d9 85 d8 a8 20 d8 a7 d9 85 d8 a7 d9 85 d8 aa d9 87 d8 a7 20 d9 81 d8 ad d8 b5 d9 84 d8 a7
train-ocr | Can't encode transcription: 'ـه ١٤٤٠ ناضمر ٠٢ ةيليقلق ٢١١٠٥٨٢ ٠٠١٤ م ت٣٤ ٠٩ ىهتنا ايبيل يف يلاحلا عضولا نيتوب ريميدالف يسورلا سيئرلا عم يفتاه لاصتا يف سما لكريم اليجنا ةيناملالا ةراشتسملا ةشقانمب امامتها فحصلا' in language ''
train-ocr | Encoding of string failed! Failure bytes: d9 a7
train-ocr | Can't encode transcription: '٧' in language ''
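The "Failure bytes" in the log are plain UTF-8, so decoding them shows which characters the recognizer could not encode. A minimal sketch using only the Python standard library; the helper name `describe_failure_bytes` is made up for illustration:

```python
import unicodedata


def describe_failure_bytes(hex_bytes: str) -> list:
    """Decode a space-separated hex dump as UTF-8 and name each character."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# The short failure from the log above: "d9 a7".
for ch, codepoint, name in describe_failure_bytes("d9 a7"):
    print(ch, codepoint, name)
```

For `d9 a7` this reports U+0667, ARABIC-INDIC DIGIT SEVEN, which matches the '٧' in the following log line.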

I’m using the langdata from the LSTM model, as well as the Arabic best model.

I checked the unicharset file and these numbers are indeed there. Am I missing something?
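A check like the one described can be scripted. The sketch below assumes the usual unicharset text layout: a count on the first line, then one entry per line whose first whitespace-separated token is the unichar, with the space character stored as `NULL`. The function names and sample lines are hypothetical, and entries that span several code points are only compared character by character, so treat the result as a rough indication.

```python
def load_unichars(lines):
    """Collect the unichar token from each entry line of a unicharset."""
    unichars = set()
    for line in list(lines)[1:]:  # skip the leading entry count
        parts = line.split()
        if not parts:
            continue
        token = parts[0]
        # Assumption: the space character is written as the token "NULL".
        unichars.add(" " if token == "NULL" else token)
    return unichars


def missing_chars(text, unichars):
    """Return the sorted characters of `text` that have no unicharset entry."""
    return sorted({ch for ch in text if ch not in unichars})


# Hypothetical three-entry unicharset, for illustration only.
sample = ["3", "NULL 0 Common 0", "\u0627 1 Arabic 1", "\u0667 8 Common 2"]
print(missing_chars("\u0667 \u0664", load_unichars(sample)))
```

For a real file, pass `open(path, encoding="utf-8").read().splitlines()` as `lines`. Note that a standalone unicharset file passing this check does not guarantee that training can encode the text; the reply below points at the traineddata given to lstmtraining instead.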

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 26

Most upvoted comments

You are using the wrong traineddata in your lstmtraining command.

It needs to be https://github.com/raghada/tesseract_finetuning/blob/master/src/train/ara/ara.traineddata

Basically, the old traineddata is the start model from which you extracted the LSTM file.

The traineddata is the new starter traineddata created by tesstrain.sh in the train/ara directory, based on the training text you use.
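To make the distinction concrete, here is a rough sketch that assembles an lstmtraining argument list with the two files in separate roles. The flags (`--continue_from`, `--old_traineddata`, `--traineddata`, `--model_output`, `--train_listfile`, `--max_iterations`) are lstmtraining options; every path and the iteration count are placeholders, not values taken from this issue.

```python
import shlex


def build_lstmtraining_command(old_traineddata, new_traineddata, extracted_lstm,
                               model_output, train_listfile, max_iterations):
    """Return an argv list for fine-tuning from an extracted LSTM model."""
    return [
        "lstmtraining",
        # LSTM file extracted from the start model.
        "--continue_from", extracted_lstm,
        # The start model that the LSTM file was extracted from.
        "--old_traineddata", old_traineddata,
        # The new starter traineddata built from your own training text.
        "--traineddata", new_traineddata,
        "--model_output", model_output,
        "--train_listfile", train_listfile,
        "--max_iterations", str(max_iterations),
    ]


# Placeholder paths, for illustration only.
cmd = build_lstmtraining_command(
    old_traineddata="tessdata_best/ara.traineddata",
    new_traineddata="train/ara/ara.traineddata",
    extracted_lstm="extracted/ara.lstm",
    model_output="output/ara_finetuned",
    train_listfile="train/ara.training_files.txt",
    max_iterations=400,
)
print(shlex.join(cmd))
```

Building the command as a list keeps each path a single argument, and the printed line can be reviewed before it is run, for example with `subprocess.run(cmd, check=True)`.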

@raghada can you please update us on your work? Did you manage to get a traineddata that works well on both numbers and letters?