tesseract: Tesseract segmentation fault when using Arabic and English

OCR works when I run Tesseract with Arabic only or with English only. However, when I use both languages together on a picture of a scanned page, I get a segmentation fault. Here is a link to the image, a scanned page from an Arabic-English dictionary: https://imgur.com/a/K8bqz.

My command was:

tesseract Arabic_to_English.png -l eng+ara output

The terminal reported a segmentation fault. Full output:

Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Detected 24 diacritics
no best words!!
no best words!!
no best words!!
no best words!!
Segmentation fault (core dumped)

I wanted to ask whether tesseract is able to work with English and Arabic simultaneously.

Environment

Tesseract Version:
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

About this issue

State: closed
Created 6 years ago
Comments: 19 (1 by maintainers)

Most upvoted comments

To enable the debug output for multi-language recognition, change the config file named logfile so that it contains the following, then pass ‘logfile’ as the last argument of your command.

logfile config

debug_file tesseract.log
multilang_debug_level 3
stopper_debug_level 3

command

time tesseract --tessdata-dir /tesseract_ocr/tessdata_fast/   "${img_file}" "${img_file%.*}-Arabic-tessdata_fast-debug"  --oem 1  -l Arabic+ara --psm 6 logfile
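
If you would rather not edit the stock config inside tessdata/configs, a rough sketch of the same setup with a local config file is below. The file name multilang_debug.cfg is made up for this example, and the sketch assumes Tesseract falls back to treating the trailing argument as a plain path when no config of that name exists under tessdata. It also leaves out --tessdata-dir, so it assumes the Arabic and ara models are in the default tessdata directory. The image name is the one from the original report.

```shell
#!/bin/sh
# Write the three debug settings from the comment above into a config file.
write_debug_config() {
    # $1: path of the config file to create
    cat > "$1" <<'EOF'
debug_file tesseract.log
multilang_debug_level 3
stopper_debug_level 3
EOF
}

# Hypothetical local file name; deliberately not "logfile", so it cannot be
# shadowed by the stock tessdata/configs/logfile.
config_path="./multilang_debug.cfg"
img_file="Arabic_to_English.png"   # image name from the original report

write_debug_config "$config_path"

# Only run OCR when tesseract and the image are actually available.
if command -v tesseract >/dev/null 2>&1 && [ -f "$img_file" ]; then
    tesseract "$img_file" "${img_file%.*}-debug" --oem 1 -l Arabic+ara --psm 6 "$config_path"
fi
```

The debug output should then end up in tesseract.log in the current directory, as set by debug_file.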

The tesseract.log generated by the command above will look like the following.

Processing word with lang Arabic at:Bounding box=(93,1820)->(126,1851)
Trying word using lang Arabic, oem 1
Best choice: accepted=0, adaptable=0, done=1 : Lang result : ینہ : R=2.97306, C=-8.12302, F=1, Perm=8, xht=[0,3.40282e+38], ambig=0
pos	NORM	NORM	NORM
str	ی	ن	ہ
state:	1 	1 	1 
C	-0.195	-0.324	-1.160
1 new words better than 0 old words: r: 2.97306 v 0 c: -8.12302 v 0 valid dict: 1 v 0
Trying word using lang ara, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result : ىنب : R=3.02201, C=-2.14964, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
pos	NORM	NORM	NORM
str	ى	ن	ب
state:	1 	1 	1 
C	-0.208	-0.307	-0.297
1 new words worse than 1 old words: r: 3.02201 v 2.97306 c: -2.14964 v -8.12302 valid dict: 0 v 1
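
For a long page the log gets big, so it can help to boil it down to one line per attempt. Here is a small sketch that does this with awk. It assumes the log keeps exactly the layout shown above ("Trying word using lang ..." followed by a "Best choice: ... : Lang result : ..." line); the function name summarize_lang_results is made up for this example. The R= and C= fields appear to be the rating and certainty of the word.

```shell
#!/bin/sh
# Print one tab-separated line per attempt: language, recognized text, R=..., C=...
summarize_lang_results() {
    # $1: path to the debug log written by tesseract
    awk '
        /^Trying word using lang / {
            lang = $5
            sub(/,$/, "", lang)          # drop the comma after the language name
        }
        /^Best choice:/ && / : Lang result : / {
            split($0, parts, " : ")      # parts[3] = text, parts[4] = "R=..., C=..., ..."
            split(parts[4], fields, ", ")
            print lang "\t" parts[3] "\t" fields[1] "\t" fields[2]
        }
    ' "$1"
}

# Summarize the log from the current directory if it exists.
if [ -f tesseract.log ]; then
    summarize_lang_results tesseract.log
fi
```

Reading the two attempts for the same bounding box side by side makes it easier to see which language won and why.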

Shreeshrii on Jan 25, 2018

tessdata_best and tessdata_fast do NOT have the legacy tesseract model in them, hence --oem 0 (tesseract) and --oem 2 (tesseract+LSTM) won’t work.

--oem 1 is LSTM, and --oem 3 is the default, which should fall back to --oem 1. So the results should be the same.

Please also test with the Arabic traineddata, which is different from ara: it has both Arabic and English.
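
A quick way to run that comparison might look like the sketch below. It assumes Arabic.traineddata, eng.traineddata and ara.traineddata are all in the default tessdata directory (add --tessdata-dir otherwise) and reuses the image name from the original report. The helper lang_suffix is made up for this example. When tesseract or the image is missing, it only prints the commands it would run.

```shell
#!/bin/sh
img_file="Arabic_to_English.png"   # image name from the original report

# Turn a language spec such as "eng+ara" into a file-name-safe suffix.
lang_suffix() {
    printf '%s' "$1" | tr '+' '_'
}

for langs in Arabic eng+ara; do
    out_base="${img_file%.*}-$(lang_suffix "$langs")"
    if command -v tesseract >/dev/null 2>&1 && [ -f "$img_file" ]; then
        # --oem 1 selects the LSTM engine, the only one these models contain.
        tesseract "$img_file" "$out_base" --oem 1 -l "$langs" \
            || echo "tesseract failed for -l $langs (exit status $?)" >&2
    else
        echo "would run: tesseract $img_file $out_base --oem 1 -l $langs"
    fi
done
```

If only the eng+ara run fails, that narrows the problem down to the language combination rather than the image.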

Shreeshrii on Jan 20, 2018

Is there anywhere to extract confidence scores per letter/character?

see https://github.com/tesseract-ocr/tesseract/wiki/APIExample#result-iterator-example

https://github.com/tesseract-ocr/tesseract/issues/681

There is also a debug config variable you can set to see details like those in https://github.com/tesseract-ocr/tesseract/issues/681#issuecomment-275389685

Shreeshrii on Jan 21, 2018

I have updated my OP. I did not download any test data

An interesting discussion: https://english.stackexchange.com/questions/424366/does-op-mean-original-poster-or-original-post

amitdo on Jan 15, 2018