tesseract: Tesseract segmentation fault when using Arabic and English
When I tried recognizing Arabic-only and English-only text, it worked. However, when I tried to use both languages simultaneously on a picture of a scanned page, I got a ‘segmentation fault’. I have attached a link to the image of a scanned page of an Arabic-English dictionary: https://imgur.com/a/K8bqz.
My bash command was:
tesseract Arabic_to_English.png -l eng+ara output
However, the terminal reported a ‘segmentation fault’. The full error message was:
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Detected 24 diacritics
no best words!!
no best words!!
no best words!!
no best words!!
Segmentation fault (core dumped)
I wanted to ask whether tesseract is able to work with English and Arabic simultaneously.
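One way to narrow this down is to run the same image with each language setting in turn and look at the exit status. On Linux, a shell reports a process killed by SIGSEGV (signal 11) as exit status 139, that is 128 + 11. The following is only a minimal sketch under stated assumptions: the image file name is the one from the command above, while the helper name `describe_status` and the output base names are hypothetical.

```shell
#!/bin/sh
# Sketch: run one image with each language setting and report how tesseract exited.

# Translate an exit status into a short label.
# A status above 128 means the process was killed by signal (status - 128);
# SIGSEGV is signal 11 on Linux, so a segmentation fault shows up as 139.
describe_status() {
  if [ "$1" -eq 0 ]; then
    echo "ok"
  elif [ "$1" -eq 139 ]; then
    echo "segmentation fault (SIGSEGV)"
  elif [ "$1" -gt 128 ]; then
    echo "killed by signal $(( $1 - 128 ))"
  else
    echo "failed with status $1"
  fi
}

image="Arabic_to_English.png"   # file name taken from the report

if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
  for langs in eng ara eng+ara; do
    # Hypothetical output base name, e.g. out_eng_ara for eng+ara.
    base="out_$(printf '%s' "$langs" | tr '+' '_')"
    tesseract "$image" "$base" -l "$langs" >/dev/null 2>&1
    status=$?
    echo "$langs: $(describe_status "$status")"
  done
else
  echo "tesseract or $image not found; skipping the run"
fi
```

If only the `eng+ara` line reports a segmentation fault, the crash is tied to the language combination rather than to the image or to either traineddata file on its own.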
Environment
- Tesseract Version: tesseract 3.04.01 leptonica-1.73 libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
About this issue
- State: closed
- Created 6 years ago
- Comments: 19 (1 by maintainers)
To enable the debug info related to this, update the config called logfile to the following, and then use ‘logfile’ as the last argument in your command.
logfile config:
command:
The tesseract.log generated by the above will be along the following lines.
--oem 1 is LSTM, and --oem 3 is the default, which should fall back to --oem 1. So the results should be the same.
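The expectation that both engine modes give the same result can be checked by running each mode and comparing the text files. This is a rough sketch, assuming Tesseract 4.0 or later with LSTM-capable `ara` and `eng` traineddata installed; the helper name `compare_outputs` and the output base names are hypothetical, and the image name is the one from the original report.

```shell
#!/bin/sh
# Sketch: OCR the same page with --oem 1 and --oem 3, then compare the text.
# Assumes Tesseract 4.0 or later with LSTM-capable ara and eng traineddata.

# Print "same" if two files have identical contents, "different" otherwise
# (a missing file also counts as "different").
compare_outputs() {
  if cmp -s "$1" "$2"; then
    echo "same"
  else
    echo "different"
  fi
}

image="Arabic_to_English.png"   # file name taken from the original report

if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
  # out_oem1 and out_oem3 are illustrative output base names;
  # tesseract appends .txt to each.
  tesseract "$image" out_oem1 --oem 1 -l ara+eng
  tesseract "$image" out_oem3 --oem 3 -l ara+eng
  echo "oem 1 vs oem 3: $(compare_outputs out_oem1.txt out_oem3.txt)"
else
  echo "tesseract or $image not found; skipping the run"
fi
```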
Arabic traineddata, which is different from ara. It has both Arabic and English.
See https://github.com/tesseract-ocr/tesseract/wiki/APIExample#result-iterator-example
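To try the script-level Arabic traineddata mentioned here, it helps to first confirm that it is installed. The sketch below reads the output of `tesseract --list-langs` and picks the name to pass to `-l`. It rests on assumptions: that the model shows up in the list as either `Arabic` or `script/Arabic`, depending on where the file was placed in the tessdata directory, and that the image name is the one from the original report. The helper name `pick_arabic_model` is hypothetical.

```shell
#!/bin/sh
# Sketch: use the script-level Arabic traineddata if it is installed.

# Read a language list (one name per line) on stdin and print the first
# entry that looks like the Arabic script model. Returns 1 if none is found.
# Assumption: the model is listed as "Arabic" or "script/Arabic".
pick_arabic_model() {
  while IFS= read -r name; do
    case "$name" in
      Arabic|script/Arabic)
        echo "$name"
        return 0
        ;;
    esac
  done
  return 1
}

image="Arabic_to_English.png"   # file name taken from the original report

if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
  # 2>&1 because some versions print the language list to stderr.
  if model=$(tesseract --list-langs 2>&1 | pick_arabic_model); then
    tesseract "$image" output -l "$model"
  else
    echo "Arabic script traineddata not found in the installed languages"
  fi
else
  echo "tesseract or $image not found; skipping the run"
fi
```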
https://github.com/tesseract-ocr/tesseract/issues/681
There are debug-type config variables you can set to see more detail; for an example, see https://github.com/tesseract-ocr/tesseract/issues/681#issuecomment-275389685
An interesting discussion: https://english.stackexchange.com/questions/424366/does-op-mean-original-poster-or-original-post