OCRmyPDF: Arabic script is backwards and improperly aligned in output searchable PDFs

Describe the bug

Output PDF files do not OCR Arabic text properly. The text comes out backwards: for example, a word like orange is displayed as egnaro. The text layer is also misaligned with the page image in the PDF.

To Reproduce

What command line or API call were you trying to run?

ocrmypdf -l ara --sidecar output.txt input.png output.pdf --image-dpi 300

(I had OCRmyPDF work on an input image. I reproduced the same results with an input PDF.)

Logs

C:\Users\COMPUTER\Desktop>ocrmypdf -l ara --sidecar output.txt arabic1.png output.pdf --image-dpi 300 -v1
ocrmypdf 11.5.0
Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '--list-langs']
stdout/stderr = List of available languages (4):
ara
eng
osd
script/Arabic

Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '--version']
Found tesseract 5.0.0-alpha.20201127
Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '-l', 'ara', '--print-parameters', 'pdf']
Running: ['C:\\Program Files\\gs\\gs9.53.3\\bin\\gswin64c.EXE', '--version']
Found gs 9.53.3
pikepdf mmap disabled
Input file is not a PDF, checking if it is an image...
Input file is an image
Input image has no ICC profile, assuming sRGB
Image seems valid. Try converting to PDF...
imgformat = PNG
input dpi = 96 x 96
rotation = 0°
input colorspace = RGB
width x height = 671px x 949px
read_images() embeds a PNG
Successfully converted to PDF, processing...
pikepdf mmap disabled
Scanning contents: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 50.10page/s]
Using Tesseract OpenMP thread limit 3
pikepdf mmap disabled
    1 Rasterize with png16m, rotation 0
    1 Running: ['C:\\Program Files\\gs\\gs9.53.3\\bin\\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r300.000000x300.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\origin.pdf']
    1 STREAM b'IHDR' 16 13
    1 STREAM b'iCCP' 41 2354
    1 iCCP profile name b'default_rgb.icc'
    1 Compression method 0
    1 STREAM b'pHYs' 2407 9
    1 STREAM b'tEXt' 2428 31
    1 STREAM b'IDAT' 2471 8192
    1 Rotating output by 0
    1 STREAM b'IHDR' 16 13
    1 STREAM b'iCCP' 41 2350
    1 iCCP profile name b'ICC Profile'
    1 Compression method 0
    1 STREAM b'pHYs' 2403 9
    1 STREAM b'IDAT' 2424 65536
    1 resolution (300, 300)
    1 Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '-l', 'ara', '-c', 'textonly_pdf=1', WindowsPath('C:/Users/COMPUTER/AppData/Local/Temp/ocrmypdf.io.x9r_gmfd/000001_ocr.png'), 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\000001_ocr_tess', 'pdf', 'txt']
    1 [tesseract] lots of diacritics - possibly poor OCR
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    1 Grafting
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
OCR: 100%|█████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:02<00:00,  2.50s/page]
C:\Users\COMPUTER\AppData\Local\Temp\ocrmypdf.io.x9r_gmfd\sidecar.txt -> output.txt
Postprocessing...
Running: ['C:\\Program Files\\gs\\gs9.53.3\\bin\\gswin64c.EXE', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\fix_docinfo.pdf', 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\pdfa.ps']
GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Treating 18 as an optimization candidate
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.14page/s]
XrefExt(xref=18, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1
JPEGs: 0image [00:00, ?image/s]
Treating 18 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
C:\Users\COMPUTER\AppData\Local\Temp\ocrmypdf.io.x9r_gmfd\optimize.pdf -> output.pdf
Output file is a PDF/A-2B (as expected)

Example file

Expected behavior

Text should be displayed, for example, as I like OCR. Instead, it’s being displayed as RCO ekil I.
The OCR text layer is also misaligned. For example, in this screenshot I am selecting one word, but when I right-click it, the text is not only backwards but also a different word: it matches the word directly above the one I selected. The red boxes highlight the misalignment. Another screenshot shows what happens when I select everything: the blue selection highlight sits visibly below each word rather than on it.
It’s important to mention that OCRmyPDF properly performs text recognition for the output TXT files. This seems to be happening mainly in PDF files.
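
One way to confirm the "backwards" symptom without relying on screenshots is to compare text copied out of the PDF against the sidecar TXT, which is reported to be correct. The following is a minimal sketch under assumptions: the two texts have already been obtained as strings (how they are extracted from the PDF is left open), lines pair up one-to-one, and "backwards" means a plain character-by-character reversal. The helper names `looks_reversed` and `reversed_line_ratio` are hypothetical and not part of OCRmyPDF.

```python
def looks_reversed(pdf_line: str, sidecar_line: str) -> bool:
    """Return True if pdf_line is sidecar_line with its characters reversed.

    Assumption: reversal happens per Unicode code point. Arabic text with
    combining diacritics may not match exactly under this simple check.
    """
    a = pdf_line.strip()
    b = sidecar_line.strip()
    # Identical lines (including palindromes) are not counted as reversed.
    return len(a) > 1 and a != b and a == b[::-1]


def reversed_line_ratio(pdf_text: str, sidecar_text: str) -> float:
    """Fraction of paired non-empty lines where the PDF line is reversed."""
    pdf_lines = [line for line in pdf_text.splitlines() if line.strip()]
    sidecar_lines = [line for line in sidecar_text.splitlines() if line.strip()]
    pairs = list(zip(pdf_lines, sidecar_lines))
    if not pairs:
        return 0.0
    hits = sum(looks_reversed(p, s) for p, s in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # The English stand-in from the report: "I like OCR" shown as "RCO ekil I".
    print(looks_reversed("RCO ekil I", "I like OCR"))
```

A ratio close to 1.0 would suggest the PDF text layer stores each line in reversed order, while a ratio near 0.0 would point to a different kind of mismatch.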

System

OS: Windows 10
OCRmyPDF Version: 11.5.0
Tesseract version: v5.0.0-alpha.20201127
How did you install ocrmypdf: pip

About this issue

State: closed
Created 3 years ago
Comments: 19

Most upvoted comments

@Mennaruuk Praise be to God, the issue has been solved. It was a decoding and encoding problem. Thank you for the follow-up.

rehamashrafshouman on Mar 21, 2021

ocrmypdf copies the output of Tesseract into the PDF essentially without modification.

jbarlow83 on Jan 19, 2021
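
If ocrmypdf copies Tesseract's output into the PDF essentially unmodified, then running Tesseract on its own should show whether the reversed text is already present in Tesseract's PDF. The sketch below rebuilds the Tesseract invocation that appears in the log above (`-l ara -c textonly_pdf=1 ... pdf txt`). It assumes `tesseract` is on the PATH; the helper names and the output base name `tess_only` are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_pdf_command(image, output_base, language="ara", tesseract="tesseract"):
    """Build the Tesseract command line seen in the OCRmyPDF log.

    Assumption: `tesseract` resolves to the same binary OCRmyPDF uses.
    """
    return [
        tesseract,
        "-l", language,
        "-c", "textonly_pdf=1",   # PDF containing only the text layer
        str(image),
        str(output_base),         # Tesseract appends .pdf and .txt itself
        "pdf", "txt",
    ]


def run_tesseract_directly(image, output_base, language="ara"):
    """Run Tesseract without OCRmyPDF and return the path of the PDF it writes."""
    subprocess.run(tesseract_pdf_command(image, output_base, language), check=True)
    return Path(f"{output_base}.pdf")


if __name__ == "__main__":
    # Print the command only; call run_tesseract_directly() to execute it.
    print(tesseract_pdf_command("arabic1.png", "tess_only"))
```

If text copied from the resulting `tess_only.pdf` is reversed in the same way, the behavior would appear to originate in Tesseract's PDF output rather than in ocrmypdf.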