OCRmyPDF: Arabic script is backwards and improperly aligned in output searchable PDFs

Describe the bug

Output PDF files do not contain correct OCR text for Arabic: the text is reversed. For example, a word like orange is displayed as egnaro. The text layer is also misaligned with the page image.

To Reproduce

What command line or API call were you trying to run?

ocrmypdf -l ara --sidecar output.txt input.png output.pdf --image-dpi 300

(I ran OCRmyPDF on an input image. An input PDF gives the same results.)

Logs

C:\Users\COMPUTER\Desktop>ocrmypdf -l ara --sidecar output.txt arabic1.png output.pdf --image-dpi 300 -v1
ocrmypdf 11.5.0
Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '--list-langs']
stdout/stderr = List of available languages (4):
ara
eng
osd
script/Arabic

Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '--version']
Found tesseract 5.0.0-alpha.20201127
Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '-l', 'ara', '--print-parameters', 'pdf']
Running: ['C:\\Program Files\\gs\\gs9.53.3\\bin\\gswin64c.EXE', '--version']
Found gs 9.53.3
pikepdf mmap disabled
Input file is not a PDF, checking if it is an image...
Input file is an image
Input image has no ICC profile, assuming sRGB
Image seems valid. Try converting to PDF...
imgformat = PNG
input dpi = 96 x 96
rotation = 0°
input colorspace = RGB
width x height = 671px x 949px
read_images() embeds a PNG
Successfully converted to PDF, processing...
pikepdf mmap disabled
Scanning contents: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 50.10page/s]
Using Tesseract OpenMP thread limit 3
pikepdf mmap disabled
    1 Rasterize with png16m, rotation 0
    1 Running: ['C:\\Program Files\\gs\\gs9.53.3\\bin\\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r300.000000x300.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\origin.pdf']
    1 STREAM b'IHDR' 16 13
    1 STREAM b'iCCP' 41 2354
    1 iCCP profile name b'default_rgb.icc'
    1 Compression method 0
    1 STREAM b'pHYs' 2407 9
    1 STREAM b'tEXt' 2428 31
    1 STREAM b'IDAT' 2471 8192
    1 Rotating output by 0
    1 STREAM b'IHDR' 16 13
    1 STREAM b'iCCP' 41 2350
    1 iCCP profile name b'ICC Profile'
    1 Compression method 0
    1 STREAM b'pHYs' 2403 9
    1 STREAM b'IDAT' 2424 65536
    1 resolution (300, 300)
    1 Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '-l', 'ara', '-c', 'textonly_pdf=1', WindowsPath('C:/Users/COMPUTER/AppData/Local/Temp/ocrmypdf.io.x9r_gmfd/000001_ocr.png'), 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\000001_ocr_tess', 'pdf', 'txt']
    1 [tesseract] lots of diacritics - possibly poor OCR
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    1 Grafting
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
OCR: 100%|█████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:02<00:00,  2.50s/page]
C:\Users\COMPUTER\AppData\Local\Temp\ocrmypdf.io.x9r_gmfd\sidecar.txt -> output.txt
Postprocessing...
Running: ['C:\\Program Files\\gs\\gs9.53.3\\bin\\gswin64c.EXE', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\fix_docinfo.pdf', 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\pdfa.ps']
GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Treating 18 as an optimization candidate
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.14page/s]
XrefExt(xref=18, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1
JPEGs: 0image [00:00, ?image/s]
Treating 18 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
C:\Users\COMPUTER\AppData\Local\Temp\ocrmypdf.io.x9r_gmfd\optimize.pdf -> output.pdf
Output file is a PDF/A-2B (as expected)

Example file

[image]

Expected behavior

  • Text should be displayed, for example, as I like OCR. Instead, it’s being displayed as RCO ekil I.

  • The OCR text layer is also misaligned. In this screenshot, I select one word, but when I right-click it, the text is not only backwards but also a different word: it matches the word directly above the one I selected. The red boxes highlight the misalignment. [image] A second screenshot shows what happens when I select everything: the blue selection highlight sits noticeably below each word rather than on it. [image]

  • Note that OCRmyPDF recognizes the text correctly in the output TXT (sidecar) files. The problem seems to affect mainly the PDF files.
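
The reversal described above can be checked roughly by comparing a line from the sidecar TXT file with the same line copied out of the PDF. The following is a minimal sketch and not part of OCRmyPDF; the function names `collapse_whitespace` and `looks_reversed` are made up for illustration. It only tests whether one string is the character-by-character reverse of the other, which would be consistent with the PDF text being stored in visual rather than logical order.

```python
def collapse_whitespace(line: str) -> str:
    """Strip the line and collapse runs of whitespace to single spaces."""
    return " ".join(line.split())


def looks_reversed(sidecar_line: str, pdf_line: str) -> bool:
    """Return True if pdf_line is sidecar_line with its characters reversed.

    Identical lines (including palindromes and empty lines) return False,
    because they give no evidence of a reversal.
    """
    expected = collapse_whitespace(sidecar_line)
    actual = collapse_whitespace(pdf_line)
    return expected != actual and expected == actual[::-1]


# The example from this report: the sidecar has the correct order,
# the PDF text layer has the characters reversed.
print(looks_reversed("I like OCR", "RCO ekil I"))
```

This is a rough check only. It compares code points, so text with combining diacritics may need extra normalization before the comparison is meaningful.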

System

  • OS: Windows 10
  • OCRmyPDF Version: 11.5.0
  • Tesseract version: v5.0.0-alpha.20201127
  • How did you install ocrmypdf: pip

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 19

Most upvoted comments

@Mennaruuk Praise be to God, the issue has been solved. It was a decoding and encoding problem. Thank you for the follow-up.

ocrmypdf copies the output of Tesseract into the PDF essentially without modification.
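
Given that, one way to narrow the problem down is to run Tesseract directly with the same arguments that appear in the log above and see whether its PDF shows the same reversed text. The sketch below only builds that argument list; the helper name `tesseract_pdf_command` and the output base name `tess_only` are hypothetical, and it assumes `tesseract` is on the PATH.

```python
def tesseract_pdf_command(image, output_base, lang="ara", exe="tesseract"):
    """Build a Tesseract command line mirroring the one in the OCRmyPDF log.

    The arguments copy the logged invocation: a text-only PDF plus a TXT
    file, both written next to output_base.
    """
    return [
        exe,
        "-l", lang,
        "-c", "textonly_pdf=1",
        str(image),
        str(output_base),
        "pdf",
        "txt",
    ]


cmd = tesseract_pdf_command("arabic1.png", "tess_only")
print(cmd)
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would run it. If the PDF produced this way is also reversed, the problem lies in Tesseract's PDF output or in how the viewer handles it, not in OCRmyPDF.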