OCRmyPDF: [Bug]: PDF/A-3B files generated with a widely used commercial encoder generate garbage OCR content

Describe the bug

I’m using OCRmyPDF as part of the Paperless-NGX document management application. For the past few weeks, I have noticed that some OCR’d PDF documents no longer contain any readable text, only random garbage Unicode characters (see attached screenshot).

Upon investigation, it became clear that this is somehow related to commercially created PDFs, e.g. from banks, insurance companies, etc.

The affected files have several things in common, even though they come from different companies:

They are compliant with the PDF/A-3B standard
They are created with the same PDF SDK (4-Heights® PDF Processing SDK; http://www.pdf-tools.com)

Earlier, when the same files were processed fine, they were created with another version of the same SDK from the same vendor: 3-Heights™ PDF to PDF/A Converter API. However, they are also PDF/A-3B documents.

I suspect something in the output of this SDK is not working correctly. So, how can we get this working again?

Steps to reproduce

Here's the output of the command that Paperless-NGX runs in the background, run interactively on the command line:

root@1be5a04be1d0:/usr/src/paperless/data# ocrmypdf --output-type pdfa --skip-text --deskew --rotate-pages --rotate-pages-threshold 12 --sidecar ocr.txt 2023-06-15\ Salt\ Mobile\ SA.pdf some-output.pdf
Scanning contents: 100%|| 3/3 [00:00<00:00, 47.37page/s]
Start processing 3 pages concurrently
    1 skipping all processing on this page                                                                                                                                    
    2 skipping all processing on this page                                                                                                                                    
    3 skipping all processing on this page                                                                                                                                    
OCR: 100%|| 3.0/3.0 [00:00<00:00, 818.51page/s]
Postprocessing...
PDF/A conversion: 100%|| 3/3 [00:00<00:00, 10.61page/s]
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
Output file is a PDF/A-2B (as expected)
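
A detail in this output that is easy to miss: every page is reported as skipped, which is what --skip-text does for pages that already carry a text layer, so no new OCR text was produced in this run. The following is a minimal Python sketch for checking that in a saved log. It assumes the log lines look exactly like the ones above; the names LOG, skipped_pages and total_pages are made up for this illustration and are not part of OCRmyPDF.

```python
import re

# Log excerpt copied from the run above (trailing spaces trimmed).
LOG = """\
Scanning contents: 100%|| 3/3 [00:00<00:00, 47.37page/s]
Start processing 3 pages concurrently
    1 skipping all processing on this page
    2 skipping all processing on this page
    3 skipping all processing on this page
OCR: 100%|| 3.0/3.0 [00:00<00:00, 818.51page/s]
"""

# Assumed line shapes, taken from the excerpt above.
SKIP_RE = re.compile(r"^\s*(\d+) skipping all processing on this page\s*$", re.M)
TOTAL_RE = re.compile(r"Start processing (\d+) pages concurrently")


def skipped_pages(log):
    """Return the sorted page numbers reported as skipped."""
    return sorted({int(n) for n in SKIP_RE.findall(log)})


def total_pages(log):
    """Return the page count announced in the log, or None if absent."""
    match = TOTAL_RE.search(log)
    return int(match.group(1)) if match else None


skipped = skipped_pages(LOG)
total = total_pages(LOG)
print(f"{len(skipped)} of {total} pages skipped")
```

If all pages are skipped, any garbled text in the output most likely does not come from Tesseract and is more likely related to the input's existing text layer or a later processing step.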

Files

OCR output generated from these affected files: 2023-06-15_13h37_27

How did you download and install the software?

No response

OCRmyPDF version

14.2.1

Relevant log output

root@1be5a04be1d0:/usr/src/paperless/data# ocrmypdf -v1 --output-type pdfa --skip-text --deskew --rotate-pages --rotate-pages-threshold 12 --sidecar ocr.txt 2023-06-15\ Salt\ Mobile\ SA.pdf some-output2.pdf
ocrmypdf 14.2.1
Running: ['tesseract', '--version']
Found tesseract 5.3.0
Running: ['tesseract', '--version']
Running: ['gs', '--version']
Found gs 10.00.0
Running: ['gs', '--version']
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (6):
deu
eng
fra
ita
osd
spa

os.symlink(2023-06-15 Salt Mobile SA.pdf, /tmp/ocrmypdf.io.gylqswlb/origin)
os.symlink(/tmp/ocrmypdf.io.gylqswlb/origin, /tmp/ocrmypdf.io.gylqswlb/origin.pdf)
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 45.42page/s]
Using Tesseract OpenMP thread limit 1
Start processing 3 pages concurrently
    1 skipping all processing on this page                                                                                                                                    
    2 skipping all processing on this page                                                                                                                                    
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                        
    3 skipping all processing on this page                                                                                                                                    
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                    
    2 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                        
    2 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                    
    3 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                        
    3 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                    
OCR: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.0/3.0 [00:00<00:00, 423.58page/s]
/tmp/ocrmypdf.io.gylqswlb/sidecar.txt -> ocr.txt
Postprocessing...
os.symlink(/tmp/ocrmypdf.io.gylqswlb/graft_layers.pdf, /tmp/ocrmypdf.io.gylqswlb/fix_docinfo.pdf)
Running: ['gs', '--version']
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.gylqswlb/fix_docinfo.pdf', '/tmp/ocrmypdf.io.gylqswlb/pdfa.ps']
GPL Ghostscript 10.0.0 (2022-09-21)
Copyright (C) 2022 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 3.
Page 1                                                                                                                                                                        
Page 2                                                                                                                                                                        
Page 3                                                                                                                                                                        
PDF/A conversion: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 10.30page/s]
Running: ['tesseract', '--version']
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
The following metadata fields were not copied: {'{http://ns.adobe.com/pdf/1.3/}PDFVersion', '{http://ns.adobe.com/xap/1.0/mm/}History', '{http://ns.adobe.com/xap/1.0/}MetadataDate', '{http://ns.adobe.com/xap/1.0/mm/}InstanceID'}
Optimizable images: JPEGs: 0 PNGs: 0
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
Optimizable images: JBIG2 groups: 0
JBIG2: 0item [00:00, ?item/s]
os.symlink(/tmp/ocrmypdf.io.gylqswlb/optimize.opt.pdf, /tmp/ocrmypdf.io.gylqswlb/optimize.pdf)
Running: ['jbig2', '--version']
Running: ['pngquant', '--version']
Optimize ratio: 1.00 savings: 0.1%
/tmp/ocrmypdf.io.gylqswlb/optimize.pdf -> some-output2.pdf
Output file is a PDF/A-2B (as expected)
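
Since the verbose log shows all pages skipped and Ghostscript rewriting the file for PDF/A conversion, one way to narrow this down is to run the same command once with --output-type pdfa and once with --output-type pdf, then compare the text in the two outputs. This is only a sketch: it assumes, without confirmation in this report, that the PDF/A conversion step is a plausible suspect. The output file names and the build_command helper are hypothetical, and --sidecar is left out for brevity. The script only prints the commands; it does not run them.

```python
import shlex

# Flags taken from the command in this report (without --sidecar).
BASE_FLAGS = [
    "--skip-text",
    "--deskew",
    "--rotate-pages",
    "--rotate-pages-threshold", "12",
]


def build_command(input_pdf, output_pdf, output_type):
    """Build an ocrmypdf argument list for the given output type."""
    return ["ocrmypdf", "--output-type", output_type,
            *BASE_FLAGS, input_pdf, output_pdf]


source = "2023-06-15 Salt Mobile SA.pdf"
# Hypothetical output names, one per output type.
for output_type, target in [("pdfa", "with-pdfa.pdf"), ("pdf", "without-pdfa.pdf")]:
    print(shlex.join(build_command(source, target, output_type)))
```

If the text is readable in the --output-type pdf result but garbled in the pdfa result, that would point towards the conversion step rather than OCR.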

About this issue

State: closed
Created a year ago
Comments: 19 (2 by maintainers)

Most upvoted comments

OK, I will look into it and get in touch with the Paperless folks.

I might also have a candidate file for you to check. I finally got a less sensitive bill that I could probably share. I'll validate the issue and then get back to you.

jce-zz on Aug 9, 2023