OCRmyPDF: [13.4.2] lossy compression of PNGs into JPEGs when it shouldn't

  1. It might just be the older version, but ocrmypdf 12.7.2 seems to compress uncompressed PNGs into (lossy) JPEGs:
$ ocrmypdf --version
12.7.2
$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
$ convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png
$ img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf
$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
$ pdfimages -list ./Example-uncompress-compress.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 4157B 4.5%

I believe that, at optimize level 1, it should be running the image through pngquant instead.
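
For comparison, running pngquant by hand keeps the image in PNG form while still shrinking it (the quality range here is illustrative, not necessarily what ocrmypdf uses internally):
$ pngquant --quality=70-95 --output Example-quant.png Example-uncompress.png
$ stat -c "%n,%s" Example-uncompress.png Example-quant.png | column -t -s,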

  2. By the way, this is probably not even worth mentioning since, looking at the changelog, I'm fairly certain you've already sorted it out in recent ocrmypdf versions, but small PDFs with small PNGs grow instead of shrinking or staying the same size:
$ ocrmypdf --version
12.7.2
$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
$ img2pdf ./Example.png -o ./Example.pdf
$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
$ stat -c "%n,%s" Example*.* | column -t -s,
Example-compress.pdf  7799
Example.pdf           3906
Example.png           2335

Though this might also be down to the PDF being converted to the archival PDF/A spec…
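
One way to test that would be rerunning with --output-type pdf, which skips the PDF/A conversion; if the output stops growing, the growth comes from the archival conversion rather than the optimizer (I haven't rerun this myself):
$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text --output-type pdf Example.pdf Example-compress-raw.pdf
$ stat -c "%n,%s" Example*.pdf | column -t -s,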

  3. As a side note, if compute time isn't a factor, I personally found 'optipng -o7' to produce smaller PNGs than pngquant, and 'jpegrescan -i -t -v' to produce the smallest JPEGs, even compared to MozJPEG, which is odd given that its author claims otherwise; the rough invocations follow below.
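
(File names are illustrative; the flags are the ones quoted above.)
$ optipng -o7 Example.png
$ jpegrescan -i -t -v input.jpg output.jpg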

P.S. I forgot to mention that the PNG-to-JPEG bug also happens with some already-compressed PNGs, but I haven't bothered trying to replicate it, since I believe it should never try to convert bitmap images to JPEGs in the first place.
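
A quick way to check is to reuse the files from point 2 above and look at the enc column, which should say png or image rather than jpeg:
$ pdfimages -list ./Example-compress.pdf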

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 21

Most upvoted comments

Btw, what would it take to have OCRmyPDF preserve existing PDF/A documents?

If an input document is already a valid PDF/A, and we’re only adding the text layer, and we’re not preprocessing images, we could probably keep it a PDF/A without passing through Ghostscript. It’s a special case, but it seems like a worthwhile one…

I'm guessing OCRmyPDF would only need to avoid things like linearization to pass veraPDF?

Linearization is allowed in PDF/A if the PDF is 1.5 or above, IIRC.
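
If it helps, a candidate round-trip check with the veraPDF CLI (flavour value illustrative, and I haven't actually scripted this against OCRmyPDF's output):
$ verapdf --flavour 2b input-pdfa.pdf
$ ocrmypdf --tesseract-timeout=0 --skip-text input-pdfa.pdf output.pdf
$ verapdf --flavour 2b output.pdf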

@rmast Anything using Ghostscript will run into lossy conversions in some cases, since Ghostscript doesn't support the same image formats and color profiles as PDF.

The specific issue here is a nuance of:

Ghostscript may transcode grayscale and color images, either lossy to lossless or lossless to lossy, based on an internal algorithm. This behavior can be suppressed by setting --pdfa-image-compression to jpeg or lossless to set all images to one type or the other. Ghostscript has no option to maintain the input image’s format. (Ghostscript 9.25+ can copy JPEG images without transcoding them; earlier versions will transcode.)

( https://ocrmypdf.readthedocs.io/en/latest/introduction.html#limitations )
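
In practice that means forcing a single encoding for every image, e.g.:
$ ocrmypdf --output-type pdfa --pdfa-image-compression lossless input.pdf output.pdf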

I've raised a similar issue with pdfScale.sh, where I also made some test scripts that illustrate the problem: https://github.com/tavinus/pdfScale/issues/27

I should have closed the issue, since it's known and documented, but because --output-type pdf isn't the default behavior, I figured I should leave it to the dev to decide whether to close it, as the problem is technically still there.

Anyhow, PDF24 is closed-source freeware, so I won't look into it too closely, but if it uses Ghostscript it will have to deal with similar issues.

Otherwise, FBCNN seems like a nice image-restoration neural-net model (I've personally used waifu2x to upscale sheet music before OCRing it with Audiveris, with good results), but it's still a lossy process, so it's only appropriate as a mid-stage step before running Tesseract.
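
As a rough sketch of that mid-stage idea (assuming the waifu2x-ncnn-vulkan port; flags can differ between builds):
$ waifu2x-ncnn-vulkan -i scan.png -o scan-2x.png -s 2 -n 1
$ tesseract scan-2x.png scan-2x-text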