OCRmyPDF: [13.4.2] lossy compression of PNGs into JPEGs when it shouldn't
- It might just be the older version, but ocrmypdf 12.7.2 seems to recompress uncompressed PNGs into (lossy) JPEGs:
$ ocrmypdf --version
12.7.2
$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
$ convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png
$ img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf
$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
$ pdfimages -list ./Example-uncompress-compress.pdf
page   num  type   width height color comp bpc  enc  interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 4157B 4.5%
I believe that at optimize level 1 it should be running the image through pngquant instead (see the extraction check at the end of this post).
- Btw, this is probably not even worth mentioning since, looking at the changelog, I'm fairly certain it's already sorted out in recent ocrmypdf versions, but small PDFs containing small PNGs grow instead of shrinking or staying the same size:
$ ocrmypdf --version
12.7.2
$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
$ img2pdf ./Example.png -o ./Example.pdf
$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
$ stat -c "%n,%s" Example*.* | column -t -s,
Example-compress.pdf  7799
Example.pdf           3906
Example.png           2335
Though this might also just be the overhead of converting the output to the archival PDF/A format (see the --output-type check at the end of this post)…
- As a side note, if compute time isn't a factor, I personally found ‘optipng -o7’ to produce smaller PNGs than pngquant, and ‘jpegrescan -i -t -v’ to produce the smallest JPEGs, even compared to MozJPEG, oddly enough, despite its author saying otherwise (rough comparison commands are at the end of this post).
p.s. I forgot to mention that the PNG-to-JPEG bug also happens with some compressed PNGs, but I haven't bothered trying to replicate that, since I believe it should never convert bitmap images to JPEGs in the first place.
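For reference, the extraction check mentioned above: pdfimages -all dumps each embedded image in its native format, so if the conversion really happened, the dumped file is an actual JPEG stream. The output filename below is an assumption on my part; poppler picks the index and extension based on the embedded stream:
$ pdfimages -all ./Example-uncompress-compress.pdf extracted
$ file ./extracted-000.jpg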
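The --output-type check mentioned above: rerunning the same command without the PDF/A conversion should show how much of the growth comes from the archival conversion rather than from the image handling. I haven't rerun this on 12.7.2 myself, so treat it as a suggestion:
$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text --output-type pdf Example.pdf Example-compress-nopdfa.pdf
$ stat -c "%n,%s" Example*.pdf | column -t -s,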
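And the rough comparison commands for the side note; the output-file flags are from memory, so double-check them against your versions of the tools:
$ optipng -o7 -out Example-optipng.png Example.png
$ pngquant --output Example-pngquant.png Example.png
$ stat -c "%n,%s" Example*.png | column -t -s,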
If an input document is already a valid PDF/A, and we’re only adding the text layer, and we’re not preprocessing images, we could probably keep it a PDF/A without passing through Ghostscript. It’s a special case, but it seems like a worthwhile one…
Linearization is allowed in PDF/A if the PDF is 1.5 or above, IIRC.
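For what it's worth, detecting whether the input already declares PDF/A conformance is cheap. A rough sketch, assuming pikepdf (already an OCRmyPDF dependency) is available, using a hypothetical input.pdf:
$ python3 -c "import pikepdf; m = pikepdf.open('input.pdf').open_metadata(); print(m.get('pdfaid:part'), m.get('pdfaid:conformance'))"
If that prints something like 2 B, the file at least claims PDF/A-2B conformance; whether it actually validates is another matter.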
@rmast Anything using Ghostscript will run into lossy conversions in some cases, since Ghostscript doesn't support the same image formats and color profiles as PDF.
The specific issue here is a nuance of the documented limitations:
https://ocrmypdf.readthedocs.io/en/latest/introduction.html#limitations
I've raised a similar issue with pdfScale.sh, where I also made some test scripts to illustrate the problem: https://github.com/tavinus/pdfScale/issues/27
I should have closed the issue since it's known and documented, but since --output-type pdf isn't the default behavior, I figured I'd leave it up to the dev to decide whether to close it, since it's still technically there.
Anyhow, PDF24 is closed-source freeware, so I won't look too much into it, but if it uses Ghostscript, it will have to deal with similar issues.
Otherwise, FBCNN seems like a nice image-restoration neural-net model (I've personally used waifu2x to upscale sheet music before OCRing it with Audiveris, with good results), but it's still a lossy process, so it's only appropriate as an intermediate stage before running Tesseract.