tesseract: Wrong page size in text only pdf file

Environment

Tesseract Version: 5.0.1
Platform: intel macos 12.2

Current Behavior:

When an input tiff file is used to create a text only pdf (with -c textonly_pdf=1), the page width of the output pdf is calculated as the pixel width of the input tiff divided by the resolution of the tiff. However, the result is wrong when the tiff resolution is a decimal number (float) rather than an integer (int): Tesseract seems to use floor(resolution) in that case.

E.g. an input tiff with width=726 pixels and resolution 92.202 pixels/inch results in an output pdf with page size 568.17 pts (1 pt = 1/72 inch). Tesseract used 92 pixels/inch to get to that result. (I’m limiting the example to page width, but the same happens to page height.)
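The arithmetic in this example can be checked with a short sketch. The helper name `page_width_pts` is made up for illustration and is not part of Tesseract or Leptonica; it only applies the formula described above (pixel width divided by resolution, times 72 points per inch).

```python
import math

POINTS_PER_INCH = 72  # 1 pt = 1/72 inch


def page_width_pts(width_px: int, ppi: float) -> float:
    """Page width in points for an image of width_px pixels at ppi pixels/inch.

    Hypothetical helper for illustration only, not a Tesseract function.
    """
    return width_px / ppi * POINTS_PER_INCH


width_px = 726
tiff_ppi = 92.202

# Width if the resolution is truncated to an integer (the reported behavior).
truncated = page_width_pts(width_px, math.floor(tiff_ppi))
# Width if the exact resolution stored in the tiff is used (the expected behavior).
exact = page_width_pts(width_px, tiff_ppi)

print(f"truncated: {truncated:.2f} pt, exact: {exact:.2f} pt")
```

With these numbers the truncated resolution gives about 568.17 pt, matching the reported output, while the exact resolution gives about 566.93 pt.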

Expected Behavior and suggested fix:

Tesseract should use the exact resolution of the input tiff file to calculate the output pdf file page size.

About this issue

State: closed
Created 2 years ago
Comments: 40 (15 by maintainers)

Most upvoted comments

Thank you, Jeff.

Following your input, let’s do this: (1) I’ll fix the input resolution in the tiff reader to round instead of truncate. Note that for png, jpeg and bmp, resolution is saved as an int, and we are properly rounding whenever the units are changed (such as pixels/meter to pixels/inch for png, or pixels/cm to pixels/inch in jpeg). (2) We’ll not mess with saving resolution in the pix as a float. At 100 ppi, the maximum error from rounding is plus/minus 0.5%. For a 10 inch image, this results in a maximum page size error of 0.05 inch, slightly more than 1 mm. So this is the best we will do on this issue.
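The error bound quoted in this comment can be reproduced with a few lines. This is a rough sketch using a first-order approximation (rounding shifts the resolution by at most 0.5 ppi, so the relative error is about 0.5 / ppi); the function name is hypothetical and does not come from Leptonica.

```python
MM_PER_INCH = 25.4


def max_rounding_error_inches(page_inches: float, ppi: float) -> float:
    """Approximate worst-case page size error when ppi is rounded to an integer.

    Rounding to the nearest integer changes the resolution by at most 0.5 ppi,
    so the relative error is roughly 0.5 / ppi (first-order approximation).
    """
    return page_inches * 0.5 / ppi


error_in = max_rounding_error_inches(page_inches=10, ppi=100)
error_mm = error_in * MM_PER_INCH

print(f"max error: {error_in:.3f} in ({error_mm:.2f} mm)")
```

For a 10 inch image at 100 ppi this gives roughly 0.05 inch, or about 1.27 mm, in line with the "slightly more than 1 mm" stated above. The bound grows as the resolution drops, since 0.5 ppi is a larger fraction of a smaller number.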

Let’s defer the library name change from liblept to libleptonica in autotools (and Debian). I still plan to do this, perhaps in a month, and we’ll see what chaos results when I put it out as 1.83.0. Question: in src/Makefile.am, will we still need the install-data-hook and uninstall-hook targets?

Changing the .so names to the canonical format and the .so number to 6.0.0 could happen later, as 1.84.0.

DanBloomberg on Mar 10, 2022

Incompatible changes to Leptonica have enough cost that they should not be done lightly. It includes recompiling all programs using Leptonica, and for those that need modification (like jbig2enc) getting the modification done, plus making them finicky about which versions of Leptonica they link against.

So I’d like to take a step back here and discuss the problem being solved.

The complaint that started this bug was about an image that is a bit under 8 inches wide. After Leptonica does some rounding (for pixels per inch) the PDF produced is the wrong size, by 0.017 inches. For those metrically inclined, we are talking 0.44 millimeters.
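As a sanity check on those figures, here is a small sketch that recomputes the discrepancy. It assumes the numbers from the original report (726 pixels wide, 92.202 pixels/inch stored in the tiff, 92 pixels/inch actually used).

```python
MM_PER_INCH = 25.4

width_px = 726        # image width from the original report
exact_ppi = 92.202    # resolution stored in the tiff
used_ppi = 92         # integer resolution that was actually used

exact_in = width_px / exact_ppi  # true physical width in inches
used_in = width_px / used_ppi    # width implied by the integer resolution

diff_in = used_in - exact_in
diff_mm = diff_in * MM_PER_INCH

print(f"exact width: {exact_in:.3f} in")
print(f"difference:  {diff_in:.3f} in ({diff_mm:.2f} mm)")
```

The exact width comes out a little under 7.9 inches, and the difference is about 0.017 inch, or 0.44 mm.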

Perhaps this level of precision (or imprecision!) is acceptable? Could Jasper or someone else explain in more detail why this is not good enough?

jbreiden on Feb 16, 2022

Thanks all.

I would like to go ahead with the l_int32 --> l_float32 conversion in the resolution fields in pix and L_Compress_Data. However, I need guidance on whether it will cause issues with tesseract because of the use of L_Compress_Data. I would also like to up-version to so.6.0.0, and am waiting on @jbreiden there.

DanBloomberg on Feb 7, 2022

https://abi-laboratory.pro/?view=timeline&l=leptonica

This report is produced by using some open source tools: https://github.com/lvc?tab=repositories

amitdo on Feb 7, 2022