tesseract: Incorrect bounding boxes

I use the latest release, 4.1, with LSTM only and with the best traineddata files

https://github.com/tesseract-ocr/tesseract/archive/4.1.0-rc1.tar.gz

Just to give an example

The two amounts on the right, 528,00 and 72,00, overlap each other in the OCR results but do not overlap in the input image

Here is a link to the preprocessed image (tiff) before sending it to tesseract: https://imgur.com/a/12qqobk

They intersect vertically by 10 px (1353 - 1343) even though they are far from each other in the image

Bounding box for 528,00:

[top] => 1317
[bottom] => 1353
[left] => 2089
[right] => 2218
[width] => 129
[height] => 36
[value] => 528,00
[conf] => 96.28

Bounding box for 72,00:

[top] => 1343
[bottom] => 1408
[left] => 2112
[right] => 2211
[width] => 99
[height] => 65
[value] => 72,00
[conf] => 96.87
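
To make the overlap concrete, here is a minimal sketch that recomputes the intersection from the two boxes reported above. The `Box` type and the `overlap` helper are hypothetical names used only for illustration; they are not part of the tesseract API, and the coordinates are copied from the dumps in this report.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper type)."""
    top: int
    bottom: int
    left: int
    right: int


def overlap(a: Box, b: Box) -> Tuple[int, int]:
    """Return (horizontal, vertical) overlap in pixels; 0 means no overlap on that axis."""
    horizontal = max(0, min(a.right, b.right) - max(a.left, b.left))
    vertical = max(0, min(a.bottom, b.bottom) - max(a.top, b.top))
    return horizontal, vertical


# Coordinates copied from the two bounding boxes reported above.
box_528 = Box(top=1317, bottom=1353, left=2089, right=2218)
box_72 = Box(top=1343, bottom=1408, left=2112, right=2211)

horizontal, vertical = overlap(box_528, box_72)
print(horizontal, vertical)  # 99 10
```

The boxes share 99 px horizontally and 10 px vertically, so they intersect in a 99 x 10 px region. The box for 72,00 is also 65 px tall against 36 px for 528,00, although both values appear to be printed at a similar size.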


About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 23 (9 by maintainers)

Most upvoted comments

No, it was not arrogant: #2103 is not a bug in tesseract but wrong usage of the API, which is a bug in YOUR code. And the situation repeats here once again: I proved that the problem is not in tesseract but in your pre-processing. You are asking for free support (by calling it a “bug”) to fix your business problem. So who is acting arrogantly? Not me.

You can ask for support on the user forum. Maybe somebody will be willing to help you for free. There are also several (paid) developers who have done exactly what you are trying to do, but they will not share their knowledge for free.