OCRmyPDF: Bad OCR results

I am using

ocrmypdf -l deu in.pdf out.pdf

but the OCR results are quite disappointing. The text is very clear and should be easy to detect. In this case, “Dienstleistung” seems to be detected as “Diensttleistuung” (at least that is what copy and paste reveals, and it is consistent with searching inside the PDF).

[Screenshot, 2018-09-11]

I am using:

ocrmypdf 7.0.4
tesseract 3.05.02
on macOS 10.13.6

I looked at https://github.com/tesseract-ocr/tessdata_best but I guess those are too new for this Tesseract version?

Any advice?

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 16

Most upvoted comments

Do you have Ghostscript 9.24? (gs --version)

9.24 seems to have broken many things 😦. Using Ghostscript 9.24, the sidecar file contains only the first page, but in 9.23 the sidecar file contains all text.

The text file confirms that Tesseract is working.
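The check above can be sketched as a short shell snippet. It is only an illustration: the helper name check_gs_version and the sidecar file name out.txt are assumptions, and the snippet does nothing if gs or ocrmypdf is not installed. The --sidecar option asks ocrmypdf to write the recognized text to a separate text file, which makes it easy to see what Tesseract produced independently of any PDF viewer.

```shell
# Hypothetical helper: flag the Ghostscript release reported as problematic above.
check_gs_version() {
    if [ "$1" = "9.24" ]; then
        echo "affected"
    else
        echo "ok"
    fi
}

# Report on the installed Ghostscript, if there is one.
if command -v gs >/dev/null 2>&1; then
    check_gs_version "$(gs --version)"
fi

# Re-run the OCR with a sidecar text file (out.txt is an assumed name).
if command -v ocrmypdf >/dev/null 2>&1 && [ -f in.pdf ]; then
    ocrmypdf -l deu --sidecar out.txt in.pdf out.pdf
fi
```

If out.txt contains the correct spelling, the recognition step itself is fine and the problem lies in how the viewer extracts text.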

It is quite likely a Preview problem. Text extraction from PDF is necessarily heuristic because PDF, as a type of print media, has no concept of “words”, just objects that are printed at specific positions. Apple seems to have little interest in fixing PDF display issues in Preview. Preview and Evince are bad and worse at text extraction, respectively. If you can check with Acrobat, it does a better job.

You can also use Ghostscript txtwrite to extract text, as another way to view the output: https://www.ghostscript.com/doc/9.21/VectorDevices.htm#TXT
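A minimal sketch of the txtwrite route, assuming the OCRed file is named out.pdf as in the question. The function names extract_text and has_word are made up for this example; the gs options are the usual batch-mode flags together with the txtwrite device, writing to standard output.

```shell
# Print the text layer of a PDF to stdout using Ghostscript's txtwrite device.
extract_text() {
    gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile=- "$1"
}

# Hypothetical helper: report whether a word occurs in the text read from stdin.
has_word() {
    if grep -q -- "$1"; then
        echo "found"
    else
        echo "missing"
    fi
}

# Check the OCR output for the word from the question, if the tools and file exist.
if command -v gs >/dev/null 2>&1 && [ -f out.pdf ]; then
    extract_text out.pdf | has_word "Dienstleistung"
fi
```

A "found" result here while Preview still shows “Diensttleistuung” would point at the viewer rather than at the OCR.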

That being said, it may be helpful if I can view the PDF. That would be a way to check whether anything can be done to improve the output. If you are concerned about sharing the file publicly, you can encrypt it with my public key as described here: https://github.com/jbarlow83/OCRmyPDF/wiki