OCRmyPDF: Bad OCR results
I am using
ocrmypdf -l deu in.pdf out.pdf
but the OCR results are quite disappointing. The text is very clear and should be easy to detect. In this case “Dienstleistung” seems to be detected as “Diensttleistuung” (at least that’s what copy and paste reveals, and it is consistent with searching inside the PDF).

I am using:
ocrmypdf 7.0.4
tesseract 3.05.02
on macOS 10.13.6
I looked at https://github.com/tesseract-ocr/tessdata_best but I guess those are too new?
Any advice?
About this issue
- State: closed
- Created 6 years ago
- Comments: 16
Commits related to this issue
- Blacklist Ghostscript 9.24 due to regressions - As per issue #291 — committed to ocrmypdf/OCRmyPDF by deleted user 6 years ago
- Blacklist Ghostscript 9.24 due to regressions As per issue #291. Forced push to remove a copyrighted test file that was accidentally included. — committed to ocrmypdf/OCRmyPDF by deleted user 6 years ago
Do you have Ghostscript 9.24? (gs --version) 9.24 seems to have broken many things 😦. Using Ghostscript 9.24, the sidecar file contains only the first page, but in 9.23 the sidecar file contains all text.
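A rough way to check for the affected release from a script, as a sketch only: gs_is_affected is a hypothetical helper name, and it assumes gs --version prints a bare version number such as 9.24.

```shell
# Hypothetical helper: succeeds (exit status 0) when the given version
# string is the Ghostscript 9.24 release reported as problematic here.
gs_is_affected() {
  case "$1" in
    9.24|9.24.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Only query Ghostscript if it is actually on PATH.
if command -v gs >/dev/null 2>&1; then
  if gs_is_affected "$(gs --version)"; then
    echo "Ghostscript 9.24 detected; try 9.23 or another release"
  else
    echo "Ghostscript $(gs --version) is not the affected release"
  fi
fi
```

To compare the PDF text layer against what Tesseract recognized, the sidecar file can be requested explicitly, for example with ocrmypdf -l deu --sidecar out.txt in.pdf out.pdf (file names are placeholders).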
The text file confirms that Tesseract is working.
It is quite likely a Preview problem. Text extraction from PDF is necessarily heuristic because PDF, as a type of print media, has no concept of “words”, only objects printed at specific positions. Apple seems to have little interest in fixing PDF display issues in Preview. Preview and Evince are bad and worse at text extraction, respectively. If you can check in Acrobat, it does a better job.
You can also use Ghostscript txtwrite to extract text, as another way to view the output: https://www.ghostscript.com/doc/9.21/VectorDevices.htm#TXT
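For reference, a txtwrite invocation might look like the sketch below. pdf_to_text is a hypothetical wrapper name and the file names are placeholders; the Ghostscript switches themselves are the standard ones for selecting an output device and file.

```shell
# Extract the text layer of a PDF with Ghostscript's txtwrite device.
# $1 = input PDF, $2 = output text file (both are placeholders).
pdf_to_text() {
  gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile="$2" "$1"
}

# Example usage (not run here):
#   pdf_to_text out.pdf out-gs.txt
#   grep -c "Dienstleistung" out-gs.txt
```

If the word shows up correctly in this output but not when copying from Preview, that points at the viewer rather than at the OCR layer.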
That being said, it may be helpful if I can view the PDF. That would be a way to check whether anything can be done to improve the output. If you are concerned about sharing the file publicly, you can encrypt it with my public key as described here: https://github.com/jbarlow83/OCRmyPDF/wiki