OCRmyPDF: [Running in Spyder IDE] OCR section stalls out, raises OSError: [Errno 9], or output pdf isn't rotated.

Describe the bug

When running ocrmypdf on a variety of files, I’ve hit the following errors, in order of prevalence:

  • OSError: [Errno 9] Bad file descriptor is raised (unlike #109, I have 36 GB of memory).
  • The OCR step runs indefinitely until I manually kill the process.
    • After forcibly closing the process I sometimes also hit an “unable to load library liblept-5.dll” error, which goes away when I reinstall Tesseract. I assume the forced shutdown leaves a file corrupted.
  • The OCR step successfully completes but the output isn’t rotated.

This has happened across multiple computers, even ones with fresh installations of Anaconda, Tesseract, Ghostscript, and ocrmypdf. I’ve always run it in Spyder, using ocrmypdf as a module.

To Reproduce

  • (Optionally) Uninstall and reinstall Tesseract OCR (and, even more optionally, Ghostscript and all of Anaconda).
  • (Optionally) Run pip uninstall ocrmypdf, then pip install git+github.com/jbarlow83/ocrmypdf.
  • Run the code given at the bottom as a .py file in Spyder.

Expected behavior

The OCR portion of ocrmypdf will finish without error and generate files which are properly rotated.

System (please complete the following information):

  • OS: Windows 10 (Also has occurred on Windows 7)
  • Python version: 3.7.3 (Also has occurred for Python 3.6 and Python 3.8.8)
  • OCRmyPDF version: 12.2.0 (also occurred with a version in the 9’s)
  • TesseractOCR: 5.0.0-alpha.20190708 (also occurred with the latest installer from UB, which just has a different date at the end)

Installation

How did you install OCRmyPDF? Did you install it from your operating system’s package manager, or using pip?

I’ve only ever installed it through pip. Initially I used pip install git+github.com/jbarlow83/ocrmypdf, but after these errors started I switched to pip install ocrmypdf.

Additional context

The script used is:


import os  # needed for os.path.join below
import ocrmypdf
from wand.image import Image as Img
from pathlib import Path
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

# Only affects the direct pytesseract calls made later in the original script.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

path = r"C:\Users\username\Documents"
infile = "ExampleFile.pdf"
outfile = "ExampleFile_ocr.pdf"
lang = "eng"

# OCR the input and ask ocrmypdf to rotate pages to their detected orientation.
ocrmypdf.ocr(
    input_file=os.path.join(path, infile),
    output_file=os.path.join(path, outfile),
    language=lang,
    output_type="pdf",
    rotate_pages=True,
)

(Original code also uses wand, PIL, and pytesseract directly, to re-ocr the output of ocrmypdf as jpg blobs and extract text for translation if lang != eng)

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 18

Most upvoted comments

ocrmypdf is very strict about stdout. We write nothing to it unless specifically requested by the user. Any amount of chatter on stdout is a test failure.
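
The stdout rule above can be expressed as a small check. This is only a sketch and is not taken from ocrmypdf's test suite: assert_quiet_stdout is a hypothetical helper that runs a command, fails if anything at all lands on stdout, and leaves stderr free for log output.

```python
import subprocess
import sys


def assert_quiet_stdout(cmd):
    """Run cmd; raise AssertionError if it wrote anything to stdout.

    Hypothetical helper for illustration. Returns the captured stderr bytes.
    """
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True
    )
    if proc.stdout != b"":
        raise AssertionError(f"unexpected stdout chatter: {proc.stdout!r}")
    return proc.stderr


# Example command: a child process that logs to stderr only, so it passes.
quiet_child = [sys.executable, "-c", "import sys; sys.stderr.write('log line')"]
```

Calling assert_quiet_stdout(quiet_child) returns the stderr bytes, while a command that prints to stdout raises AssertionError.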

To add a few more details in case someone knowledgeable reads this: ocrmypdf child processes discard all of their log handlers and set up a queue handler (a semaphore-based, multiple-producer, single-consumer IPC queue). A separate thread in the main process gathers all child-process messages and handles them, by default forwarding them to sys.stderr.
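
The arrangement described above can be approximated with the standard library alone. The sketch below is not ocrmypdf's code: it assumes logging.handlers.QueueHandler in each child and logging.handlers.QueueListener (which runs its own thread) in the parent, and the names worker_init, worker, and run are made up for the example.

```python
import logging
import logging.handlers
import multiprocessing as mp
import sys


def worker_init(queue):
    """In a child: drop every existing root handler, log only to the queue."""
    root = logging.getLogger()
    for handler in list(root.handlers):
        root.removeHandler(handler)
    root.addHandler(logging.handlers.QueueHandler(queue))
    root.setLevel(logging.INFO)


def worker(queue, page):
    """Stand-in for per-page work; emits one log record through the queue."""
    worker_init(queue)
    logging.getLogger("demo.worker").info("processed page %d", page)


def run(pages, stream=None):
    """Parent side: a listener thread drains the queue and forwards to a stream."""
    queue = mp.Queue()
    target = stream if stream is not None else sys.stderr
    listener = logging.handlers.QueueListener(queue, logging.StreamHandler(target))
    listener.start()
    procs = [mp.Process(target=worker, args=(queue, page)) for page in pages]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    # stop() enqueues a sentinel and joins the listener thread,
    # so records queued before this point are handled first.
    listener.stop()
```

On Windows, a script that calls run([1, 2, 3]) should do so under an if __name__ == "__main__": guard, because child processes are started by re-importing the main module. Note that the final write still goes to whatever object sys.stderr is in the parent, so an environment that replaces that object changes where the messages end up.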

There’s also the mess in src/leptonica.py: depending on which Leptonica is installed, we might have to redirect and un-redirect stderr around each Leptonica call.
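
For readers unfamiliar with this kind of redirection, here is a generic sketch of redirecting an OS-level descriptor around a call into a C library. It is not the code in src/leptonica.py; capture_fd is a hypothetical name, and the approach (os.dup plus os.dup2 onto a temporary file) is an assumption about how such a redirect is typically done.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def capture_fd(fd=2):
    """Temporarily point OS-level descriptor fd at a temp file.

    Yields a list; on exit, the captured text is appended to it.
    Python-level buffers (for example sys.stderr) are not flushed here.
    """
    captured = []
    # os.dup raises OSError: [Errno 9] Bad file descriptor if fd is not open.
    saved = os.dup(fd)
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        try:
            os.dup2(tmp.fileno(), fd)  # redirect
            yield captured
        finally:
            os.dup2(saved, fd)  # un-redirect
            os.close(saved)
            tmp.seek(0)
            captured.append(tmp.read().decode(errors="replace"))
```

Usage would look like `with capture_fd(2) as out: call_into_c_library()`, where call_into_c_library is a placeholder for the wrapped call. A redirect of this kind depends on descriptor 2 being a valid, open descriptor in the host process; whether that assumption fails under Spyder is not established in this thread.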

I can’t reproduce this on IPython + Windows 10 (VM with 4 cores assigned), so I’m treating it as a Spyder-specific issue for the moment. (I haven’t tried to reproduce it in Spyder itself.) I’ll probably add a warning that we don’t play nicely with Spyder.
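
One generic way to back such a warning, without naming any particular IDE, is to check whether the standard streams are backed by real file descriptors. IDE consoles and notebook kernels commonly replace sys.stdout and sys.stderr with in-process objects. This sketch is an assumption about how a check could look, not what ocrmypdf does; has_real_fd and warn_if_streams_replaced are hypothetical names.

```python
import os
import sys
import warnings


def has_real_fd(stream):
    """Return True if stream is backed by a valid OS-level file descriptor."""
    try:
        os.fstat(stream.fileno())
    except (AttributeError, OSError, ValueError):
        # AttributeError: stream is None or has no fileno().
        # OSError/ValueError: fileno() unsupported, stream closed, or bad descriptor.
        return False
    return True


def warn_if_streams_replaced():
    """Hypothetical startup check: warn when stdout/stderr are not real files."""
    replaced = [
        name for name in ("stdout", "stderr")
        if not has_real_fd(getattr(sys, name))
    ]
    if replaced:
        warnings.warn(
            "standard streams without a file descriptor: " + ", ".join(replaced)
        )
    return replaced
```

Running warn_if_streams_replaced() in both a plain terminal and the IDE console would show whether the two environments differ in this respect.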