tesseract: Tesseract 4 and 5 are about 100-150 times slower than 3 on my Linux system.

Environment

Tesseract Version:

> tesseract -v

tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE

> tesseract-snap -v

tesseract 5.0.0-alpha-335-gae02
 leptonica-1.74.2
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

tesseract-ocr-eng : 1:4.00~git30-7274cfa-1

I used the training data from the Ubuntu repos for both tesseract and tesseract-snap, since no data is provided with the snap.

Platform:

Operating System: Kubuntu 19.04
KDE Plasma Version: 5.15.4
KDE Frameworks Version: 5.56.0
Qt Version: 5.12.2
Kernel Version: 5.0.0-21-generic
OS Type: 64-bit
Processors: 4 × Intel® Core™ i7-4600U CPU @ 2.10GHz
Memory: 11.6 GiB of RAM

Current Behavior:

It takes over a minute of 100% CPU load to scan an image (directly below) with two sentences:

Results for Tesseract 4:

> time tesseract -l eng 62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1

Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    1m9.096s
user    3m7.484s
sys     0m0.335s

Tesseract 5:

> time tesseract-snap -l eng 62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1

Tesseract Open Source OCR Engine v5.0.0-alpha-335-gae02 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    1m13.585s
user    3m16.104s

I tried to OCR a one-page document, but I had to kill the process. It would probably have taken an hour of full CPU load. Unfortunately I don’t have Tesseract 3 to compare, but I remember using it in an OCR screenshotting script, where it felt as fast as regular copy and paste, so definitely under two seconds for this block of text.

Expected Behavior:

It shouldn’t take this long to scan two sentences.

Suggested Fix

Disable multithreading by default until it’s fixed.
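
Until the default changes, the OpenMP thread count can be capped per invocation with the OMP_THREAD_LIMIT environment variable, the same variable named in the OCRmyPDF commits linked to this issue. The following is a minimal sketch, assuming a POSIX shell. `run_limited` is a hypothetical helper name, and the file names in the usage comment are placeholders. Whether a limit of 1 restores acceptable speed on a given machine has to be measured.

```shell
# Hypothetical helper: run any command with OpenMP capped at N threads.
# $1 = thread limit, remaining arguments = the command to run.
run_limited() {
    limit="$1"
    shift
    env OMP_THREAD_LIMIT="$limit" "$@"
}

# Example usage (placeholder file names, requires tesseract to be installed):
#   time run_limited 1 tesseract -l eng input.png output
```

Setting the variable through `env` keeps the limit local to that one command, so other OpenMP programs started from the same shell are unaffected.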

About this issue

State: closed
Created 5 years ago
Comments: 52 (16 by maintainers)

Commits related to this issue

Decide on OMP_THREAD_LIMIT more intelligently — committed to ocrmypdf/OCRmyPDF by deleted user 5 years ago
Use at most 3 Tesseract threads Based on a user suggestion and tesseract-ocr/tesseract#2611, I reviewed thread limits and found that thread limit of 3 is still beneficial, but not 4. > time env OMP_... — committed to ocrmypdf/OCRmyPDF by deleted user 5 years ago
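
The second commit message reports comparing thread limits by timing the same run under each one. A rough sketch of that kind of comparison, assuming a POSIX shell, is below. `compare_thread_limits` is a hypothetical helper, not code from OCRmyPDF, and it only measures whole seconds of wall-clock time.

```shell
# Hypothetical helper: run one command under OMP_THREAD_LIMIT=1..MAX
# and print the elapsed wall-clock seconds for each limit.
# $1 = highest limit to try, remaining arguments = the command to run.
compare_thread_limits() {
    max="$1"
    shift
    limit=1
    while [ "$limit" -le "$max" ]; do
        start=$(date +%s)
        # Discard the command's own output so only the timings are printed.
        env OMP_THREAD_LIMIT="$limit" "$@" >/dev/null 2>&1
        end=$(date +%s)
        echo "OMP_THREAD_LIMIT=$limit: $((end - start))s"
        limit=$((limit + 1))
    done
}

# Example usage (placeholder file names, requires tesseract to be installed):
#   compare_thread_limits 4 tesseract -l eng input.png output
```

The best limit depends on the machine, so the numbers from one system should not be assumed to carry over to another.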

Most upvoted comments

Out of the box, it takes about one hour to OCR a single page of text. It would take one month to OCR a textbook, and the CPU would probably fry. I think most users would consider this “completely broken,” in the sense of not being usable.

The issue affects both AVX and non-AVX systems. The program is capable of cutting down times by two orders of magnitude in both cases, as demonstrated in this thread. Why not just limit the core count by default until the issue is fixed?

Of course, one could argue that it’s up to application developers to make sure tesseract works on the target system. (I just tried a few OCR apps and most of them work fine - so it looks like they are fixing it on their end somehow).

ripefig on Aug 15, 2019