tesseract: Tesseract 4 and 5 are about 100-150 times slower than 3 on my Linux system.
Environment
- Tesseract Version:
> tesseract -v
tesseract 4.0.0
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
tesseract-snap -v
tesseract 5.0.0-alpha-335-gae02
leptonica-1.74.2
libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
Found AVX2
Found AVX
Found FMA
Found SSE
tesseract-ocr-eng : 1:4.00~git30-7274cfa-1
I used the training data from the Ubuntu repos for both tesseract and tesseract-snap, since no data is provided with the snap.
- Platform:
Operating System: Kubuntu 19.04
KDE Plasma Version: 5.15.4
KDE Frameworks Version: 5.56.0
Qt Version: 5.12.2
Kernel Version: 5.0.0-21-generic
OS Type: 64-bit
Processors: 4 × Intel® Core™ i7-4600U CPU @ 2.10GHz
Memory: 11.6 GiB of RAM
Current Behavior:
It takes over a minute at 100% CPU load to scan an image (directly below) containing two sentences:
Tesseract 4:
> time tesseract -l eng 62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
real 1m9.096s
user 3m7.484s
sys 0m0.335s
Tesseract 5:
> time tesseract-snap -l eng 62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1
Tesseract Open Source OCR Engine v5.0.0-alpha-335-gae02 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
real 1m13.585s
user 3m16.104s
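One way to read the timings above: when `user` time exceeds `real` time, the work was spread over several threads. The following is a minimal sketch of that arithmetic using the figures from both runs. The helper names `to_seconds` and `busy_cores` are hypothetical, introduced only for illustration, and `sys` time is ignored because it is negligible here.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '1m9.096s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def busy_cores(real: str, user: str) -> float:
    """Average number of cores kept busy: user CPU time / wall-clock time."""
    return to_seconds(user) / to_seconds(real)


# Figures copied from the two runs reported in this issue.
print(f"Tesseract 4: {busy_cores('1m9.096s', '3m7.484s'):.2f} cores busy on average")
print(f"Tesseract 5: {busy_cores('1m13.585s', '3m16.104s'):.2f} cores busy on average")
```

Both ratios land between 2.6 and 2.8, so roughly three of the four logical processors were busy for the whole run, which is consistent with the slowdown being tied to multithreading.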
I tried to OCR a one-page document, but I had to kill the process. It would probably have taken an hour at full CPU load.
Unfortunately, I don’t have Tesseract 3 to compare against, but I remember using it in an OCR screenshot script, where it felt as fast as a regular copy and paste, so definitely under two seconds for this block of text.
Expected Behavior:
It shouldn’t take this long to scan two sentences.
Suggested Fix
Disable multithreading by default until it’s fixed.
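Until a default like that exists, a caller can apply the limit itself. This is a minimal sketch, assuming the slowdown comes from OpenMP threading and that the standard OpenMP variable `OMP_THREAD_LIMIT` (the one named in the related commits) is honored by the build in use. The function names `thread_limited_env` and `run_tesseract` are hypothetical.

```python
import os
import subprocess


def thread_limited_env(base_env=None, limit=1):
    """Return a copy of the environment with OMP_THREAD_LIMIT set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(limit)
    return env


def run_tesseract(image, output_base, lang="eng", limit=1):
    """Run the tesseract CLI with the OpenMP thread limit applied.

    Mirrors the invocation used in this issue:
        tesseract -l eng <image> <output_base>
    """
    cmd = ["tesseract", "-l", lang, image, output_base]
    return subprocess.run(cmd, env=thread_limited_env(limit=limit), check=True)


# Example (hypothetical file name): run_tesseract("page.png", "page")
```

Setting the variable per call rather than globally leaves other OpenMP programs on the system unaffected.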
About this issue
- State: closed
- Created 5 years ago
- Comments: 52 (16 by maintainers)
Commits related to this issue
- Decide on OMP_THREAD_LIMIT more intelligently (committed to ocrmypdf/OCRmyPDF by deleted user 5 years ago)
- Use at most 3 Tesseract threads: "Based on a user suggestion and tesseract-ocr/tesseract#2611, I reviewed thread limits and found that thread limit of 3 is still beneficial, but not 4. > time env OMP_..." (committed to ocrmypdf/OCRmyPDF by deleted user 5 years ago)
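The idea behind these commits, capping the thread count at 3, can be sketched as follows. This is not OCRmyPDF's actual code; the names `pick_thread_limit` and `apply_default_limit` are hypothetical, and the cap of 3 is taken only from the commit message above.

```python
import os


def pick_thread_limit(cpu_count=None, cap=3):
    """Choose a thread limit: at most `cap`, at least 1."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(1, min(cpu_count, cap))


def apply_default_limit(env, cpu_count=None, cap=3):
    """Set OMP_THREAD_LIMIT in `env` unless the user already chose a value."""
    env = dict(env)
    env.setdefault("OMP_THREAD_LIMIT", str(pick_thread_limit(cpu_count, cap)))
    return env
```

Using `setdefault` means an explicit `OMP_THREAD_LIMIT` from the user always wins over the computed default.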

Out of the box, it takes about one hour to OCR a single page of text. It would take one month to OCR a textbook, and the CPU would probably fry. I think most users would consider this “completely broken,” in the sense of not being usable.
The issue affects both AVX and non-AVX systems. The program is capable of cutting down times by two orders of magnitude in both cases, as demonstrated in this thread. Why not just limit the core count by default until the issue is fixed?
Of course, one could argue that it’s up to application developers to make sure tesseract works on the target system. (I just tried a few OCR apps and most of them work fine - so it looks like they are fixing it on their end somehow).