tesseract: Tesseract 4.0 hangs when processing a particular image

Environment

  • Tesseract Version: tesseract 4.0.0-beta.1 leptonica-1.75.3 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
  • Platform: Ubuntu 18.04.1 LTS

Current Behavior:

Tesseract hangs when running the following command: tesseract failed-image.jpeg output.txt

Output message:

Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 207

Tesseract neither exits nor prints any further message after that. Other images work fine; I only have trouble processing this particular image. I also noticed that the image produced after processing by Tesseract (or Leptonica?) looks odd, but I don't know whether that is related.

failed-image.jpeg: https://drive.google.com/open?id=1HsgCbtuNpgf_XxzjkekXU9-uuiWDsV0H
tessinput.tif: https://drive.google.com/open?id=1sE8Nn5rykSWPT6PMF3nFSonPMT9y-H61

Expected Behavior:

Tesseract should either give an error message or finish OCR on the image, even if the image quality is bad.
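
Until the hang itself is resolved, one way to approximate the error-or-finish behavior described above is to bound the run from outside. The following is a minimal sketch and not part of Tesseract: `run_with_limit` is a hypothetical helper name, the 120s limit in the example is arbitrary, and it assumes GNU coreutils `timeout` is installed (as it is on Ubuntu).

```shell
# Hypothetical helper: run a command under a wall-clock limit.
# Assumes GNU coreutils `timeout`, which exits with status 124
# when the limit is reached and the command had to be stopped.
run_with_limit() {
    limit="$1"
    shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        # Turn a silent hang into an explicit error message.
        echo "timed out after ${limit}: $*" >&2
    fi
    return "$rc"
}

# Example, using the file name from this report (limit chosen arbitrarily):
#   run_with_limit 120s tesseract failed-image.jpeg output
```

A calling script can then treat exit status 124 as "this image hung" and move on to the next file instead of blocking indefinitely.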

About this issue

  • State: open
  • Created 5 years ago
  • Comments: 18 (7 by maintainers)

Most upvoted comments

@saikalyan9981 It works fine with the current code from the repo. The time taken varies with the traineddata file being used.

(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata_best
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m33.252s
user    1m47.232s
sys     0m0.826s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata_fast
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m12.468s
user    0m30.834s
sys     0m0.593s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m18.681s
user    0m53.303s
sys     0m0.714s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata --oem 0
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m19.286s
user    0m54.827s
sys     0m0.696s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata --oem 1
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m18.088s
user    0m51.650s
sys     0m0.760s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata --oem 2
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m19.176s
user    0m54.583s
sys     0m0.744s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata --oem 3
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m19.216s
user    0m54.951s
sys     0m0.682s
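
The comparison above can be repeated on other images with a small loop. This is only a sketch: `time_cmd` is a hypothetical helper, it reports whole seconds via `date +%s` rather than the finer output of `time`, and the directory names in the example are the ones used in the transcript.

```shell
# Hypothetical helper: run a command quietly and print the elapsed
# wall-clock time in whole seconds.
time_cmd() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example, mirroring the runs above (paths as used in the transcript):
#   for dir in ~/tessdata_best ~/tessdata_fast ~/tessdata; do
#       secs=$(time_cmd tesseract 2288.png output --tessdata-dir "$dir")
#       printf '%s: %ss\n' "$dir" "$secs"
#   done
```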