tesseract: Unable to detect simple math equations using pytesseract

Similar to https://github.com/tesseract-ocr/tesseract/issues/2204 and https://github.com/tesseract-ocr/tesseract/issues/1890

Environment

  • Tesseract Version: tesseract v5.0.0-alpha.20200328
  • Platform: Windows 10 64-bit

Current Behavior:

I downloaded the latest traineddata files from tessdata and tried the following code:

import cv2
import pytesseract

# file_path is the path to the input image
img = cv2.imread(file_path)
hocr_data = pytesseract.image_to_pdf_or_hocr(img, extension='hocr', lang="eng+equ")
with open("test.html", 'wb') as f:
    f.write(hocr_data)
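Since the call above asks for lang="eng+equ", one thing worth ruling out first is a missing traineddata file. The sketch below is only an illustration and not part of the original report: the helper name missing_traineddata and the tessdata_dir argument are hypothetical, and it relies only on the fact that Tesseract looks for one <lang>.traineddata file per language in the tessdata directory.

```python
from pathlib import Path


def missing_traineddata(lang, tessdata_dir):
    """Return the language codes in `lang` that have no .traineddata file.

    `lang` uses Tesseract's "a+b" syntax, e.g. "eng+equ".
    `tessdata_dir` is the folder holding the *.traineddata files
    (hypothetical argument; point it at your own installation).
    """
    folder = Path(tessdata_dir)
    codes = [code for code in lang.split("+") if code]
    # Keep only the codes whose model file is absent from the folder.
    return [code for code in codes
            if not (folder / f"{code}.traineddata").is_file()]
```

For example, missing_traineddata("eng+equ", tessdata_dir) returns an empty list when both files are present. An empty result only confirms that the files exist; it says nothing about how well the equ model recognizes equations.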

Input file: 006 (image)

Output looks something like this: chrome_k33xFY87or (screenshot)

Expected Behavior:

It should detect those equations.

Suggested Fix:

Use LaTeX as a new language for detecting math equations. Then we can easily put those LaTeX math equations into the hOCR file as mentioned at this link
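To make the suggestion concrete, here is a minimal sketch of what putting a LaTeX string into an hOCR file could look like. It is not something Tesseract or pytesseract produces today. The helper name latex_to_hocr_span is hypothetical, and using the ocr_math class with the LaTeX source stored as the element text is an assumption about how such output might be encoded.

```python
from html import escape


def latex_to_hocr_span(latex, bbox, span_id):
    """Wrap a LaTeX string in an hOCR-style span (hypothetical helper).

    Assumptions: the element class is 'ocr_math' and the LaTeX source
    is stored as the (HTML-escaped) text of the span.
    `bbox` is (x0, y0, x1, y1) in image pixels.
    """
    x0, y0, x1, y1 = bbox
    title = f"bbox {x0} {y0} {x1} {y1}"
    # escape() protects characters such as < and & that are common in math.
    return (f"<span class='ocr_math' id='{escape(span_id)}' "
            f"title='{escape(title)}'>{escape(latex)}</span>")
```

Calling latex_to_hocr_span(r"x^2 < y", (10, 20, 110, 60), "math_1") yields a span whose text is the escaped LaTeX source. Recognizing the equation in the first place is the hard part and is not addressed by this sketch.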

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 20 (3 by maintainers)

Most upvoted comments

Leaving it for now; maybe I will try it later.

But I think it would be much better if tesseract had LaTeX as another language option for research documents or any kind of document that has math equations.

tesseract is not suitable for this kind of input text.