tesseract: Unable to detect simple math equations using pytesseract

Environment

Tesseract Version: tesseract v5.0.0-alpha.20200328
Platform: Windows 10 64-bit

Current Behavior:

I downloaded the latest traineddata files from tessdata and tried the following code:

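import cv2
import pytesseract

img = cv2.imread(file_path)  # file_path: path to the input image
# image_to_pdf_or_hocr returns bytes, so the file is opened in binary mode
hocr_data = pytesseract.image_to_pdf_or_hocr(img, extension='hocr', lang="eng+equ")
with open("test.html", "wb") as f:
    f.write(hocr_data)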

Input file 006

Output looks something like chrome_k33xFY87or
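
To narrow down which parts of the page are being misread, it may help to list the low-confidence words in the hOCR output instead of eyeballing the rendered HTML. The following is only a rough sketch: it assumes the usual Tesseract hOCR layout (word spans with class `ocrx_word` and an `x_wconf` value in the `title` attribute). The names `HocrWordParser` and `low_confidence_words` and the threshold of 60 are made up for illustration and are not part of pytesseract.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, confidence) pairs from hOCR word spans (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []    # finished (text, confidence) pairs
        self._depth = 0    # > 0 while inside an ocrx_word span
        self._conf = -1    # confidence of the word currently being read
        self._text = []    # text fragments of that word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._depth > 0:
            # A nested span inside a word: just track the nesting level.
            self._depth += 1
        elif "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"x_wconf\s+(\d+)", attrs.get("title") or "")
            self._conf = int(match.group(1)) if match else -1
            self._text = []
            self._depth = 1

    def handle_data(self, data):
        if self._depth > 0:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._text).strip(), self._conf))


def low_confidence_words(hocr, threshold=60):
    """Return the words whose confidence is below `threshold` (an arbitrary cut-off)."""
    if isinstance(hocr, bytes):
        hocr = hocr.decode("utf-8")
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return [(text, conf) for text, conf in parser.words if conf < threshold]
```

Calling `low_confidence_words(hocr_data)` on the bytes returned by `image_to_pdf_or_hocr` should give a list of suspicious words with their confidence values. If the equation regions show up there with low scores, that would at least confirm where the recognition is failing.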

Expected Behavior:

It should detect those equations.

Suggested Fix:

Add LaTeX as a new language for detecting math equations. Then those LaTeX math equations could easily be put into the hOCR file, as mentioned at this link

About this issue

State: closed
Created 4 years ago
Reactions: 2
Comments: 20 (3 by maintainers)

Most upvoted comments

Leaving it for now; maybe I will try it later.

But I think it would be much better if Tesseract had LaTeX as another language option for research documents, or any kind of document that contains math equations.

+11

NavpreetDevpuri on Jun 19, 2020

Try this: https://github.com/lukas-blecher/LaTeX-OCR

For a full OCR version, try this: https://github.com/breezedeus/Pix2Text

Shadow-Alex on Jun 22, 2023

Tesseract is not suitable for this kind of input text.

zdenop on Jun 19, 2020