tesseract: Unable to detect simple math equations using pytesseract
Similar to https://github.com/tesseract-ocr/tesseract/issues/2204 and https://github.com/tesseract-ocr/tesseract/issues/1890
Environment
- Tesseract Version: tesseract v5.0.0-alpha.20200328
- Platform: Windows 10 64-bit
Current Behavior:
I downloaded the latest traineddata files from tessdata and tried the following code:
import cv2
import pytesseract

img = cv2.imread(file_path)
hocr_data = pytesseract.image_to_pdf_or_hocr(img, extension='hocr', lang="eng+equ")
with open("test.html", 'w+b') as f:
    f.write(hocr_data)
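To see which parts of the page Tesseract actually recognized, it can help to list the words in the hOCR output together with their confidence values. The following is only a minimal sketch using the standard library: it assumes the output uses `ocrx_word` spans with an `x_wconf` property in the `title` attribute, which is how Tesseract's hOCR usually looks. The names `HocrWordParser`, `parse_wconf` and `hocr_words` are hypothetical helpers, not part of pytesseract.

```python
from html.parser import HTMLParser


def parse_wconf(title):
    """Return the x_wconf value from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 2 and parts[0] == "x_wconf":
            return float(parts[1])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, confidence) pairs from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, confidence) tuples
        self._depth = 0    # nesting depth inside the current word span
        self._text = []
        self._conf = None

    def handle_starttag(self, tag, attrs):
        if self._depth:    # already inside a word span (e.g. <strong>)
            self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._depth = 1
            self._text = []
            self._conf = parse_wconf(attrs.get("title") or "")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:   # the word span just closed
                self.words.append(("".join(self._text).strip(), self._conf))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def hocr_words(hocr):
    """Parse hOCR given as bytes or str and return its words."""
    if isinstance(hocr, bytes):
        hocr = hocr.decode("utf-8")
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Calling `hocr_words(hocr_data)` on the result above should give a list of `(text, confidence)` pairs; regions containing equations would be expected to show up as missing or low-confidence words.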
Input file

Output looks something like

Expected Behavior:
It should detect those equations.
Suggested Fix:
Add LaTeX as a new language for detecting math equations. The recognized LaTeX equations could then easily be embedded in the hOCR file, as mentioned at this link
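As a rough illustration of what embedding LaTeX in hOCR might look like, here is a small sketch. It is not something Tesseract produces today. The hOCR specification lists an `ocr_math` class, but as far as I know that class is meant for MathML or image content, so carrying LaTeX text in it is purely this sketch's own convention. The function name `latex_to_hocr_span` is hypothetical.

```python
from html import escape


def latex_to_hocr_span(latex, bbox):
    """Wrap a LaTeX string in an hOCR-style span with a bounding box.

    bbox is (x0, y0, x1, y1) in pixels. Using class 'ocr_math' for
    LaTeX text is an assumption of this sketch, not Tesseract behavior.
    """
    x0, y0, x1, y1 = bbox
    title = f"bbox {x0} {y0} {x1} {y1}"
    # Escape both values so characters like < and & stay valid HTML.
    return f"<span class='ocr_math' title='{escape(title)}'>{escape(latex)}</span>"
```

For example, `latex_to_hocr_span(r"\frac{a}{b} < 1", (10, 20, 110, 60))` returns a span whose text is the escaped LaTeX source, which a viewer could later render with a math typesetting library.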
About this issue
- State: closed
- Created 4 years ago
- Reactions: 2
- Comments: 20 (3 by maintainers)
Leaving it for now; maybe I will try it later.
But I think it would be much better if Tesseract had LaTeX as another language option for research documents or any kind of document that contains math equations.
Try this: https://github.com/lukas-blecher/LaTeX-OCR.
For a full ocr version, try this : https://github.com/breezedeus/Pix2Text
Tesseract is not suitable for this kind of input text.