tesseract: Method WordFontAttributes does not work
Environment
- Tesseract Version: tesseract 4.00.00alpha
- Commit Number: 8e55e52be749b3faccca8ae41abdc0e3d3c7f887
- Platform: Ubuntu 16.04.1
Current Behavior:
Method WordFontAttributes returns null when using tesseract 4.00.00alpha with 4.00 tessdata, but it returns the font name when using tesseract 4.00.00alpha with 3.04.00 tessdata. The test image is eurotext.tif. I first ran into this problem when using tesserocr: [tesserocr#68](https://github.com/sirfz/tesserocr/issues/68).
Expected Behavior:
Method WordFontAttributes should return the correct font attributes of recognized words.
About this issue
- State: closed
- Created 7 years ago
- Reactions: 10
- Comments: 41 (6 by maintainers)
Commits related to this issue
- Make font size estimation work with the lstm engine (#1173) **Partial** fix for issue #1074 — committed to tesseract-ocr/tesseract by amitdo 7 years ago
I have reasons to believe that the new LSTM engine is unlikely to have a feature that includes font identification (name and properties like is_bold) in the near future.
Important note: I’m a contributor from the community, and the main developer does not always share all his plans for upcoming release(s) with the community.
The problem:
With the LSTM engine, it_->word()->fontinfo will always be NULL, so pointsize has no chance of being calculated. pointsize is calculated based on row (=line) height. pointsize is the font size in points of the line, so it should not be in WordFontAttributes().
There is another function where you can get row height.
I think pointsize calculation should be moved into this function.
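As a rough illustration of the row-height-based calculation described above, here is a minimal Python sketch. It assumes the row height is measured in pixels and that the image resolution in DPI is known. The function name and parameters are hypothetical and are not part of the Tesseract API; the only fixed fact used is that there are 72 points per inch.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def row_height_to_pointsize(row_height_px: float, dpi: int) -> int:
    """Convert a text row height in pixels to a rounded size in points.

    Hypothetical helper for illustration only, not Tesseract's API.
    Returns 0 when the resolution is unknown (dpi <= 0).
    """
    if dpi <= 0:
        return 0
    # Pixels -> inches -> points, rounded to the nearest integer.
    return int(row_height_px * POINTS_PER_INCH / dpi + 0.5)
```

For example, a 50-pixel row in a 300 DPI scan would come out at 12 points under these assumptions. Because the input is a property of the row, a helper like this does not need any per-word font information.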
Hello! Is this issue still open? I need to get some font properties from a scanned PDF, such as whether text is bold or underlined. WordFontAttributes is returning None. Any suggestions on what I can use to get these properties?
Thanks!
It would be feasible to add bold and italic attributes by making them a separate output from the model. Underline would also be possible. All these attributes would require changes to the rendering pipeline and to the data path for the ground truth. Fixed-pitch (monospace), serif and smallcaps would be much more difficult, due to the lack of reliable data available for the fonts. It could be possible to re-use the existing fontinfo table for that. I wouldn’t rule it out as impossible, but I will add this request to my list of stoppers for obsoleting the old engine. I have a bunch of updates to push, which I didn’t quite get to before my office move…
The relative font size of a textline can be estimated by calculating the x-height of the line and comparing it to the median x-height of the other textlines on the page.
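The comparison against the median can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the x-heights are assumed to be already available as a list of numbers (one per textline, in pixels), and the function name is hypothetical rather than anything Tesseract or tesserocr provides.

```python
from statistics import median


def relative_line_sizes(xheights):
    """Return, for each line, its x-height divided by the median
    x-height of all the *other* lines on the page.

    Hypothetical helper for illustration. A ratio of 1.0 is returned
    when there is no usable reference (single line, or median of 0).
    """
    xheights = list(xheights)
    ratios = []
    for i, height in enumerate(xheights):
        # Leave the current line out of its own reference set.
        others = xheights[:i] + xheights[i + 1:]
        reference = median(others) if others else 0
        ratios.append(height / reference if reference > 0 else 1.0)
    return ratios
```

A ratio well above 1.0 suggests a heading-sized line, and a ratio well below 1.0 suggests small print such as a footnote. Using the median rather than the mean keeps a few very large or very small lines from skewing the reference.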