tesseract: Method WordFontAttributes does not work

Environment

  • Tesseract Version: tesseract 4.00.00alpha
  • Commit Number: 8e55e52be749b3faccca8ae41abdc0e3d3c7f887
  • Platform: Ubuntu 16.04.1

Current Behavior:

WordFontAttributes returns null when tesseract 4.00.00alpha is used with 4.00 tessdata, but it returns the font name when the same build is used with 3.04.00 tessdata. The test image is eurotext.tif. I first ran into this problem while using tesserocr, see tesserocr#68 (https://github.com/sirfz/tesserocr/issues/68).

Expected Behavior:

WordFontAttributes should return the correct font attributes of the recognized words.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 10
  • Comments: 41 (6 by maintainers)

Most upvoted comments

Do you mean that just this method won’t be supported, or the feature in general?

I have reasons to believe that the new LSTM engine is unlikely to have a feature that includes font identification (name and properties like is_bold) in the near future.

Important note: I’m a contributor from the community, and the main developer does not always share all his plans for upcoming releases with the community.

// Returns the font attributes of the current word. If iterating at a higher
// level object than words, eg textlines, then this will return the
// attributes of the first word in that textline.
// The actual return value is a string representing a font name. It points
// to an internal table and SHOULD NOT BE DELETED. Lifespan is the same as
// the iterator itself, ie rendered invalid by various members of
// TessBaseAPI, including Init, SetImage, End or deleting the TessBaseAPI.
// Pointsize is returned in printers points (1/72 inch.)
const char* LTRResultIterator::WordFontAttributes(bool* is_bold,
                                                  bool* is_italic,
                                                  bool* is_underlined,
                                                  bool* is_monospace,
                                                  bool* is_serif,
                                                  bool* is_smallcaps,
                                                  int* pointsize,
                                                  int* font_id) const {
  if (it_->word() == NULL) return NULL;  // Already at the end!
  if (it_->word()->fontinfo == NULL) {
    *font_id = -1;
    return NULL;  // No font information.
  }
  const FontInfo& font_info = *it_->word()->fontinfo;
  *font_id = font_info.universal_id;
  *is_bold = font_info.is_bold();
  *is_italic = font_info.is_italic();
  *is_underlined = false;  // TODO(rays) fix this!
  *is_monospace = font_info.is_fixed_pitch();
  *is_serif = font_info.is_serif();
  *is_smallcaps = it_->word()->small_caps;
  float row_height = it_->row()->row->x_height() +
      it_->row()->row->ascenders() - it_->row()->row->descenders();
  // Convert from pixels to printers points.
  *pointsize = scaled_yres_ > 0
      ? static_cast<int>(row_height * kPointsPerInch / scaled_yres_ + 0.5)
      : 0;

  return font_info.name;
}

The problem:

if (it_->word()->fontinfo == NULL) {
    *font_id = -1;
    return NULL;  // No font information.
}

With the LSTM engine, it_->word()->fontinfo will always be NULL, so the function returns early and pointsize is never calculated.
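To make the ordering problem concrete, here is a minimal sketch of one possible workaround: compute pointsize before the fontinfo check. This is not the Tesseract source. RowSketch, FontInfoSketch, WordSketch, WordFontAttributesSketch and kPointsPerInchSketch are hypothetical stand-ins for the internal types, and the sketch assumes descenders is stored as a negative offset, which is what the subtraction in the original formula suggests.

```cpp
#include <cassert>

// Hypothetical stand-ins for Tesseract's internal types (not the real classes).
struct RowSketch {
  float x_height;
  float ascenders;
  float descenders;  // Assumed to be a negative offset below the baseline.
};

struct FontInfoSketch {
  const char* name;
  bool is_bold;
  bool is_italic;
};

struct WordSketch {
  const FontInfoSketch* fontinfo;  // nullptr when no font data is available.
};

// Assumed value of the points-per-inch constant (printer's points).
constexpr float kPointsPerInchSketch = 72.0f;

// Same logic as the excerpt above, but pointsize is filled in before the
// fontinfo check, so the early return no longer skips it.
const char* WordFontAttributesSketch(const WordSketch& word,
                                     const RowSketch& row, int scaled_yres,
                                     bool* is_bold, bool* is_italic,
                                     int* pointsize) {
  assert(is_bold != nullptr && is_italic != nullptr && pointsize != nullptr);
  const float row_height = row.x_height + row.ascenders - row.descenders;
  // Convert from pixels to printer's points.
  *pointsize = scaled_yres > 0
      ? static_cast<int>(row_height * kPointsPerInchSketch / scaled_yres + 0.5f)
      : 0;
  if (word.fontinfo == nullptr) {
    *is_bold = false;
    *is_italic = false;
    return nullptr;  // No font information, but pointsize is already set.
  }
  *is_bold = word.fontinfo->is_bold;
  *is_italic = word.fontinfo->is_italic;
  return word.fontinfo->name;
}
```

With this ordering a caller still gets a null font name from the LSTM engine, but pointsize holds a usable value.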

pointsize is calculated from the row (= line) height. It is the font size of the line in points, so it does not belong in WordFontAttributes().

There is another function that already returns the row height:

void LTRResultIterator::RowAttributes(float* row_height, float* descenders,
                                      float* ascenders) const {
  *row_height = it_->row()->row->x_height() + it_->row()->row->ascenders() -
                it_->row()->row->descenders();
  *descenders = it_->row()->row->descenders();
  *ascenders = it_->row()->row->ascenders();
}

I think pointsize calculation should be moved into this function.
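A rough sketch of what that move could look like, under stated assumptions: RowMetricsSketch, RowAttributesWithPointSize and kPointsPerInchAssumed are hypothetical names, the extra scaled_yres and pointsize parameters are not part of the real RowAttributes signature, and the row metrics are passed in directly instead of being read from the iterator.

```cpp
#include <cassert>

// Hypothetical plain-data stand-in for the row metrics used by RowAttributes.
struct RowMetricsSketch {
  float x_height;
  float ascenders;
  float descenders;  // Assumed to be a negative offset below the baseline.
};

// Assumed points-per-inch constant (printer's points).
constexpr float kPointsPerInchAssumed = 72.0f;

// Row-level attributes plus pointsize, computed with the same rounding as the
// expression currently inside WordFontAttributes.
void RowAttributesWithPointSize(const RowMetricsSketch& row, int scaled_yres,
                                float* row_height, float* descenders,
                                float* ascenders, int* pointsize) {
  assert(row_height != nullptr && descenders != nullptr);
  assert(ascenders != nullptr && pointsize != nullptr);
  *row_height = row.x_height + row.ascenders - row.descenders;
  *descenders = row.descenders;
  *ascenders = row.ascenders;
  // Convert from pixels to printer's points; 0 when the resolution is unknown.
  *pointsize = scaled_yres > 0
      ? static_cast<int>(*row_height * kPointsPerInchAssumed / scaled_yres + 0.5f)
      : 0;
}
```

Since nothing here touches fontinfo, the result would not depend on which recognition engine produced the word.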

Hello! Is this issue still open? I need to get some font properties from a scanned PDF, such as whether text is bold or underlined. WordFontAttributes is returning None. Any suggestions on what I can use to get these properties?

Thanks!

It would be feasible to add bold and italic attributes by making them a separate output from the model. Underline would also be possible. All these attributes would require changes to the rendering pipeline and to the data path for the ground truth. Fixed-pitch (monospace), serif and smallcaps would be much more difficult, due to the lack of reliable data available for the fonts. It could be possible to re-use the existing fontinfo table for that. I wouldn’t rule it out as impossible, but I will add this request to my list of stoppers for obsoleting the old engine. I have a bunch of updates to push, which I didn’t quite get to before my office move…

The relative font size of a textline can be estimated by calculating the x-height of the line and comparing it to the median x-height of the other textlines on the page.
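The estimate can be sketched as follows, assuming the per-line x-heights have already been collected into a vector (for example from a row-level call such as RowAttributes). MedianSketch and RelativeLineSizes are hypothetical helper names, and for simplicity the median is taken over all lines on the page, including the line being measured, not strictly the other lines.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a non-empty list of values (takes a copy so it can sort it).
float MedianSketch(std::vector<float> values) {
  assert(!values.empty());
  std::sort(values.begin(), values.end());
  const std::size_t n = values.size();
  return n % 2 == 1 ? values[n / 2]
                    : 0.5f * (values[n / 2 - 1] + values[n / 2]);
}

// Returns, for each line, its x-height divided by the page's median x-height.
// A ratio near 1 means body-text size; larger values suggest headings.
std::vector<float> RelativeLineSizes(const std::vector<float>& x_heights) {
  std::vector<float> ratios;
  if (x_heights.empty()) return ratios;
  const float median = MedianSketch(x_heights);
  ratios.reserve(x_heights.size());
  for (float h : x_heights) {
    // Guard against a degenerate page where the median is not positive.
    ratios.push_back(median > 0.0f ? h / median : 0.0f);
  }
  return ratios;
}
```

The ratios are relative only: they say how a line compares to the typical line on the same page, not what its size is in points.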