ocr-fileformat: alto2hocr: Content in BottomMargin is not considered (PrintSpace node is missing in this example)

cf #95

I am trying to produce hOCR from the latest ALTO output of ABBYY. The header of the ALTO file is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/alto-v2.0.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<OCRProcessing ID="IdOcr"><ocrProcessingStep><processingDateTime>2019-08-29</processingDateTime><processingSoftware><softwareCreator>ABBYY</softwareCreator><softwareName>ABBYY FineReader Engine</softwareName><softwareVersion>12</softwareVersion></processingSoftware></ocrProcessingStep></OCRProcessing>
</Description>
<Styles>
</Styles>
...

But when I run

ocr-transform alto2.0 hocr in.alto out.hocr

I only get a header and no content:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="" lang=""><head><title>Image: </title><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><meta name="ocr-system" content="ABBYY FineReader Engine 12"/><meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_par ocr_line ocrx_word"/></head><body><div class="ocr_page" id="Page1" title="image ; bbox 0 0 2480 3507; ppageno 0"/><div class="ocr_page" id="Page2" title="image ; bbox 0 0 2480 3507; ppageno 0"/></body></html>

@zuphilip Any ideas on how to proceed?

Thanks!

About this issue

  • State: open
  • Created 5 years ago
  • Comments: 15 (12 by maintainers)

Most upvoted comments

@kba I do not see any content in the margin elements, so the transformation will produce no output.
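
A quick way to check where the text sits in a given ALTO file is to count the String elements under each direct child of Page (PrintSpace, TopMargin, BottomMargin, and so on). The following is only a minimal sketch using Python's standard library; the function name `strings_per_region` is made up for illustration, and the namespace is assumed to be the ALTO v2 one from the header quoted in the report. It is not part of ocr-fileformat.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: ALTO v2 namespace, as declared in the header quoted in the report.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"


def strings_per_region(root: ET.Element) -> Counter:
    """Count <String> descendants under each direct child of every <Page>.

    Hypothetical helper for diagnosis only. Regions that are present but
    hold no text are reported with a count of 0.
    """
    counts = Counter()
    for page in root.iter(f"{{{ALTO_NS}}}Page"):
        for region in page:
            # Strip the "{namespace}" prefix to get the bare element name.
            name = region.tag.split("}", 1)[-1]
            counts[name] += sum(1 for _ in region.iter(f"{{{ALTO_NS}}}String"))
    return counts
```

Calling `strings_per_region(ET.parse("in.alto").getroot())` on the input file would show whether a PrintSpace element exists at all and whether any margin element actually carries text.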

I think the Top and Bottom margins have been fixed now.

What should be done with the Left and Right margins? The hOCR spec specifies no corresponding float elements for them (nothing like ocr_header and ocr_footer).

If there are no real-life examples with Left/Right margins, I suggest closing this issue and opening a new one at https://github.com/filak/hOCR-to-ALTO if such a case pops up someday. We can then discuss how to implement it.