core: Invalid structMap produced

This might be a bug in ocrd-cis actually, so beware.

We encountered a number of problems elsewhere due to an invalid physical structMap. Here, I managed to reproduce it with the latest ocrd:all/maximum Docker image, using the following steps:

  1. Starting with the workspace here: https://ub-backup.bib.uni-mannheim.de/~stweil/quiver-benchmark/workflows/workspaces/reichsanzeiger_random_selected_pages_ocr/data/reichsanzeiger_random/
  2. I removed all filegroups except OCR-D-IMG and OCR-D-GT-SEG-LINE, using ocrd workspace remove-group -rf. → After this, the structMap is OK!
  3. Then I ran ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN → After this, the structMap is INVALID

Invalid structMap (multiple divs with the same ID) after step 3, shortened to one physical page for clarity:

  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
        <mets:fptr FILEID="OCR-D-GT-SEG-LINE_1879_45_0344"/>
      </mets:div>

       ...

      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
      </mets:div>
  
       ...

    </mets:div>
  </mets:structMap>

(I’ll upload the full data in the comments)
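To check a workspace for this symptom without running a full schema validation, something like the following might help. It is only a minimal sketch using the Python standard library, not anything from ocrd core: the function name duplicate_page_ids and the SAMPLE string are made up for illustration, and it only looks at page-level divs in the physical structMap.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# METS namespace as used by the mets: prefix in the snippet above.
METS_NS = {"mets": "http://www.loc.gov/METS/"}


def duplicate_page_ids(mets_xml):
    """Return {ID: count} for page divs whose ID occurs more than once
    in the physical structMap. (Hypothetical helper, not an ocrd API.)"""
    root = ET.fromstring(mets_xml)
    divs = root.findall(
        ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']",
        METS_NS,
    )
    counts = Counter(div.get("ID") for div in divs)
    return {page_id: n for page_id, n in counts.items() if n > 1}


# Illustrative input modelled on the shortened structMap above.
SAMPLE = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
        <mets:fptr FILEID="OCR-D-GT-SEG-LINE_1879_45_0344"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""

if __name__ == "__main__":
    print(duplicate_page_ids(SAMPLE))  # {'P_1879_45_0344': 2}
```

For a real workspace, the contents of mets.xml would be read from disk and passed in instead of SAMPLE; an empty result means no page ID is repeated.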

This causes all kinds of breakage all over the place.

What I haven’t checked yet is whether this only breaks with ocrd_cis; maybe @bertsky can share his debugging efforts here. At first I had the impression that it also breaks with add, but I was trying to reproduce a problem @stweil encountered in OCR-D/quiver-benchmarks#22, and that workflow uses ocrd_cis as its first step, so the cause could have been ocrd_cis all along and I may simply have confused something.

About this issue

  • State: closed
  • Created 4 months ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

Much simpler way to reproduce:

export OCRD_METS_CACHING=1
git clone https://github.com/OCR-D/assets
cd assets/data/SBB0000F29300010000/data
ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P model default-2021-03-09
eval declare -A XSD_PATHS=($(ocrd bashlib constants XSD_PATHS))
XSD_METS=${XSD_PATHS[$(ocrd bashlib constants XSD_METS_URL)]}
xmllint --schema "$XSD_METS" --noout mets.xml

The culprit is the caching.

Alright, so it’s not ocrd-cis, but definitely a bug in core and a severe one!