core: Invalid structMap produced
This might be a bug in ocrd-cis actually, so beware.
We encountered a number of problems elsewhere due to an invalid physical structMap. I managed to reproduce this with the latest ocrd/all:maximum Docker image, using the following steps:
- Starting with the workspace here: https://ub-backup.bib.uni-mannheim.de/~stweil/quiver-benchmark/workflows/workspaces/reichsanzeiger_random_selected_pages_ocr/data/reichsanzeiger_random/
- I removed all filegroups except OCR-D-IMG and OCR-D-GT-SEG-LINE, using ocrd workspace remove-group -rf. → After this, the structMap is OK!
- Then I ran ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN → After this, the structMap is INVALID
Invalid structMap (multiple divs with same ID) after step 2, shortened to one physical page for emphasis:
<mets:structMap TYPE="PHYSICAL">
  <mets:div TYPE="physSequence">
    <mets:div TYPE="page" ID="P_1879_45_0344">
      <mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
      <mets:fptr FILEID="OCR-D-GT-SEG-LINE_1879_45_0344"/>
    </mets:div>
    ...
    <mets:div TYPE="page" ID="P_1879_45_0344">
      <mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
    </mets:div>
    ...
  </mets:div>
</mets:structMap>
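For anyone who wants to check their own mets.xml for this symptom, here is a minimal sketch that counts page divs sharing the same ID in the physical structMap. It uses only the Python standard library, not the OCR-D API. The function name `duplicate_page_ids` and the `SAMPLE` document are made up for illustration; the sample just wraps the shortened structMap from above in a `mets:mets` root element.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard METS namespace, bound to the "mets" prefix for the path query below.
METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical sample: the shortened structMap from this report, wrapped in a root element.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
        <mets:fptr FILEID="OCR-D-GT-SEG-LINE_1879_45_0344"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def duplicate_page_ids(mets_xml: str) -> dict:
    """Return {page ID: count} for every page div ID that occurs more than once."""
    root = ET.fromstring(mets_xml)
    pages = root.findall(
        ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']", METS_NS
    )
    counts = Counter(div.get("ID") for div in pages)
    # Ignore divs without an ID attribute; report only IDs seen at least twice.
    return {page_id: n for page_id, n in counts.items() if page_id and n > 1}


if __name__ == "__main__":
    print(duplicate_page_ids(SAMPLE))
```

To run it against a real workspace, read the file contents first, e.g. `duplicate_page_ids(open("mets.xml", encoding="utf-8").read())`. An empty result means no page ID is repeated.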
(I’ll upload the full data in the comments)
This causes all kinds of breakage all over the place.
What I haven’t checked yet is whether this only breaks with ocrd_cis; maybe @bertsky can share his debugging efforts here. At first I had the impression that this also breaks with add, but since I was trying to reproduce a problem encountered by @stweil in OCR-D/quiver-benchmarks#22, it could have been ocrd_cis all along (that specific workflow uses it as the first step), and I could easily have confused something.
About this issue
- State: closed
- Created 4 months ago
- Comments: 15 (8 by maintainers)
Much simpler way to reproduce:
The culprit is the caching.
Alright, so it’s not ocrd-cis, but definitely a bug in core and a severe one!