core: page_validator.py produces wrong concatenated text
In get_text(), the TextEquiv with index=1 is used if it exists. The way I read the documentation of the index attribute in the PAGE schema, it should use the one with the lowest index:
"Used for sort order in case multiple TextEquivs are defined. The text content with the lowest index should be interpreted as the main text content."
The lowest possible value for index is 0, according to the schema.
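To make the reading above concrete, here is a minimal sketch of picking the TextEquiv with the lowest index. It is not the actual get_text() code from core: the helper name main_text, the namespace version, and the choice to sort TextEquivs without an index after the indexed ones are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE 2019 namespace; other schema versions use a different date.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}


def main_text(element):
    """Return the Unicode text of the TextEquiv with the lowest index.

    Hypothetical helper, not part of OCR-D core.
    """
    text_equivs = element.findall("pc:TextEquiv", NS)
    if not text_equivs:
        return ""

    def sort_key(text_equiv):
        index = text_equiv.get("index")
        # Assumption: a TextEquiv without an index sorts after all indexed ones.
        return (index is None, int(index) if index is not None else 0)

    # min() returns the first minimal item, so document order breaks ties.
    best = min(text_equivs, key=sort_key)
    return best.findtext("pc:Unicode", default="", namespaces=NS)


SAMPLE = f"""
<TextLine xmlns="{PAGE_NS}" id="l1">
  <TextEquiv index="1"><Unicode>secondary</Unicode></TextEquiv>
  <TextEquiv index="0"><Unicode>primary</Unicode></TextEquiv>
</TextLine>
""".strip()

line = ET.fromstring(SAMPLE)
print(main_text(line))
```

With this selection rule, the sample line yields the text of the index="0" entry rather than the index="1" entry, regardless of the order in which the TextEquivs appear in the file.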
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 24 (19 by maintainers)
Commits related to this issue
- fix page textequiv index fixes as discussed in #430 — committed to kba/ocrd-core by kba 4 years ago
I always make sure to run make install PIP_INSTALL="pip install -e" in core to make sure core has been installed “editable”.
It does say so [in our PAGE specs]:
I’m fairly certain I had a reason for that. Could that be the convention of Aletheia or TRANSKRIBUS?
Absolutely. In my current implementation in ocrd_calamari there could also be missing index values (due to unrelated reasons), which should be perfectly legal.