core: parse fails to validate result of to_xml
I get a regression with 1.0.0b11: the call to page_from_file fails in ocrd_models.ocrd_page_generateds.parse on a file previously generated by ocrd_models.ocrd_page.to_xml. (It complains in validate_ConfSimpleType that the value is a str instead of a number.)
This is what I did:
ocrd-asv-ann-evaluate -m $mets -I OCR-D-GT-SEG-LINE,OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP
where all the OCR file groups come from a previous recognize processor in a long chain that runs through fine. See here for what the processor does.
This is what happens:
16:05:16.373 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0001
16:05:16.375 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.378 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.381 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.383 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.385 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.387 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0002
16:05:16.389 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.391 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.393 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.396 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.399 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.401 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0003
16:05:16.402 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.405 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.407 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.410 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.412 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.415 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0004
16:05:16.417 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.419 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.422 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.424 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.427 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.430 INFO processor.EvaluateLines - processing page phys_0001
16:05:16.431 INFO processor.EvaluateLines - INPUT FILE for OCR-D-GT-SEG-LINE: OCR-D-GT-SEG-LINE_0001
16:05:16.465 INFO processor.EvaluateLines - INPUT FILE for OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP: OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001
Traceback (most recent call last):
File "/home/xbert/unsortiert/arbeit/heyer/tools/ocrd_tesserocr/env3/bin/ocrd-asv-ann-evaluate", line 11, in <module>
load_entry_point('ocrd-cor-asv-ann', 'console_scripts', 'ocrd-asv-ann-evaluate')()
File "click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "click/core.py", line 717, in main
rv = self.invoke(ctx)
File "click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/xbert/unsortiert/arbeit/heyer/ocr-d/cor-asv-ann/.gitworktree-master/ocrd_cor_asv_ann/wrapper/cli.py", line 16, in ocrd_cor_asv_ann_evaluate
return ocrd_cli_wrap_processor(EvaluateLines, *args, **kwargs)
File "ocrd/decorators.py", line 38, in ocrd_cli_wrap_processor
run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
File "ocrd/processor/base.py", line 65, in run_processor
processor.process()
File "/home/xbert/unsortiert/arbeit/heyer/ocr-d/cor-asv-ann/.gitworktree-master/ocrd_cor_asv_ann/wrapper/evaluate.py", line 71, in process
pcgts = page_from_file(self.workspace.download_file(input_file))
File "ocrd_modelfactory/__init__.py", line 71, in page_from_file
return parse(input_file.local_filename, silence=True)
File "ocrd_models/ocrd_page_generateds.py", line 11222, in parse
rootObj.build(rootNode)
File "ocrd_models/ocrd_page_generateds.py", line 1069, in build
self.buildChildren(child, node, nodeName_)
File "ocrd_models/ocrd_page_generateds.py", line 1084, in buildChildren
obj_.build(child_)
File "ocrd_models/ocrd_page_generateds.py", line 2406, in build
self.buildChildren(child, node, nodeName_)
File "ocrd_models/ocrd_page_generateds.py", line 2544, in buildChildren
obj_.build(child_)
File "ocrd_models/ocrd_page_generateds.py", line 11073, in build
self.buildChildren(child, node, nodeName_)
File "ocrd_models/ocrd_page_generateds.py", line 11155, in buildChildren
obj_.build(child_)
File "ocrd_models/ocrd_page_generateds.py", line 3057, in build
self.buildChildren(child, node, nodeName_)
File "ocrd_models/ocrd_page_generateds.py", line 3122, in buildChildren
obj_.build(child_)
File "ocrd_models/ocrd_page_generateds.py", line 3446, in build
self.buildChildren(child, node, nodeName_)
File "ocrd_models/ocrd_page_generateds.py", line 3499, in buildChildren
obj_.build(child_)
File "ocrd_models/ocrd_page_generateds.py", line 3776, in build
self.buildChildren(child, node, nodeName_)
File "ocrd_models/ocrd_page_generateds.py", line 3837, in buildChildren
obj_.build(child_)
File "ocrd_models/ocrd_page_generateds.py", line 4013, in build
self.buildAttributes(node, node.attrib, already_processed)
File "ocrd_models/ocrd_page_generateds.py", line 4030, in buildAttributes
self.validate_ConfSimpleType(self.conf) # validate type ConfSimpleType
File "ocrd_models/ocrd_page_generateds.py", line 3934, in validate_ConfSimpleType
if value < 0:
TypeError: '<' not supported between instances of 'str' and 'int'
The offending PAGE-XML is OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001.xml.gz. It validates fine under http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15.
About this issue
- State: closed
- Created 5 years ago
- Comments: 17 (8 by maintainers)
Commits related to this issue
- regenerate PAGE API with 2.30.11 instead of 2.33.1, fix #269 — committed to kba/ocrd-core by kba 5 years ago
- update generateDS PAGE API, #269 — committed to kba/ocrd-core by kba 4 years ago
- update generateDS PAGE API, #269 — committed to kba/ocrd-core by kba 4 years ago
- Revert "update generateDS PAGE API, #269" This reverts commit 3a0a3a8351124020bea127e9ff15e3ba63541f8f. Conflicts: tests/model/test_ocrd_page.py — committed to OCR-D/core by kba 4 years ago
The pertinent diff in the generated code:
There is no more casting to float in the current code. Hence values of all three types (str, int and float) are accepted and stored as-is, but only the third one is valid. Investigating at which version between 2.30.11 and 2.33.1 this changed and whether it can be re-enabled.

Sorry about that, will try to fix ASAP. I updated generateDS before regenerating the page API, maybe something changed about how the @conf attribute is parsed…
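To illustrate the missing cast discussed above, here is a minimal sketch, not the actual generateDS output. The function validate_conf is a hypothetical stand-in for the generated validate_ConfSimpleType, the upper bound of 1 is an assumption about the schema's range for ConfSimpleType, and the XML snippet is a simplified, namespace-free element made up for the example. It shows that an attribute read from XML is always a str, so comparing it to 0 raises the TypeError from the traceback, while casting to float first makes the range check work.

```python
import xml.etree.ElementTree as ET


def validate_conf(value):
    """Hypothetical stand-in for the generated validate_ConfSimpleType.

    Assumes the confidence value must lie in [0, 1].
    """
    if value < 0:
        raise ValueError("conf must be >= 0, got %r" % (value,))
    if value > 1:
        raise ValueError("conf must be <= 1, got %r" % (value,))
    return value


# Simplified element for illustration only (real PAGE-XML is namespaced).
node = ET.fromstring('<TextEquiv conf="0.87"/>')

# Attribute values coming out of an XML parser are always strings.
raw = node.get("conf")

# Without a cast, the range check compares str with int and fails,
# just like in the traceback above.
try:
    validate_conf(raw)
    uncast_error = None
except TypeError as err:
    uncast_error = err

# With the cast to float restored, validation succeeds.
conf = validate_conf(float(raw))
```

Whether the cast belongs in buildAttributes or in the validator itself is a design decision for the generator; the sketch only shows why some conversion has to happen before the comparison.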