core: parse fails to validate result of to_xml

I get a regression with 1.0.0b11: the call to page_from_file fails in ocrd_models.ocrd_page_generateds.parse on a file previously generated by ocrd_models.ocrd_page.to_xml. (It complains in validate_ConfSimpleType that the value is a str instead of a number.)

This is what I did:

ocrd-asv-ann-evaluate -m $mets -I OCR-D-GT-SEG-LINE,OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP

where all the OCR file groups come from a previous recognition processor in a long chain that runs through without errors. See here for what the processor does.

This is what happens:

16:05:16.373 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0001
16:05:16.375 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.378 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.381 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.383 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.385 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.387 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0002
16:05:16.389 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.391 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.393 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.396 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.399 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.401 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0003
16:05:16.402 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.405 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.407 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.410 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.412 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.415 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0004
16:05:16.417 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.419 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.422 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.424 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.427 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.430 INFO processor.EvaluateLines - processing page phys_0001
16:05:16.431 INFO processor.EvaluateLines - INPUT FILE for OCR-D-GT-SEG-LINE: OCR-D-GT-SEG-LINE_0001
16:05:16.465 INFO processor.EvaluateLines - INPUT FILE for OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP: OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001
Traceback (most recent call last):
  File "/home/xbert/unsortiert/arbeit/heyer/tools/ocrd_tesserocr/env3/bin/ocrd-asv-ann-evaluate", line 11, in <module>
    load_entry_point('ocrd-cor-asv-ann', 'console_scripts', 'ocrd-asv-ann-evaluate')()
  File "click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/xbert/unsortiert/arbeit/heyer/ocr-d/cor-asv-ann/.gitworktree-master/ocrd_cor_asv_ann/wrapper/cli.py", line 16, in ocrd_cor_asv_ann_evaluate
    return ocrd_cli_wrap_processor(EvaluateLines, *args, **kwargs)
  File "ocrd/decorators.py", line 38, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "ocrd/processor/base.py", line 65, in run_processor
    processor.process()
  File "/home/xbert/unsortiert/arbeit/heyer/ocr-d/cor-asv-ann/.gitworktree-master/ocrd_cor_asv_ann/wrapper/evaluate.py", line 71, in process
    pcgts = page_from_file(self.workspace.download_file(input_file))
  File "ocrd_modelfactory/__init__.py", line 71, in page_from_file
    return parse(input_file.local_filename, silence=True)
  File "ocrd_models/ocrd_page_generateds.py", line 11222, in parse
    rootObj.build(rootNode)
  File "ocrd_models/ocrd_page_generateds.py", line 1069, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 1084, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 2406, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 2544, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 11073, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 11155, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 3057, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 3122, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 3446, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 3499, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 3776, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 3837, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 4013, in build
    self.buildAttributes(node, node.attrib, already_processed)
  File "ocrd_models/ocrd_page_generateds.py", line 4030, in buildAttributes
    self.validate_ConfSimpleType(self.conf)    # validate type ConfSimpleType
  File "ocrd_models/ocrd_page_generateds.py", line 3934, in validate_ConfSimpleType
    if value < 0:
TypeError: '<' not supported between instances of 'str' and 'int'

The offending PAGE-XML file is OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001.xml.gz. It validates fine against http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15.
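The TypeError at the bottom of the traceback can probably be reproduced without OCR-D at all. The following is only a minimal sketch: the one-element document and the validate_conf helper are hypothetical stand-ins for the real PAGE-XML file and for validate_ConfSimpleType, and the upper bound of 1 is an assumption about the schema's range for conf.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document; only the conf attribute matters here.
node = ET.fromstring('<TextEquiv conf="0.87"/>')
value = node.attrib["conf"]  # XML attribute values always arrive as str


def validate_conf(value):
    """Stand-in for the range check in validate_ConfSimpleType (assumed 0..1)."""
    if value < 0:
        raise ValueError("conf must be >= 0")
    if value > 1:
        raise ValueError("conf must be <= 1")
    return value


# Without a cast, the comparison itself fails, as in the traceback.
try:
    validate_conf(value)
    error = None
except TypeError as exc:
    error = str(exc)

# With the cast that the older generated code performed, the check passes.
converted = validate_conf(float(value))
```

So the file itself can be schema-valid while the parser still fails, because the range check runs on the raw attribute string.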

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

The pertinent diff in the generated code:

-            try:
-                self.conf = float(value)
-            except ValueError as exp:
-                raise ValueError('Bad float/double attribute (conf): %s' % exp)
+            self.conf = value
+            self.validate_ConfSimpleType(self.conf)    # validate type ConfSimpleType

There is no more casting to float in the current code. Hence all of

set_conf("1")
set_conf(int(1))
set_conf(1.0)

are accepted and stored as-is (as str, int and float, respectively), but only the third one is valid. I am investigating in which generateDS version between 2.30.11 and 2.33.1 this changed and whether the cast can be re-enabled.
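To illustrate the three cases, here is a rough sketch. TextEquivSketch is a hypothetical stand-in for a generated class, not the real ocrd_models API, and set_conf_checked is a made-up caller-side helper showing one possible workaround until the cast is back in the generated code. The 0..1 range is an assumption about ConfSimpleType.

```python
class TextEquivSketch:
    """Hypothetical stand-in for a generated class; not the real API."""

    def __init__(self):
        self.conf = None

    def set_conf(self, conf):
        # As described above: no cast, the value is stored as-is.
        self.conf = conf


def set_conf_checked(obj, conf):
    """Hypothetical helper: coerce to float and range-check before storing."""
    value = float(conf)  # raises ValueError for non-numeric strings
    if not 0.0 <= value <= 1.0:  # assumed range of ConfSimpleType
        raise ValueError("conf out of range: %r" % value)
    obj.set_conf(value)


# The three calls from above each keep their original type.
stored_types = []
for candidate in ("1", int(1), 1.0):
    obj = TextEquivSketch()
    obj.set_conf(candidate)
    stored_types.append(type(obj.conf).__name__)

# With the helper, a string input ends up as a float.
checked = TextEquivSketch()
set_conf_checked(checked, "1")
```

Coercing on the caller side only papers over the problem for code you control; files written by other processors would still need the cast in the parser.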

Sorry about that, will try to fix ASAP. I updated generateDS before regenerating the page API, maybe something changed about how the @conf attribute is parsed…