hydroshare: invalid XML characters present in abstracts

@martinseul @aphelionz @dtarb When running check_bag on all resources, bag creation crashed for the resource below because its abstract contains characters that are invalid in XML, so resourcemetadata.xml could not be written and the bag for this resource cannot be created. This can only happen if the UI and/or the REST API allowed the resource abstract to contain invalid characters. This was tested on cuahsi-dev-2.hydroshare.org using an image of the production system.

bag bags/1a7dc5d6b9fa4253bca442341eef500d.zip NOT FOUND
metadata_dirty is True
bag_modified is True
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/hydroshare/hs_core/management/commands/check_bag.py", line 194, in handle
    check_bag(r.short_id, options)
  File "/hydroshare/hs_core/management/commands/check_bag.py", line 48, in check_bag
    create_bag_files(resource)
  File "/hydroshare/hs_core/hydroshare/hs_bagit.py", line 92, in create_bag_files
    out.write(resource.get_metadata_xml())
  File "/hydroshare/hs_core/models.py", line 1949, in get_metadata_xml
    include_format_elements=include_format_elements)
  File "/hydroshare/hs_core/models.py", line 3935, in get_xml
    dcterms_abstract.text = self.description.abstract
  File "src/lxml/lxml.etree.pyx", line 1031, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:51697)
  File "src/lxml/apihelpers.pxi", line 711, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:23071)
  File "src/lxml/apihelpers.pxi", line 699, in lxml.etree._createTextNode (src/lxml/lxml.etree.c:22931)
  File "src/lxml/apihelpers.pxi", line 1439, in lxml.etree._utf8 (src/lxml/lxml.etree.c:30219)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

I have modified check_bag to watch for this error and am running it again on all resources to check the extent of the bug.
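For anyone wanting to reproduce the check outside of check_bag, here is a minimal sketch of how an abstract could be screened before it reaches lxml. It assumes the XML 1.0 "Char" production is the right definition of "valid", and the names INVALID_XML_CHARS, has_invalid_xml_chars and strip_invalid_xml_chars are hypothetical helpers, not existing HydroShare code. It is written for Python 3; the traceback above is from Python 2.7, where the pattern would need a unicode literal.

```python
import re

# Hypothetical helper: matches any code point outside the XML 1.0 "Char"
# production (tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF).
INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def has_invalid_xml_chars(text):
    """Return True if text contains a character XML 1.0 does not allow."""
    return INVALID_XML_CHARS.search(text) is not None


def strip_invalid_xml_chars(text):
    """Return text with every XML-incompatible character removed."""
    return INVALID_XML_CHARS.sub("", text)
```

Whether to reject such input at the UI/REST layer or strip it silently is a separate decision; the sketch only shows how the offending characters could be detected.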

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

@horsburgh @dtarb @aphelionz @martinseul Try http://cuahsi-dev-2.hydroshare.org now.

  • Multiple newline sequences are replaced with ¶ (Unicode paragraph symbol)
  • Isolated newlines (e.g., a line break) are replaced with ⏎ (Unicode symbol for carriage return)
  • These are padded with spaces.
  • Either one or the other appears. There are no combinations.

This was in the spirit of the discussion on the call yesterday.

  • The exact abstract can be reproduced from the depiction by converting ⏎ back to \n and ¶ back to \n\n.
  • This is consistent with what the UI does: \n\n becomes a paragraph break and \n becomes <br>.
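The mapping described above can be sketched roughly as follows. This is not the code deployed on cuahsi-dev-2; depict_newlines and restore_newlines are hypothetical names. The sketch assumes line endings are plain \n, that the abstract does not already contain ⏎ or ¶, and that a run of two or more newlines comes back as exactly \n\n.

```python
import re

PARAGRAPH = "\u00b6"   # the paragraph symbol, shown for runs of newlines
LINE_BREAK = "\u23ce"  # the return symbol, shown for an isolated newline


def depict_newlines(abstract):
    """Replace newlines with space-padded symbols for single-line display."""
    # Runs of two or more newlines become one paragraph symbol.
    text = re.sub(r"\n{2,}", " " + PARAGRAPH + " ", abstract)
    # Any newline still left is isolated and becomes a return symbol.
    return text.replace("\n", " " + LINE_BREAK + " ")


def restore_newlines(depiction):
    """Invert depict_newlines under the assumptions stated above."""
    text = depiction.replace(" " + PARAGRAPH + " ", "\n\n")
    return text.replace(" " + LINE_BREAK + " ", "\n")
```

Because the paragraph pass runs first, a symbol pair such as "¶ ⏎" never appears, which matches the "either one or the other" rule.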