html5lib-python: lxml doesn’t like control characters
Same issue as #33, but with other non-whitespace C0 control characters: U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.
Each of these triggers the exception below:
html5lib.parse('<p>', treebuilder='lxml')
html5lib.parse('<p>\x01', treebuilder='lxml')
html5lib.parse('<p id="">', treebuilder='lxml')
html5lib.parse('<p id="\x01">', treebuilder='lxml')
Traceback (most recent call last):
  File "/tmp/a.py", line 4, in <module>
    html5lib.parse('<p>', treebuilder='lxml')
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 28, in parse
    return p.parse(doc, encoding=encoding)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 224, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 183, in mainLoop
    new_token = phase.processCharacters(new_token)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 991, in processCharacters
    self.tree.insertText(token["data"])
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/_base.py", line 320, in insertText
    parent.insertText(data)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree_lxml.py", line 240, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree.py", line 108, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 921, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:41467)
  File "apihelpers.pxi", line 652, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18888)
  File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24701)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
U+000C in text (but not in attribute values) is replaced by U+0020 with a warning:
DataLossWarning: Text cannot contain U+000C
libxml2’s HTML parser replaces them with nothing, which I slightly prefer. Anyway, this is probably what should happen for every character that lxml doesn’t like.
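The two policies contrasted here (substituting another character versus dropping the character entirely) can be sketched in plain Python. This is only an illustration of the policies, not html5lib's or libxml2's actual code; the names `C0_CONTROLS`, `drop_controls` and `replace_controls` are made up for the example.

```python
import re

# The characters listed in this issue: U+0001-U+0008, U+000B, U+000C,
# U+000E-U+001F. Tab (U+0009), LF (U+000A) and CR (U+000D) are whitespace
# and are deliberately left alone.
C0_CONTROLS = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f]")


def drop_controls(text):
    """Remove the characters, as libxml2's HTML parser is described as doing."""
    return C0_CONTROLS.sub("", text)


def replace_controls(text, replacement=" "):
    """Substitute each character, e.g. with U+0020 as is done for U+000C in text."""
    return C0_CONTROLS.sub(replacement, text)


print(repr(drop_controls("a\x01b\x0cc\td")))
print(repr(replace_controls("a\x0cb")))
```

Passing `replacement="\ufffd"` gives the U+FFFD behaviour named in the related commit messages.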
About this issue
- State: open
- Created 11 years ago
- Comments: 25 (15 by maintainers)
Commits related to this issue
- Replace invalid characters with U+FFFD (fixes #96) — committed to lastorset/html5lib-python by deleted user 10 years ago
@lpla Ours has evolved as we slowly corrected errors while parsing erroneously encoded text in hundreds of thousands of HTML e-mails. This is the current version we are using, compatible with both Python 2 (narrow and wide builds) and Python 3, and with type hints:
Here’s a workaround for anyone who needs to get things working before this bug is fixed. Just run this code over the HTML before sending it to html5lib:
I hereby release it as public domain.
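The snippets these comments refer to are not reproduced in this copy of the thread. As a stand-in, here is a minimal sketch of a pre-processing workaround of that kind. It is not the commenters' code. It assumes Python 3, `str` input, and that numeric character references such as `&#1;` decode to the same offending characters and so need the same treatment. The names `sanitize_html`, `_BAD`, `_TABLE`, `_NUMERIC_REF` and `_fix_ref` are hypothetical.

```python
import re

# Code points from this issue: U+0001-U+0008, U+000B, U+000C, U+000E-U+001F.
_BAD = set(range(0x01, 0x09)) | {0x0B, 0x0C} | set(range(0x0E, 0x20))

# Translation table for literal characters: each one becomes U+FFFD.
_TABLE = {cp: "\ufffd" for cp in _BAD}

# Numeric character references, decimal or hex; the semicolon is optional.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));?")


def _fix_ref(match):
    """Replace a reference only if it names one of the offending code points."""
    hex_digits, dec_digits = match.groups()
    value = int(hex_digits, 16) if hex_digits else int(dec_digits)
    return "\ufffd" if value in _BAD else match.group(0)


def sanitize_html(html):
    """Return html with offending characters and references replaced by U+FFFD."""
    return _NUMERIC_REF.sub(_fix_ref, html.translate(_TABLE))


cleaned = sanitize_html('<p id="\x01">a&#1;b&#x0C;c&#65;')
print(repr(cleaned))
# The cleaned string would then be passed on, e.g.:
# html5lib.parse(cleaned, treebuilder='lxml')
```

Because this runs over the raw markup, it also rewrites references inside `<script>` elements and comments, where they would not have been decoded anyway. That is usually harmless but worth knowing.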