etherpad-lite: PDF export using AbiWord fails due to bad intermediate HTML

Describe the bug

Exporting to PDF using AbiWord / AbiCommand fails because the intermediate HTML is slightly invalid and AbiWord is stupidly picky.

$ abiword --plugin AbiCommand
Unable to init server: Could not connect: Connection refused

** (abiword:1325175): WARNING **: 18:22:41.983: clutter failed 0, get a life.
Unable to init server: Could not connect: Connection refused
AbiWord command line plugin: Type "quit" to exit
AbiWord:> convert /tmp/etherpad_export_2796631100.html /tmp/etherpad_export_2796631100.pdf pdf
AbiWord: could not open the file [/tmp/etherpad_export_2796631100.html]
error -1
AbiWord:> quit

$ tidy -xml /tmp/etherpad_export_2796631100.html
line 43 column 1 - Error: unexpected </head> in <meta>
line 125 column 43 - Warning: unescaped & which should be written as &amp;
line 127 column 16 - Warning: unescaped & which should be written as &amp;
line 253 column 6 - Warning: unescaped & which should be written as &amp;
line 374 column 1 - Error: unexpected </body> in <br>
line 375 column 1 - Error: unexpected </html> in <br>
Tidy found 3 warnings and 3 errors!

If I edit the HTML to replace <meta ....> with <meta .../>, "& " with "&amp; ", and <br> with <br/>, then AbiWord no longer has issues.
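The three manual edits described above can be sketched in a few lines of Python. This is only an illustration, not Etherpad's code: the names `make_xml_friendly` and `is_well_formed` are hypothetical, the regular expressions handle only the simple cases from the report, and it assumes (as the `tidy -xml` output suggests) that AbiWord wants markup a strict XML parser would accept.

```python
import re
import xml.etree.ElementTree as ET

# Void elements to self-close; the report only mentions <meta> and <br>.
# Assumption: attribute values do not contain '<' or '>'.
VOID_TAG = re.compile(r"<(meta|br)\b([^<>]*?)\s*/?>", re.IGNORECASE)

# An ampersand that is not followed by something shaped like a
# character reference (for example &amp; or &#38;).
BARE_AMP = re.compile(r"&(?!#?\w+;)")


def make_xml_friendly(html: str) -> str:
    """Apply the three manual edits from the report (hypothetical helper)."""
    html = VOID_TAG.sub(lambda m: f"<{m.group(1)}{m.group(2)}/>", html)
    return BARE_AMP.sub("&amp;", html)


def is_well_formed(markup: str) -> bool:
    """Return True if Python's strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


sample = (
    '<html><head><meta charset="utf-8"></head>'
    "<body>Tom & Jerry<br>done</body></html>"
)
fixed = make_xml_friendly(sample)
print(is_well_formed(sample), is_well_formed(fixed))
```

The XML check stands in for AbiWord's importer here; whether AbiWord accepts exactly the same inputs is not something this sketch can confirm.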

This is basically the same issue as #3732. I attached a sample intermediate HTML file to that issue.

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

So… tidy was first written in 1994. What is considered valid HTML has changed a little bit in the last 30 years. (I’m old enough to remember.) It used to be considered normal, or at least okay, to leave ampersands unescaped if they weren’t part of a valid character reference.

I just read the HTML spec on this issue (https://html.spec.whatwg.org/#character-reference-state) and, honestly, I’m still not sure whether a “raw ampersand” that’s not part of a character reference is technically valid HTML. Various online sources say no, but my read is that scanning an ampersand should push the parser into the “character reference state”, and then scanning anything other than an ASCII alphanumeric or ‘#’ should reconsume in the “return state”, which I think means that it should work…? But the technicality doesn’t matter: it’s controversial enough that it shouldn’t be done by tidy (IMHO), and it’s breaking PDF export in Etherpad.
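As a rough illustration of the lenient behaviour described above, here is a minimal sketch using Python's standard-library `html.parser`. It is not tidy, not a browser, and not a full implementation of the WHATWG tokenizer, so it only shows how one tolerant parser treats a raw ampersand. The class `TextCollector` and the function `parsed_text` are hypothetical names for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text nodes reported by Python's HTML parser."""

    def __init__(self) -> None:
        # convert_charrefs=True decodes character references in text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def parsed_text(markup: str) -> str:
    """Return the concatenated text content of the markup."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# A raw ampersand and an escaped one yield the same text here.
print(parsed_text("<p>Tom & Jerry</p>"))
print(parsed_text("<p>Tom &amp; Jerry</p>"))
```

Both calls produce the same text in this parser, which matches the reading that a raw ampersand followed by a space is simply passed through as a literal character.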

There is an open issue in tidy, BTW: htacg/tidy-html5#1079

Thanks for the research. I checked the code, and it seems that tidy only runs for AbiWord and SOffice, where it causes more problems than needed. So I don’t really see a reason why we should keep tidy as is.