Lexos: Remove All Punctuation on the scrub page causes problems with tags

Which options did I check

Remove All Punctuation and Remove Tags (the latter should probably be called "Handle Tags")

Which file did I use

Gen AB

What the bug looks like

Remove All Punctuation also removes punctuation inside tags, which damages the following information in the tag:

  • the closing tag (</a> turns into <a> after scrubbing)
  • the attributes in the tag (<a name='test'> turns into <a nametest> after scrubbing)
  • even the tag name itself (<tei.2> turns into <tei> after scrubbing)
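
One possible way to avoid the damage listed above is to strip punctuation only from the text between tags and copy each tag through unchanged. The following is a minimal sketch under stated assumptions, not the Lexos implementation: the helper name remove_punctuation_outside_tags is hypothetical, a "tag" is assumed to be anything between "<" and the next ">", and "punctuation" is assumed to mean Python's string.punctuation.

```python
import re
import string

# Assumption: a tag is "<", then any characters except "<" or ">", then ">".
# This is a simplification; attribute values that contain ">" would not match.
TAG_PATTERN = re.compile(r"<[^<>]*>")

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation_outside_tags(text: str) -> str:
    """Hypothetical helper: strip punctuation from text but leave tags intact."""
    pieces = []
    last_end = 0
    for match in TAG_PATTERN.finditer(text):
        # Scrub the plain text that precedes this tag.
        pieces.append(text[last_end:match.start()].translate(PUNCT_TABLE))
        # Copy the tag itself through without any changes.
        pieces.append(match.group(0))
        last_end = match.end()
    # Scrub whatever plain text follows the last tag.
    pieces.append(text[last_end:].translate(PUNCT_TABLE))
    return "".join(pieces)


if __name__ == "__main__":
    sample = "<a name='test'>Hello, world!</a>"
    print(remove_punctuation_outside_tags(sample))
```

With this approach the closing slash, the attribute quotes, and a tag name such as tei.2 all survive, because the scrubbing step never sees the characters inside a tag. A stray "<" or ">" that is not part of a matched pair is still treated as ordinary punctuation and removed.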

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 22 (22 by maintainers)

Most upvoted comments

I improved the function to be 10 times faster than the old version! It now handles GenAB in 0.06 sec!

Cheers!

But there may be bugs = =, so I will test this more thoroughly tonight.