Lexos: Remove All Punctuation on scrub page causing problems related to tags
Please give us some of the following information; it will be really helpful for tracking the bug:
- if you are proposing an enhancement, please delete all of these
which option did I check
Remove All Punctuation and Remove Tags (probably should be called handle tag)
which file did I use
Gen AB
what the bug looks like
Remove All Punctuation removes punctuation inside tags, and therefore damages the following information in the tag:
- the closing tag (</a> turns into <a> after scrub)
- the attribute in the tag (<a name='test'> turns into <a nametest> after scrub)
- even the tag name itself (<tei.2> turns into <tei> after scrub)
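One way to avoid the damage listed above is to strip punctuation only from the text between tags. The following is a minimal sketch of that idea, not the actual Lexos fix: the function name `remove_punctuation_outside_tags`, the `TAG_PATTERN` regex, and the restriction to ASCII punctuation are all assumptions made for illustration.

```python
import re
import string

# Assumption: anything between "<" and ">" counts as a tag.
# The capturing group makes re.split keep the tags in its result.
TAG_PATTERN = re.compile(r"(<[^<>]*>)")

# Translation table that deletes ASCII punctuation only (an assumption;
# a real scrubber may also need to handle Unicode punctuation).
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation_outside_tags(text: str) -> str:
    """Strip ASCII punctuation from text while leaving tags untouched."""
    # After splitting, even indices hold the text between tags and
    # odd indices hold the tags themselves.
    parts = TAG_PATTERN.split(text)
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].translate(PUNCTUATION_TABLE)
    return "".join(parts)


if __name__ == "__main__":
    sample = "<tei.2><a name='test'>Hello, world!</a></tei.2>"
    print(remove_punctuation_outside_tags(sample))
```

With this approach the closing slash, the attribute quoting, and the dot in the tag name all survive, because the tag segments are never passed through the punctuation filter.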
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 22 (22 by maintainers)
Commits related to this issue
- fix the issue #393 for now, all the tags will not be harmed by remove punctuation and remove digits (but will turn into lower case, and that is okay to me.) this method is takes only 0.06 sec to scr... — committed to czhang03/Lexos by deleted user 8 years ago
- Merge pull request #394 from chantisnake/master fix the issue #393 — committed to WheatonCS/Lexos by deleted user 8 years ago
- improve the speed of scrub significantly, see my comment on #393 and keeps the case(don't turn lower) in xml tag, because xml is case sensitive — committed to czhang03/Lexos by deleted user 8 years ago
I improved the function to be 10 times faster than the old version! It now handles GenAB in 0.06 sec!
Cheers!
But there may be bugs = =, so I will test this more thoroughly tonight.