jodd: java.lang.IndexOutOfBoundsException for 27 pages - tested 862_674 pages.
Current behavior
I got java.lang.IndexOutOfBoundsException while parsing 27 pages!
I use:
<dependency>
    <groupId>org.jodd</groupId>
    <artifactId>jodd-lagarto</artifactId>
    <version>5.1.5</version>
</dependency>
Exception that I have:
java.lang.IndexOutOfBoundsException
Expected behavior
No exception.
Steps to Reproduce the Problem
My code:
@Override
public List<CharSequence> parse(CharSequence html) {
    LagartoParser lagartoParser = new LagartoParser(html.toString());
    LagartoParserConfig config = new LagartoParserConfig();
    config.setEnableConditionalComments(false); // <-----
    config.setEnableRawTextModes(false); // <-----
    lagartoParser.setConfig(config);
    TagVisitorImpl tagVisitor = new TagVisitorImpl();
    lagartoParser.parse(tagVisitor);
    return tagVisitor.getLinks();
}

class TagVisitorImpl implements TagVisitor {
    @Override
    public void tag(Tag tag) {
        href = tag.getAttributeValue("href");
        if (href != null) {
            // ...
        }
    }
    // ...
}
Pages that I parsed:
https://www.dropbox.com/sh/279el3cheql3esc/AAAc-btAJgyWUF89fTJ_1cTPa?dl=0
How did I find it? I found my old zip file with about 1 million downloaded pages (OK, it is 862_674) and parsed it. I will try to push it to my GitHub… the zip file is 13 GB, so it should be possible (?). I will let you know later. You already have all the failing pages, so you can start fixing it.
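For anyone who wants to repeat this kind of bulk check, the scan described above could be sketched roughly as follows. This is only a minimal sketch, not the code that was actually used: the class name `FailingPageFinder`, the method `findFailingPages`, and the `parser` callback are hypothetical, every entry is assumed to be UTF-8 HTML, and Java 9+ is assumed for `readAllBytes()`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: runs a parser over every page in a zip archive
// and records which entries make it throw.
final class FailingPageFinder {

    /**
     * Feeds each regular zip entry to `parser` as a UTF-8 string and returns
     * the names of the entries for which the parser threw a RuntimeException.
     */
    static List<String> findFailingPages(InputStream zipStream, Consumer<String> parser)
            throws IOException {
        List<String> failing = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // readAllBytes() stops at the end of the current entry.
                String html = new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                try {
                    parser.accept(html);
                } catch (RuntimeException e) {
                    // Includes IndexOutOfBoundsException; keep going with the next page.
                    failing.add(entry.getName());
                }
            }
        }
        return failing;
    }
}
```

The callback would wrap whatever parse call is under test, for example the `parse(CharSequence)` method shown earlier, so the archive is streamed once and never fully unpacked.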
About this issue
- State: closed
- Created 4 years ago
- Comments: 42 (34 by maintainers)
Commits related to this issue
- Add new test for #768 — committed to oblac/jodd by igr 4 years ago
Note: that’s the same <!---> pattern in all of the reported files.
I will migrate from GoDaddy on the next payment cycle, too much for me atm…
@neroux makes sense, https://lagarto.jodd.org it is.
I will try to add Reader as an input, and then I will release lagarto to 6.
For now, I will update this repo and current docs.
@igr I ran the test against all these pages with the newest snapshot version and there were no exceptions.
Congratulations, you’ve fixed the issue 😉
I want to upload all the HTML pages that I used in the test. Unfortunately, it looks like GitHub has a limit of 5 GB per repository, and I have 13 GB. I will upload them to S3, but I need to return from travel first, in around 1 week.
I downloaded all the HTML with the HTTP client set to UTF-8 encoding by default. In the specification I saw that finding the proper encoding is tricky. Can you recommend a library for it?
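On the encoding question, a rough first pass is possible with the JDK alone: check for a byte order mark, then look for a `charset=` declaration in a `<meta>` tag near the top of the page. The sketch below is only that, a simplified heuristic and not the full detection algorithm from the HTML specification; `CharsetSniffer`, `sniff`, and the 1024-byte window are assumptions made for illustration. Dedicated detectors such as ICU4J's `CharsetDetector` or juniversalchardet are probably worth a look for pages that declare nothing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: guesses a page's charset from a BOM or a <meta> declaration.
final class CharsetSniffer {

    // Matches charset=... inside a <meta> tag; covers both
    // <meta charset="..."> and the http-equiv/content form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-:.]+)",
            Pattern.CASE_INSENSITIVE);

    static Charset sniff(byte[] page, Charset fallback) {
        // 1. A byte order mark wins over anything declared in the markup.
        if (page.length >= 3
                && (page[0] & 0xFF) == 0xEF && (page[1] & 0xFF) == 0xBB && (page[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (page.length >= 2 && (page[0] & 0xFF) == 0xFE && (page[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (page.length >= 2 && (page[0] & 0xFF) == 0xFF && (page[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }

        // 2. Decode the first bytes as ISO-8859-1 (every byte maps to a char)
        //    and search them for a declared charset.
        String head = new String(page, 0, Math.min(page.length, 1024), StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the fallback.
            }
        }
        return fallback;
    }
}
```

The returned charset would then be used to decode the downloaded bytes, with UTF-8 passed as the fallback to match the current default.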
Here’s a simple reproducer for the first page:
Please note the comment block is not closed in <!---> as there are only 3 dashes (and not 4).
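To illustrate the dash count, here is a minimal sketch of a scanner that treats `<!--` as the opener and only accepts a `-->` that starts after it. This is an assumption about how such a scanner might behave, not Lagarto's actual implementation; `CommentScan` and `findCommentEnd` are hypothetical names.

```java
// Hypothetical helper showing why <!---> has no terminator when the
// closing "-->" may not share dashes with the opening "<!--".
final class CommentScan {

    /**
     * Returns the index just past the "-->" that closes the comment opened
     * at `start`, or -1 if no terminator follows the opener.
     */
    static int findCommentEnd(String html, int start) {
        if (!html.startsWith("<!--", start)) {
            throw new IllegalArgumentException("no comment opener at index " + start);
        }
        // Search only after the four opener characters.
        int close = html.indexOf("-->", start + 4);
        return close < 0 ? -1 : close + 3;
    }
}
```

With this rule, `findCommentEnd("<!--->text", 0)` finds no terminator because only one dash remains after the opener, while `findCommentEnd("<!---->text", 0)` does. Any code that goes on to index into the input using a "not found" result without checking it first can run past the end of the buffer.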