jodd: java.lang.IndexOutOfBoundsException for 27 pages - tested 862_674 pages.

Current behavior

I got java.lang.IndexOutOfBoundsException while parsing 27 pages!

I use:

    <dependency>
      <groupId>org.jodd</groupId>
      <artifactId>jodd-lagarto</artifactId>
      <version>5.1.5</version>   
    </dependency>

Exception that I have:

java.lang.IndexOutOfBoundsException

Expected behavior

No exception.

Steps to Reproduce the Problem

My code:

    @Override
    public List<CharSequence> parse(CharSequence html) {
        LagartoParser lagartoParser = new LagartoParser(html.toString());
        LagartoParserConfig config = new LagartoParserConfig();
        config.setEnableConditionalComments(false);  // <----- 
        config.setEnableRawTextModes(false);             // <----- 
        lagartoParser.setConfig(config);
        TagVisitorImpl tagVisitor = new TagVisitorImpl();
        lagartoParser.parse(tagVisitor);
        return tagVisitor.getLinks();
    }

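    class TagVisitorImpl implements TagVisitor {
        @Override
        public void tag(Tag tag) {
            // A null value means the tag has no href attribute
            CharSequence href = tag.getAttributeValue("href");
            if (href != null) {
                // ...
            }
        }
        // ... (remaining TagVisitor methods and getLinks() omitted)
    }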

Pages that I parsed:

https://www.dropbox.com/sh/279el3cheql3esc/AAAc-btAJgyWUF89fTJ_1cTPa?dl=0

How did I find it? I found an old zip file with about 1 million downloaded pages (OK, it is 862_674) and parsed all of them. I will try to push it to my GitHub; the zip file is 13 GB, so I am not sure that is possible. I will let you know later. In the meantime, you have all the failing pages, so you can start fixing it.
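
A bulk run like the one described can be sketched with the JDK alone. This is a minimal, hypothetical harness, not the code that was actually used: the `CorpusCheck` class and `failingEntries` method are made-up names, the archive is assumed to hold one HTML page per entry, and every page is decoded as UTF-8. The parser is passed in as a `Consumer<String>`, so the `parse` method shown above could be plugged in.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical harness: feeds every zip entry to a parser and records the failures. */
final class CorpusCheck {

    /** Returns "entryName: exceptionClass" for each entry on which the parser threw. */
    static List<String> failingEntries(InputStream zipStream, Consumer<String> parser)
            throws IOException {
        List<String> failures = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                String html = readEntry(zip); // assumption: pages are UTF-8
                try {
                    parser.accept(html);
                } catch (RuntimeException e) {
                    // Keep going so that one bad page does not hide the others
                    failures.add(entry.getName() + ": " + e.getClass().getName());
                }
            }
        }
        return failures;
    }

    /** Reads the current entry to its end without closing the zip stream. */
    private static String readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Catching `RuntimeException` per entry is a deliberate choice here: the goal is a complete list of failing pages, not a stop at the first one.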

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 42 (34 by maintainers)

Most upvoted comments

Note: that’s the same <!---> pattern in all of the reported files.

I will migrate from GoDaddy on the next payment cycle, too much for me atm…

@neroux makes sense, https://lagarto.jodd.org it is.

I will try to add Reader as an input, and then I will release Lagarto as version 6.

For now, I will update this repo and current docs.

@igr I ran the test against all these pages and there were no exceptions. I used the newest snapshot version:

    <repositories>
        <repository>
            <id>maven-snapshots</id>
            <url>http://oss.sonatype.org/content/repositories/snapshots</url>
            <layout>default</layout>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>
    <dependency>
        <groupId>org.jodd</groupId>
        <artifactId>jodd-lagarto</artifactId>
        <version>6.0.0.20200731131821-SNAPSHOT</version>
        <scope>test</scope>
    </dependency>

Congratulations, you’ve fixed the issue 😉

I want to upload all the HTML pages that I used in the test. Unfortunately, it looks like GitHub has a limit of 5 GB per repository, and I have 13 GB. I will upload them to S3, but I need to return from travel first, in about a week.

I downloaded all the HTML with the HTTP client set to UTF-8 encoding by default. In the specification I saw that finding the proper encoding is tricky. Can you recommend a library for this?
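
To show why this is tricky, here is a minimal sketch of the first steps of encoding sniffing, using only the JDK. It is a heavily simplified assumption of what a real detector does, not a library recommendation and not the full algorithm from the HTML Standard: it checks for a byte order mark, then looks for a `charset=` hint inside a `<meta>` tag in the first 1024 bytes, and otherwise falls back to UTF-8. The `CharsetSniffer` class and `sniff` method are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: guesses a page's charset from a BOM or a <meta> hint. */
final class CharsetSniffer {

    // Matches both <meta charset="..."> and the http-equiv form "...; charset=..."
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    static Charset sniff(byte[] page) {
        // 1. A byte order mark is the strongest signal
        if (startsWith(page, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(page, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(page, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        // 2. Prescan the head; ISO-8859-1 maps every byte to a char, so decoding never fails
        String head = new String(page, 0, Math.min(page.length, 1024),
                StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the default
            }
        }
        // 3. Assumed default when nothing else is known
        return StandardCharsets.UTF_8;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A real crawler would also consult the HTTP `Content-Type` header before looking at the bytes; that step is left out here.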

Here’s a simple reproducer for the first page:

    <html>
    <body>
    <!--->
    -->
    </body>
    </html>

Please note that the comment block is not closed by <!--->, as there are only 3 dashes (and not 4).
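
For comparison, the HTML Standard's tokenizer treats both `<!-->` and `<!--->` as abruptly closed empty comments, so a parser following it would read the `-->` on the next line as plain text. The sketch below shows one bounds-safe way to find the end of a comment under that rule. It is an illustration only, not Lagarto's implementation: `CommentScan` and `endOfComment` are hypothetical names, and the `--!>` closing form is deliberately left out.

```java
/** Hypothetical helper: finds where an HTML comment ends without reading past the input. */
final class CommentScan {

    /**
     * Given the index of a "<!--" opener, returns the index just past the comment.
     * Returns html.length() when the comment is never closed.
     */
    static int endOfComment(String html, int open) {
        if (!html.startsWith("<!--", open)) {
            throw new IllegalArgumentException("no comment opener at index " + open);
        }
        int body = open + 4;
        // startsWith(prefix, offset) returns false for out-of-range offsets,
        // so these checks stay safe even when the input ends right after "<!--"
        if (html.startsWith(">", body)) {
            return body + 1; // "<!-->": abruptly closed empty comment
        }
        if (html.startsWith("->", body)) {
            return body + 2; // "<!--->": abruptly closed empty comment
        }
        int close = html.indexOf("-->", body);
        return close < 0 ? html.length() : close + 3;
    }
}
```

Applied to the reproducer above, the scan stops right after `<!--->` instead of consuming the following line, and an unterminated comment simply runs to the end of the input.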