jsoup: Websites with large amounts of data fail to parse.

I am currently using jsoup on some large websites, and it throws a "Mark invalid" exception. Does that mean the bufref is negative?

I tried using both Jsoup.connect(url).get() and Jsoup.connect(url).execute().parse(). Both cause the same exception.

	at org.jsoup.parser.CharacterReader.rewindToMark(CharacterReader.java:132)
	at org.jsoup.parser.Tokeniser.consumeCharacterReference(Tokeniser.java:182)
	at org.jsoup.parser.TokeniserState.readCharRef(TokeniserState.java:1698)
	at org.jsoup.parser.TokeniserState.access$100(TokeniserState.java:8)
	at org.jsoup.parser.TokeniserState$2.read(TokeniserState.java:36)
	at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
	at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:55)
	at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:47)
	at org.jsoup.parser.Parser.parseInput(Parser.java:35)
	at org.jsoup.helper.DataUtil.parseInputStream(DataUtil.java:169)
	at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:835)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:285)

If anybody would like to reproduce this, here are some URLs that fail to parse:
https://www.spec.org/cpu2006/results/res2014q4/
https://www.spec.org/cpu2006/results/res2012q3/
https://www.spec.org/cpu2006/results/res2014q1/
https://www.spec.org/cpu2006/results/res2014q3/
https://www.spec.org/cpu2006/results/res2011q2/
https://www.spec.org/cpu2006/results/res2010q3/
https://www.spec.org/cpu2006/results/res2017q2/
https://www.spec.org/cpu2006/results/res2016q3/
https://www.spec.org/cpu2006/results/res2015q4/
https://www.spec.org/cpu2006/results/res2007q4/
https://www.spec.org/cpu2006/results/res2009q4/
https://www.spec.org/cpu2006/results/res2012q2/
https://www.spec.org/cpu2006/results/res2014q2/
https://www.spec.org/cpu2006/results/res2012q4/
https://www.spec.org/cpu2006/results/res2011q1/


Thanks

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 18 (13 by maintainers)

Most upvoted comments

@btheu and others watching, jsoup 1.12.2 is available now. https://jsoup.org/news/release-1.12.2

Same here, version 1.12.1.

Response execute = Jsoup.connect("https://www.spec.org/cpu2006/results/res2014q4/").execute();
Jsoup.parse(execute.body());

works every time, but Jsoup.connect("https://www.spec.org/cpu2006/results/res2014q4/").get(); and Jsoup.connect("https://www.spec.org/cpu2006/results/res2014q4/").execute().parse(); fail every time with:

Exception in thread "main" java.io.IOException: Mark invalid
	at org.jsoup.parser.CharacterReader.rewindToMark(CharacterReader.java:132)
	at org.jsoup.parser.Tokeniser.consumeCharacterReference(Tokeniser.java:182)
	at org.jsoup.parser.TokeniserState.readCharRef(TokeniserState.java:1698)
	at org.jsoup.parser.TokeniserState.access$100(TokeniserState.java:8)
	at org.jsoup.parser.TokeniserState$2.read(TokeniserState.java:36)
	at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
	at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:55)
	at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:47)
	at org.jsoup.parser.Parser.parseInput(Parser.java:35)
	at org.jsoup.helper.DataUtil.parseInputStream(DataUtil.java:169)
	at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:835)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:285)
	at JsoupIssue1218.main(JsoupIssue1218.java:7)
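
For readers unfamiliar with the "Mark invalid" message: the sketch below reproduces the same message using only the JDK's BufferedReader, whose reset() fails once more input has been buffered past a mark than the mark's read-ahead limit allows. This is only an analogy for the general class of failure (a rewind to a mark that buffering has discarded). It does not use jsoup, and it is an assumption, not a confirmed diagnosis, that jsoup's CharacterReader fails for a comparable reason. The class name, method name, and input string are made up for the illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class MarkInvalidDemo {
    // Marks the start of the input, reads charsToRead characters, then tries to rewind.
    // Returns "reset ok" on success, or the IOException message on failure.
    static String resetAfterReading(int bufferSize, int readAheadLimit, int charsToRead) {
        String input = "abcdefghijklmnopqrstuvwxyz"; // hypothetical sample data
        try (BufferedReader reader = new BufferedReader(new StringReader(input), bufferSize)) {
            reader.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                reader.read();
            }
            reader.reset(); // throws if the mark was discarded while refilling the buffer
            return "reset ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Stays within the read-ahead limit, so the rewind succeeds.
        System.out.println(resetAfterReading(4, 10, 3));
        // Reads well past both the limit and the buffer, so the mark is lost.
        System.out.println(resetAfterReading(4, 2, 10));
    }
}
```

The second call reads far enough that the reader has to refill its buffer after the read-ahead limit is exceeded, which is the point at which the mark is dropped. Small inputs never hit that refill, which is consistent with the failure showing up only on large pages.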

Opened #1242 for another stab at fixing this issue. It’s built upon the branch I mentioned in my previous comment, the one that adds unit tests demonstrating some other edge case failures.