jsoup: Websites with large amounts of data fail to parse.
I'm currently using jsoup on some large websites, and it throws a "Mark invalid" exception. Does that mean the bufref is negative?
I tried both Jsoup.connect(url).get() and Jsoup.connect(url).execute().parse(). Both cause the same exception.
```
at org.jsoup.parser.CharacterReader.rewindToMark(CharacterReader.java:132)
at org.jsoup.parser.Tokeniser.consumeCharacterReference(Tokeniser.java:182)
at org.jsoup.parser.TokeniserState.readCharRef(TokeniserState.java:1698)
at org.jsoup.parser.TokeniserState.access$100(TokeniserState.java:8)
at org.jsoup.parser.TokeniserState$2.read(TokeniserState.java:36)
at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:55)
at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:47)
at org.jsoup.parser.Parser.parseInput(Parser.java:35)
at org.jsoup.helper.DataUtil.parseInputStream(DataUtil.java:169)
at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:835)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:285)
```
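For anyone wondering what a "Mark invalid" error generally means: the sketch below is only an analogy built on the JDK's java.io.BufferedReader, not jsoup's CharacterReader, whose internals are not shown in this thread. It marks a position, reads ahead, and then tries to rewind. The class name MarkInvalidDemo, the helper canRewind, and the buffer sizes are all made up for illustration. The assumption is that a buffered reader may drop its mark once it has to refill its buffer past the read-ahead limit, which could be similar to what happens on very large pages.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;

class MarkInvalidDemo {

    /**
     * Marks the start of the input, reads up to charsToRead characters,
     * then tries to rewind to the mark. Returns true if the rewind worked,
     * false if the reader threw an IOException (for example because the
     * mark was dropped while reading ahead).
     */
    static boolean canRewind(String input, int bufferSize, int readAheadLimit, int charsToRead) {
        try (BufferedReader reader = new BufferedReader(new StringReader(input), bufferSize)) {
            reader.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                if (reader.read() == -1) {
                    break; // end of input
                }
            }
            reader.reset(); // throws IOException if the mark is no longer valid
            return true;
        } catch (IOException e) {
            System.out.println("reset failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical input: 64 filler characters, standing in for a large page.
        char[] filler = new char[64];
        Arrays.fill(filler, 'x');
        String input = new String(filler);

        // Small buffer (8 chars) and small read-ahead limit (4 chars), for illustration only.
        System.out.println(canRewind(input, 8, 4, 2));  // stays within the limit
        System.out.println(canRewind(input, 8, 4, 20)); // reads well past the limit and the buffer
    }
}
```

In this sketch the first call rewinds successfully and the second does not, because the reader had to refill its buffer after the read-ahead limit was exceeded. Whether jsoup hits exactly the same condition is for the maintainers to confirm.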
If anybody would like to reproduce this, here are some URLs that fail to parse:
https://www.spec.org/cpu2006/results/res2014q4/
https://www.spec.org/cpu2006/results/res2012q3/
https://www.spec.org/cpu2006/results/res2014q1/
https://www.spec.org/cpu2006/results/res2014q3/
https://www.spec.org/cpu2006/results/res2011q2/
https://www.spec.org/cpu2006/results/res2010q3/
https://www.spec.org/cpu2006/results/res2017q2/
https://www.spec.org/cpu2006/results/res2016q3/
https://www.spec.org/cpu2006/results/res2015q4/
https://www.spec.org/cpu2006/results/res2007q4/
https://www.spec.org/cpu2006/results/res2009q4/
https://www.spec.org/cpu2006/results/res2012q2/
https://www.spec.org/cpu2006/results/res2014q2/
https://www.spec.org/cpu2006/results/res2012q4/
https://www.spec.org/cpu2006/results/res2011q1/
Thanks
About this issue
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 18 (13 by maintainers)
Commits related to this issue
- Fix Jsoup 'invalid mark issue' (https://github.com/jhy/jsoup/issues/1218) — committed to spypunk/sponge by spypunk 5 years ago
- Change how metals sonatype page is parsed: using Jsoup.connect and then Jsoup.parse to avoid issue https://github.com/jhy/jsoup/issues/1218 — committed to JesusMtnez/dotfiles by JesusMtnez 4 years ago
- Update KbServiceImpl.java: change the URL call method because of a jsoup issue (https://github.com/jhy/jsoup/issues/1218) — committed to LucestDail/srtest by LucestDail 2 years ago
@btheu and others watching, jsoup 1.12.2 is available now. https://jsoup.org/news/release-1.12.2
Same here, version 1.12.1.
works every time, but
Jsoup.connect("https://www.spec.org/cpu2006/results/res2014q4/").get(); and Jsoup.connect("https://www.spec.org/cpu2006/results/res2014q4/").execute().parse(); fail every time with:

Opened #1242 for another stab at fixing this issue. It’s built upon the branch I mentioned in my previous comment, the one that adds unit tests demonstrating some other edge case failures.