html-agility-pack: HtmlNode.InnerText Not working properly (JS get Working normally) maybe a bug?
Description
The HTML source: https://auto.qq.com/a/20120202/000205.htm
You can test with this code:
var source = File.ReadAllText(@"D:\Tmp\auto.qq.com-001.html", Encoding.UTF8);
HtmlAgilityPack.HtmlDocument htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml(source);
var content = htmlDocument.DocumentNode.SelectSingleNode("//div[contains(@class, 'bd')]");
var InnerText = content.InnerText;
Reading InnerText raises an exception in HtmlAgilityPack.HtmlNode.InnerText:

// Getting InnerText fails here
var InnerText = content.InnerText;

In the Google Chrome console, the equivalent JavaScript works normally:

document.querySelector("#C-Main-Article-QQ > div.bd").innerText

Please help me. Thanks very much.
About this issue
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 17 (6 by maintainers)
Hello All,
The v1.11.12 has been released.
script and style text will only appear in InnerText from head, script and style nodes.
It fixes your issue @JustArchi, @cyotek
However, if more errors are reported, we might just roll back all these changes or add an option to keep the current behavior, since this kind of change currently breaks some code, which is not something we really love to do.
Let me know if everything now works as expected.
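For anyone who wants to check which behaviour their installed version has, here is a minimal sketch. The sample markup, the class name InnerTextCheck and the Print helper are made up for illustration and do not come from this issue. It only prints InnerText for head, style, script and their parent containers so you can compare the results with the rule described above; the output depends on the HtmlAgilityPack version you reference.

```csharp
using System;
using HtmlAgilityPack;

class InnerTextCheck
{
    static void Main()
    {
        // Hypothetical sample markup, not taken from the page in the issue.
        const string html =
            "<html><head><title>Title</title><style>p { color: red; }</style></head>" +
            "<body><div class='pagecontent'><p>Hello</p><script>var x = 1;</script></div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Compare the nodes themselves with their parent containers.
        Print("head", doc.DocumentNode.SelectSingleNode("//head"));
        Print("style", doc.DocumentNode.SelectSingleNode("//style"));
        Print("script", doc.DocumentNode.SelectSingleNode("//div[@class='pagecontent']/script"));
        Print("div", doc.DocumentNode.SelectSingleNode("//div[@class='pagecontent']"));
        Print("body", doc.DocumentNode.SelectSingleNode("//body"));
    }

    // SelectSingleNode returns null when nothing matches, so guard with ?. here.
    static void Print(string label, HtmlNode node)
    {
        Console.WriteLine($"{label}: [{node?.InnerText}]");
    }
}
```

Running the same program against 1.11.9, 1.11.10, 1.11.11 and 1.11.12 is one way to see the differences reported in this thread.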
Hello all,
The v1.11.11 has been released.
Now only script and style are ignored in the InnerText.
Let me know if that version is working as expected.
yes
here can get
that’s a broken
Just a note that something broke for me in 1.11.10 related to InnerText.
Code:
That returns the expected text for the page title when run in 1.11.9, but an empty string in 1.11.10. 1.11.10 now puts the title in InnerHtml instead.
Likewise confirming that 1.11.12 is working fine for me.
Hello,
I just updated to 1.11.12 and can confirm none of my tests have failed, so all seems to be well regarding the new build. Hopefully it also addresses the OP’s issue too!
Thanks again for the fast response and fix.
Regards; Richard Moss
I’m not any less confused than I was before, but I can confirm that 1.11.12 works again for my use cases, thank you 😅.
I removed the latest version from NuGet.
I will get it fixed on Monday.
Best Regards,
Jonathan
I can confirm what @Kinematics said above: 1.11.10 has a fatal regression regarding InnerText, and cases that worked fine previously no longer do. Please investigate.
Hello @AtlantisDe,
Thank you for reporting, we will look at it.
Best Regards,
Jonathan
@JonathanMagnan I’ve tried the latest 1.11.11 and it suffers from the same issue as .10.
In particular, I’m doing InnerText of //div[@class='pagecontent']/script in order to extract the script content for my usage. With the last two releases it returns null there.
Maybe I’m doing something wrong or don’t understand the issue, but this used to work until now. Let me know if you need a reproducible case, but I’m pretty sure this will happen with any InnerText of script. Alternatively, if this is intended, then you should mark it as a breaking change and offer a proper rewrite, since personally I have no clue what I’m supposed to use instead.
Hello @cyotek,
Thank you for reporting,
It looks like you are somewhat right. The text in the script tag can appear in the head InnerText but not in the body. It’s not as simple as whether we show it or hide it… it depends on the parent tag.
We will look at it this week and try to have it work as the browser does.
Apologies for replying to a closed issue. I’ve just updated to 1.11.11 and been bitten by this change, with several tests failing, specifically in regard to the style element returning an empty string for InnerText. I’ve worked around the issue by simply checking if the type of the child itself is Text and, if so, reading InnerText directly from the child. So it is easy enough to work around, but I was curious about the rationale behind the change. I came across this issue whilst checking to see if anyone else had an issue or if I needed to file a new bug.
I tested opening a page in Firefox where a style element was present in a head element and then executing document.getElementsByTagName("head")[0].innerText and document.getElementsByTagName("style")[0].innerText in turn. The text for the head included the title text and the style element content. The text for style included the style. This matches the behaviour of 1.11.8 (I never got around to updating to 1.11.9 or 1.11.10) but not the behaviour of 1.11.11.
Next I tested another page which had JavaScript, and doing innerText on the script object returned the actual JavaScript. Doing innerText on the parent container did not include the JavaScript. I don’t have any tests which specifically examine the contents of script tags, so I don’t know if this matches the old HAP behaviour or not.
I think therefore that potentially the new implementation is still flawed, at least in regard to style, as calling innerText directly on a script or style element in a browser console returns the content as expected. Calling it on a parent element containing either of these elements returns the CSS for style and nothing for script.
I tested this in Firefox 68.
Don’t know if this is useful information or not, but I’m going to revert back to 1.11.8 until I know if I really need to start examining style elements differently or not.
Thanks; Richard Moss
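The workaround described in the comment above (checking whether a child is a Text node and reading InnerText from that child) could look roughly like the sketch below. The extension class and the method name GetDirectText are hypothetical and not part of HtmlAgilityPack. It assumes the content of a script or style element is exposed as child nodes of type HtmlNodeType.Text, which is what the comment relies on.

```csharp
using System.Linq;
using HtmlAgilityPack;

static class HtmlNodeTextExtensions
{
    // Hypothetical helper: concatenates the text of the node's direct
    // Text children instead of calling InnerText on the node itself.
    public static string GetDirectText(this HtmlNode node)
    {
        if (node == null)
        {
            return string.Empty;
        }

        return string.Concat(
            node.ChildNodes
                .Where(child => child.NodeType == HtmlNodeType.Text)
                .Select(child => child.InnerText));
    }
}
```

Usage would be something like doc.DocumentNode.SelectSingleNode("//style").GetDirectText(). Because the helper only looks at direct children, it does not collect text from nested elements, which is usually what you want for script and style.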
Hello @AtlantisDe ,
The v1.11.10 has been released.
Could you try it and let me know if this issue is correctly fixed on your side?