html-agility-pack: HtmlNode.InnerText Not working properly (JS get Working normally) maybe a bug?
Description
The HTML source: https://auto.qq.com/a/20120202/000205.htm
You can use this code to test:
var source = File.ReadAllText(@"D:\Tmp\auto.qq.com-001.html", Encoding.UTF8);
HtmlAgilityPack.HtmlDocument htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml(source);
var content = htmlDocument.DocumentNode.SelectSingleNode("//div[contains(@class, 'bd')]");
var InnerText = content.InnerText;
The exception (HtmlAgilityPack.HtmlNode.InnerText) occurs on this line:
// Reading InnerText fails here
var InnerText = content.InnerText;
In the Google Chrome console, the JavaScript equivalent works normally:
document.querySelector("#C-Main-Article-QQ > div.bd").innerText
Please help me.
Thanks very much.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 17 (6 by maintainers)
Hello All,
The v1.11.12 has been released.
The script and style text will only appear in the InnerText from head, script and style nodes.
It fixes your issue @JustArchi , @cyotek
However, if more errors are reported, we might roll back all these changes or add an option to keep the current behavior, since this kind of change currently breaks some code, which is not something we really love to do.
Let me know if everything now works as expected.
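To see which rule a given HtmlAgilityPack version applies, a small probe like the one below can be run against each release. This is only a sketch: the markup and the InnerTextProbe class are hypothetical and are not taken from the original report, and the printed values are expected to differ between the versions discussed in this thread.

```csharp
using System;
using HtmlAgilityPack;

internal static class InnerTextProbe
{
    private static void Main()
    {
        // Hypothetical markup for illustration; not the page from the original report.
        const string html =
            "<html><head><title>Demo</title><style>p { color: red; }</style></head>" +
            "<body><div class='bd'><p>Hello</p><script>var x = 1;</script></div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Compare InnerText on the container nodes and on the script/style nodes themselves.
        Print(doc, "//head");
        Print(doc, "//head/style");
        Print(doc, "//div[@class='bd']");
        Print(doc, "//div[@class='bd']/script");
    }

    private static void Print(HtmlDocument doc, string xpath)
    {
        // SelectSingleNode returns null when nothing matches the XPath expression.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        string text = node == null ? "(node not found)" : node.InnerText;
        Console.WriteLine(xpath + " -> [" + text + "]");
    }
}
```

Running the same program against 1.11.9, 1.11.10, 1.11.11, and 1.11.12 should make the behavior changes described in this thread visible side by side.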
Hello all,
The v1.11.11 has been released.
Now only script and style are ignored in the InnerText.
Let me know if that version is working as expected.
yes
here can get
that’s a broken
Just a note that something broke for me in 1.11.10 related to InnerText.
Code:
That returns the expected text for the page title when run in 1.11.9, but an empty string in 1.11.10. 1.11.10 now puts the title in InnerHtml instead.
Likewise confirming that 1.11.12 is working fine for me.
Hello,
I just updated to 1.11.12 and can confirm none of my tests have failed, so all seems to be well regarding the new build. Hopefully it also addresses the OP’s issue too!
Thanks again for the fast response and fix.
Regards; Richard Moss
I’m not any less confused than I was before, but I can confirm that 1.11.12 works again for my use cases, thank you 😅.
I removed the latest version from NuGet.
I will get it fixed on Monday.
Best Regards,
Jonathan
I can confirm what @Kinematics said above: 1.11.10 has a fatal regression regarding InnerText, and cases that worked fine previously no longer do. Please investigate.
Hello @AtlantisDe ,
Thank you for reporting, we will look at it.
Best Regards,
Jonathan
@JonathanMagnan I’ve tried the latest 1.11.11 and it suffers from the same issue as .10.
In particular, I’m reading the InnerText of //div[@class='pagecontent']/script in order to extract the script content for my usage. With the last two releases it returns null there.
Maybe I’m doing something wrong or don’t understand the issue, but this used to work until now. Let me know if you need a reproducible case, but I’m pretty sure this will happen with any InnerText of script. Alternatively, if this is intended, then you should mark it as a breaking change and offer a proper replacement, since personally I have no clue what I’m supposed to use instead.
Hello @cyotek ,
Thank you for reporting,
It looks like you are somewhat right. The text in the script tag can appear in the head InnerText but not in the body. It’s not as simple as showing it or hiding it; it depends on the parent tag.
We will look at it this week and try to have it work as the browser does.
Apologies for replying to a closed issue. I’ve just updated to 1.11.11 and been bitten by this change, with several tests failing, specifically in regards to the style element returning an empty string for InnerText. I’ve worked around the issue by simply checking if the type of the child itself is Text and, if so, reading InnerText directly from the child. So, easy enough to work around, but I was curious about the rationale behind the change. I came across this issue whilst checking to see if anyone else had an issue or if I needed to file a new bug.
I tested opening a page in Firefox where a style element was present in a head element and then executing document.getElementsByTagName("head")[0].innerText and document.getElementsByTagName("style")[0].innerText in turn. The text for the head included the title text and the style element content. The text for style included the style. This matches the behaviour of 1.11.8 (I never got around to updating to 1.11.9 or 1.11.10) but not the behaviour of 1.11.11.
Next I tested another page which had JavaScript, and doing innerText on the script object returned the actual JavaScript. Doing innerText on the parent container did not include the JavaScript. I don’t have any tests which specifically examine the contents of script tags, so I don’t know if this matches the old HAP behaviour or not.
I think therefore that potentially the new implementation is still flawed, at least in regards to style, as calling innerText directly on a script or style element in a browser console returns the content as expected. Calling it on a parent element containing either of these elements returns the CSS for style and nothing for script.
I tested this in Firefox 68.
Don’t know if this is useful information or not, but I’m going to revert back to 1.11.8 until I know if I really need to start examining style elements differently or not.
Thanks; Richard Moss
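The workaround described in the comment above (checking whether a child is a Text node and reading InnerText from that child) might look roughly like this. It is a sketch under assumptions, not the commenter's actual code: the DirectTextReader class, the ReadDirectText helper, and the sample markup are hypothetical names introduced here for illustration.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

internal static class DirectTextReader
{
    // Hypothetical helper: concatenates the text of the node's direct Text children,
    // reading InnerText from each child instead of from the element itself.
    public static string ReadDirectText(HtmlNode node)
    {
        var builder = new StringBuilder();
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                builder.Append(child.InnerText);
            }
        }
        return builder.ToString();
    }

    private static void Main()
    {
        // Hypothetical markup for illustration only.
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><head><style>p { color: red; }</style></head><body><p>Hi</p></body></html>");

        HtmlNode style = doc.DocumentNode.SelectSingleNode("//style");
        if (style != null)
        {
            Console.WriteLine(ReadDirectText(style));
        }
    }
}
```

Because the helper only looks at direct children, it does not depend on how a particular release decides which descendants contribute to an element's own InnerText.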
Hello @AtlantisDe ,
The v1.11.10 has been released.
Could you try it and let me know if this issue is correctly fixed on your side?