html-agility-pack: HtmlNode.InnerText not working properly (the JS equivalent works normally), maybe a bug?

Description

The HTML source: https://auto.qq.com/a/20120202/000205.htm

You can use this code to test:

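// Load the saved copy of the page (the path is the reporter's local file).
var source = File.ReadAllText(@"D:\Tmp\auto.qq.com-001.html", Encoding.UTF8);
HtmlAgilityPack.HtmlDocument htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml(source);
// Select the article body container.
var content = htmlDocument.DocumentNode.SelectSingleNode("//div[contains(@class, 'bd')]");
// Read the text of the container; this is the value that comes back wrong.
var InnerText = content.InnerText;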

HtmlAgilityPack.HtmlNode.InnerText returns the wrong result for content.InnerText (screenshot omitted):

// getting InnerText is not OK
var InnerText = content.InnerText;

In the Google Chrome console, the JS equivalent works normally (screenshot omitted):

document.querySelector("#C-Main-Article-QQ > div.bd").innerText

Please help me.

Thanks very much.

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 17 (6 by maintainers)

Most upvoted comments

Hello All,

The v1.11.12 has been released.

Script and style text will now only appear in InnerText when it is read from a head, script, or style node.

It fixes your issue, @JustArchi and @cyotek.

However, if more errors are reported, we might roll back all these changes or add an option to keep the current behavior, since this kind of change breaks existing code, which is not something we like to do.

Let me know if everything now works as expected.
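
To check how a given release behaves against the rule described above, a small probe can print InnerText for the relevant nodes. This is only a sketch, assuming the HtmlAgilityPack NuGet package is referenced; `InnerTextProbe`, `SampleHtml`, and `Run` are hypothetical names, and the sample markup is made up for illustration. No expected output is given because it depends on the installed version.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical probe, not part of HtmlAgilityPack.
public static class InnerTextProbe
{
    // Made-up sample page: a style in the head and a script in the body.
    public const string SampleHtml =
        "<html><head><title>T</title><style>p { color: red; }</style></head>" +
        "<body><div><p>Hello</p><script>var x = 1;</script></div></body></html>";

    // Returns InnerText for each probed node, keyed by the XPath used.
    public static Dictionary<string, string> Run()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);

        var result = new Dictionary<string, string>();
        foreach (string xpath in new[] { "//head", "//style", "//script", "//div" })
        {
            HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
            result[xpath] = node?.InnerText; // null if the node was not found
        }
        return result;
    }
}
```

Printing each key and value of `InnerTextProbe.Run()` under two package versions makes the difference between releases visible without touching production code.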

Hello all,

The v1.11.11 has been released.

Now only script and style are ignored in InnerText.

Let me know if that version is working as expected.

Just a note that something broke for me in 1.11.10 related to InnerText.

Code:

doc.Element("html").Element("head")?.Element("title")?.InnerText

That returns the expected text for the page title when run in 1.11.9, but an empty string in 1.11.10. 1.11.10 now puts the title in InnerHtml instead.

Yes, I see the same thing.

This still returns the title:

var title_1 = htmlDocument.DocumentNode.Element("html").Element("head").Element("title").InnerHtml;

This is broken:

var title_2 = htmlDocument.DocumentNode.Element("html").Element("head").Element("title").InnerText;
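
As a stopgap for a release where InnerText comes back empty for the title, one option is to read InnerText first and fall back to InnerHtml, as the two lines above suggest. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; `TitleReader` and `GetTitle` are hypothetical names, not library API, and upgrading to a fixed release is likely the better answer.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class TitleReader
{
    // Returns the page title, or null when the document has no title element.
    public static string GetTitle(HtmlDocument doc)
    {
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//head/title");
        if (title == null) return null;

        // Prefer InnerText; fall back to InnerHtml when InnerText is empty.
        string text = title.InnerText;
        return string.IsNullOrEmpty(text) ? title.InnerHtml : text;
    }
}
```

Note that InnerHtml returns raw markup, so the fallback is only equivalent when the title contains plain text.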

Likewise confirming that 1.11.12 is working fine for me.

Hello,

I just updated to 1.11.12 and can confirm none of my tests have failed, so all seems to be well regarding the new build. Hopefully it addresses the OP’s issue too!

Thanks again for the fast response and fix.

Regards; Richard Moss

I’m not any less confused than I was before, but I can confirm that 1.11.12 works again for my use cases, thank you 😅.

I removed the latest version from NuGet.

I will get it fixed on Monday.

Best Regards,

Jonathan

I can confirm what @Kinematics said above: 1.11.10 has a fatal regression regarding InnerText, and cases that worked fine previously no longer do. Please investigate.

Hello @AtlantisDe ,

Thank you for reporting, we will look at it.

Best Regards,

Jonathan


@JonathanMagnan I’ve tried the latest 1.11.11 and it suffers from the same issue as 1.11.10.

In particular, I’m reading InnerText of //div[@class='pagecontent']/script in order to extract the script content for my usage. With the last two releases it returns null there.

Maybe I’m doing something wrong or don’t understand the issue, but this used to work until now. Let me know if you need a reproducible case, but I’m pretty sure this will happen with any InnerText of a script. Alternatively, if this is intended, then you should mark it as a breaking change and offer a proper replacement, since personally I have no clue what I’m supposed to use instead.

Hello @cyotek ,

Thank you for reporting,

It looks like you are somewhat right. The text in the script tag can appear in the head InnerText but not in the body. It’s not as simple as showing it or hiding it… it depends on the parent tag.

We will look at it this week and try to have it work as the browser does.

Apologies for replying to a closed issue. I’ve just updated to 1.11.11 and been bitten by this change with several tests failing, specifically in regards to the style element returning an empty string for InnerText. I’ve worked around the issue by simply checking if the type of the child itself is Text and, if so, reading InnerText directly from the child. So, easy enough to work around, but I was curious about the rationale behind the change. I came across this issue whilst checking to see if anyone else had a problem or if I needed to file a new bug.
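
The workaround described above could be sketched roughly as follows, assuming the HtmlAgilityPack NuGet package is referenced. `TextWorkaround` and `DirectText` are hypothetical names, not library API; the helper concatenates only the direct children whose NodeType is HtmlNodeType.Text instead of calling InnerText on the parent element.

```csharp
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class TextWorkaround
{
    // Concatenates the InnerText of the direct child text nodes of `node`.
    // Text inside nested elements is deliberately not included.
    public static string DirectText(HtmlNode node)
    {
        if (node == null) return string.Empty;

        var sb = new StringBuilder();
        foreach (HtmlNode child in node.ChildNodes)
        {
            // Only text children are read; elements and comments are skipped.
            if (child.NodeType == HtmlNodeType.Text)
                sb.Append(child.InnerText);
        }
        return sb.ToString();
    }
}
```

Calling `TextWorkaround.DirectText(styleNode)` on a style or script element reads the child text nodes directly, which sidesteps whatever filtering the parent's InnerText applies in a given release.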

I tested opening a page in Firefox where a style element was present in a head element and then executing document.getElementsByTagName("head")[0].innerText and document.getElementsByTagName("style")[0].innerText in turn. The text for the head included the title text and the style element content. The text for style included the style. This matches the behaviour of 1.11.8 (I never got around to updating to 1.11.9 or 1.11.10) but not the behaviour of 1.11.11.

Next I tested another page which had JavaScript, and calling innerText on the script object returned the actual JavaScript. Calling innerText on the parent container did not include the JavaScript. I don’t have any tests which specifically examine the contents of script tags, so I don’t know if this matches the old HAP behaviour or not.

I think therefore that potentially the new implementation is still flawed, at least in regards to style, as calling innerText directly on a script or style element in a browser console returns the content as expected. Calling it on a parent element containing either of these elements returns the CSS for style and nothing for script.

I tested this in Firefox 68.

Don’t know if this is useful information or not, but I’m going to revert back to 1.11.8 until I know if I really need to start examining style elements differently or not.

Thanks; Richard Moss

Hello @AtlantisDe ,

The v1.11.10 has been released.

Could you try it and let me know if this issue is correctly fixed on your side?