AngleSharp: Unable to remove attributes from element whose name has accents from different culture

The following code snippet will result in an infinite loop

HtmlParser parser = new HtmlParser();
var document = parser.Parse("<TAG MÍ=\"\" />");
var element = (IElement)document.Body.FirstChild;
while (element.Attributes.Length > 0)
{
    element.RemoveAttribute(element.Attributes[0].Name);
}

I didn’t have time to investigate deeply, but I noticed in Element.cs there’s two calls to ToLower and two to ToLowerInvariant. I don’t know any Spanish, but I think the Í character is capital, so my guess is that some parts of the code is converting it to lowercase but some parts are not. What particularly surprises me is that no exception is being thrown, as I see what appears to be some error checking in NamedNodeMap.

Finally, thanks for a really great library. It’s been really helpful.

About this issue

Original URL
State: closed
Created 8 years ago
Comments: 17 (10 by maintainers)

Commits related to this issue

Added test to verify #419 — committed to FlorianRappl/AngleSharp by FlorianRappl 8 years ago
Special HtmlLower() to follow specification fixes #419 — committed to FlorianRappl/AngleSharp by FlorianRappl 8 years ago
Merge pull request #421 from FlorianRappl/issue-#419 Issue #419 — committed to AngleSharp/AngleSharp by FlorianRappl 8 years ago

Most upvoted comments

I now know the source of the error. It is tricky …

I tried running this in the browser

<!DOCTYPE html>
<html>

<body>
<TAG MÍ="" />
<script>
var element = document.body.firstElementChild;
var name = element.attributes[0].name;
alert(name);
alert(element.attributes.length);
element.removeAttribute(name);
alert(element.attributes.length);
</script>
</body>

</html>

and apparently the result is mÍ / 1 / 1 with Chromium (e.g., Chrome, Opera). However, IE and Firefox seem to handle this well (we get … / 0).

HTML defines what is considered uppercase for the tokenizer (on a char-by-char basis), but does not really refer to this range for the DOM model. It is now clear, however, that the same rules need to be applied (should have been obvious, but I seemed to miss that connection).

In my naive way of thinking I used the ToLower (or ToLowerInvariant) methods from .NET here, but I need to provide my own extension method following the W3C specification for the definition of “what is an upper char”.

I also filed a bug for Chromium (see https://bugs.chromium.org/p/chromium/issues/detail?id=651946). I don’t know if this will be fixed there.

FlorianRappl on Sep 30, 2016

Yes, method calls are fast, but C# is an indirect language. Hence you’ll pay for every reference by a potential cache miss. Of course, the CLR does a pretty good job on hiding / minimizing this, but in hot paths (and the attribute one is certainly such a hot path, e.g., for QuerySelector) this may add up. I am not sure and the architectural elegance may in the end just outweigh the performance savings (for most code this is certainly the case).

That’s why I always run the performance benchmarks before/after such a change (i.e., I am not sure about the impact; all I did was explaining my fears / another arguments against that change - which would indeed by more elegant).

Long story short: Thanks for your input; even though it potentially is not applicable in the described scenario it is certainly helpful in others.

FlorianRappl on Sep 30, 2016

Here ToLower is used to accord to the specification. Attributes that are within the HTML namespace need to be transformed to lower-case. Therefore, if querying against HTML attributes the generic comparison algorithm is used (case-sensitive), with an input parameter unknown of the namespace. Thus, if the incoming namespace is HTML as well, the transformation takes care of comparing the right things.

So what I referred to is that we do not use ToLower to fake some case-insensitive comparison, but rather to follow the specification. The spec. dictates that HTML attributes are only stored lower case. It also dictates that (only) HTML attributes are compared in a case-insensitive way. So the same transformation that is applied for storing, needs to be applied to the input to compare the right things.

If we would only care about HTML then I would totally feel fine about only using the comparison (however, serialization would then have to use ToLower again - quite ugly), but some elements may be SVG, MathML, or even elements from some other namespace. AngleSharp carefully follows the spec. here, meaning the transformation was a necessary evil.

I’ll see what I can do (the invariant path is the one I want to take). Maybe there is a more efficient way to solve this (e.g., boolean flag to determine if case-insensitive comparison should be used).

FlorianRappl on Sep 30, 2016