runtime: String.GetHashCode() consistently slower on dotnetcore than .Net Framework
Using this PowerShell script on Win10 17713:
$totalms = 0
1..10 | % { $totalms += (measure-command { foreach($i in 1..10000000) {$Null = "PowerShell6HasSlowerHashCode".GetHashCode()} }).totalmilliseconds }
$totalms/10
Computes the hash for the string 10M times and averages the time over 10 runs.
On .NET 4.7.2 it takes 12455ms. On the same machine with PSCore 6.1 w/ .NET Core 2.1.301 it takes 17579ms (roughly 1.4x slower).
This is causing a perf regression in PowerShell due to use of GetHashCode() calls.
About this issue
- State: closed
- Created 6 years ago
- Comments: 49 (45 by maintainers)
new Dictionary<string, ...>() now has the optimization which optimistically uses the “fast” hash code calculation routine, falling back to the randomized hash code calculation routine only when collisions are encountered. We also improved the performance of Marvin32 in the .NET Core 3.0 timeframe (see https://github.com/dotnet/coreclr/pull/22816). Additionally, per https://github.com/PowerShell/PowerShell/issues/7112, it looks like the original scenario which precipitated the opening of this issue has since been resolved in PowerShell Core.
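To make the fallback idea concrete, here is a conceptual sketch modeled from outside the runtime. Every type name, hash function, and threshold below is hypothetical; this is not how Dictionary<string, TValue> actually implements the optimization, only an illustration of "fast deterministic hash first, rebuild with the randomized comparer once collisions pile up":

```csharp
using System;
using System.Collections.Generic;

// Conceptual sketch only (all names hypothetical). Keys are hashed with a cheap
// deterministic function until too many of them share a hash code, at which
// point the table is rebuilt with the default (randomized) comparer.
public sealed class OptimisticStringDictionary<TValue>
{
    private const int CollisionThreshold = 100;   // hypothetical cutoff
    private readonly CountingFastComparer _fast = new CountingFastComparer();
    private Dictionary<string, TValue> _inner;
    private bool _randomized;

    public OptimisticStringDictionary()
    {
        _inner = new Dictionary<string, TValue>(_fast);
    }

    public TValue this[string key]
    {
        get => _inner[key];
        set
        {
            _inner[key] = value;
            if (!_randomized && _fast.MaxDuplicateHashes > CollisionThreshold)
            {
                // Too many keys hash alike: rehash everything with the default
                // (randomized) string comparer and keep using it from now on.
                _inner = new Dictionary<string, TValue>(_inner, EqualityComparer<string>.Default);
                _randomized = true;
            }
        }
    }

    public bool TryGetValue(string key, out TValue value) => _inner.TryGetValue(key, out value);

    // Deterministic ordinal comparer that tracks how often any single hash code
    // has been produced (a crude stand-in for per-bucket collision counting).
    private sealed class CountingFastComparer : IEqualityComparer<string>
    {
        private readonly Dictionary<int, int> _seen = new Dictionary<int, int>();
        public int MaxDuplicateHashes { get; private set; }

        public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);

        public int GetHashCode(string s)
        {
            int h = 5381;                          // djb2-style, not Marvin32
            foreach (char c in s)
                h = unchecked(h * 33 + c);

            _seen.TryGetValue(h, out int n);
            _seen[h] = ++n;
            if (n > MaxDuplicateHashes)
                MaxDuplicateHashes = n;
            return h;
        }
    }
}
```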
Since it looks like there’s nothing left to do on the runtime side of things I’m closing the issue for now. Please reopen this issue if needed. Thanks!
My understanding of things is that randomized hash codes are there to protect against a denial-of-service attack where you don’t trust your user. While there are instances where this is very important (a service open to all), in many cases it is just not interesting because you trust the user: if your users already ‘own’ the machine (e.g. this is an ‘ordinary’ app), then you don’t care, since they would only be creating a DoS against themselves.
Another scenario is simply compat (my software took a deep dependency, and I need something!). This case is probably where @mjsabby is coming from. Here you are simply willing to trade off security (often because it is no worse than what it is replacing).
So I think there are useful scenarios for an opt-out. (This came up in the context of PowerShell, and it is an interesting question whether they should opt out.) One could argue that this is what they do de facto on the desktop runtime (which is most of their user base today).
So far, it seems that there are better places to look for perf, and it can be mitigated in important cases like what @benaadams did. Until someone actually does the PR for the opt-out, the issue won’t be forced…
(security hat on)
I don’t like the idea of adding an application-wide opt-out switch for the randomized hash code calculation. There’s a lot of code in the framework and in third-party libraries that either explicitly or implicitly depends on this behavior for its security guarantees. Consider
Dictionary<Uri, ...> or some other TKey where an untrusted string instance is somehow worked into the hash code calculation. (In fact, we now make this very easy to do with the introduction of the HashCode type!) Making it per-collection (such as via an explicit StringComparer instance passed in to the collection’s ctor) has the benefit that the developer is saying “I know this particular usage of non-randomized hash codes is safe,” and this statement is made at collection construction time. Contrast this with an application-wide switch, where the developer making that statement probably hasn’t taken the time to analyze their full dependency graph and validate that each dependency is safe in the face of predictable hash codes. Having collections optimistically use a non-randomizing hash (for TKey = string) and fall back to a randomizing hash in the face of collisions is also a viable option, as it’s again a per-collection decision rather than an application-wide setting. It means that scenarios like TKey = System.Uri aren’t optimized and continue to use randomizing hash code calculations, but this makes them safe by default, which is goodness.

I want to summarize my understanding of where we are.
The next step is for me to look at the traces. I will do this later today…
First, it is naive to think that everything can be made faster. Many key perf benefits are usually trade-offs - some scenarios get faster at the cost of other scenarios. Of course, we’re trying hard to focus on the most important scenarios to get faster, while not making other important scenarios (much) slower. Obviously sometimes perf improvements can bring unexpected perf regressions in other scenarios. Sometimes we may misjudge the importance of certain scenarios. And sometimes some products just use weird, unique patterns which we don’t consider important, but they hurt those specific products (in which case we often recommend the products migrate to better and more performant APIs). Again, not everything can be made just better at all times. Expecting that everything is just better, no matter what you use and how, is unrealistic in any large, mature project or platform.
With the previous paragraph in mind, the best next step is to investigate the perf difference and find the root cause. Is PS using some weird pattern which can be replaced with a better one? Or did we truly accidentally hurt the perf of a scenario which is important for more than PS? (BTW: so far only PS has reported it.)
Agreed that it is good. However, I also want to clarify expectations in general: we often do not have the capacity to jump on every single report of allegedly slower perf or a functional regression, especially when there is lots of in-the-middle code involved (like PS), mainly because such investigations often require understanding of the higher-level framework code base to be done efficiently and to figure out where the perf fix could/should be (which may NOT be in the .NET Core layer). That’s why we often ask for repros without in-the-middle frameworks (like PS), or we ask users/higher-level frameworks to do the first-level analysis (ideally leading to an isolated repro without the higher-level framework involved at all; the original report in this issue is a great example of that). Obviously we are more attentive to problems with larger impact, showing signs of hitting multiple customers across multiple higher-level frameworks; those are likely to be unintended regressions we caused by accident or by mistake in our layer.
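As an example of the kind of isolated repro being asked for, a standalone C# version of the original measurement could look like the sketch below. The string and iteration counts are copied from the PowerShell script; everything else is illustrative:

```csharp
using System;
using System.Diagnostics;

// Standalone repro sketch (no PowerShell involved): time 10M calls to
// string.GetHashCode() and average over 10 runs, mirroring the original
// PowerShell measurement. Run the same source on .NET Framework and
// .NET Core to compare.
class Program
{
    static void Main()
    {
        const string s = "PowerShell6HasSlowerHashCode";
        const int iterations = 10_000_000;
        const int runs = 10;

        double totalMs = 0;
        long checksum = 0;   // keep the result live so the call isn't dead code
        for (int run = 0; run < runs; run++)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
                checksum += s.GetHashCode();
            sw.Stop();
            totalMs += sw.Elapsed.TotalMilliseconds;
        }

        Console.WriteLine($"Average: {totalMs / runs:F1} ms (checksum {checksum})");
    }
}
```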
It would be best to support it with data showing how much it could help end-to-end scenarios (e.g. PS ones). That will help us prioritize such improvements against other bugs and enhancements. It is unfortunate that this issue became a mix-up of multiple problems: the fact that GetHashCode became slower (partially by design, with potential room for improvement via vectorizing) vs. the end-to-end PS scenario being significantly slower (without a clear root cause at this point).
Given the mix-up on this issue, I would recommend closing it and filing a separate issue for the potential GetHashCode improvement via vectorization (with data on how much it contributes to end-to-end scenarios like PS). We can continue root-causing the “general perf regression” either in the PS issue, or in a new one in CoreFX, and see what it turns into.
We have done that already. There are no major bottlenecks that we are aware of. We believe that the current implementation of GetHashCode is the right choice for .NET Core even though it is slower to compute. Note that it produces much higher quality hashcodes and it is not hard to come up with examples where it makes the end-to-end execution faster because of reduced hash collisions.
If the .NET Core string hash function does not work well in some cases, our recommendation is to replace it with a custom hash code function specifically designed for the given case. For example, Roslyn has their own implementation of string hash function that works well for names used as C# identifiers.
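As a sketch of that recommendation (and of the per-collection opt-in discussed earlier in the thread), a hand-rolled deterministic comparer can be scoped to a single dictionary. FNV-1a is used below purely as an illustrative choice; it is not what Roslyn or the runtime uses, and the type names are made up:

```csharp
using System;
using System.Collections.Generic;

// Example of scoping a non-randomized hash to one collection: a custom
// IEqualityComparer<string> with a simple deterministic FNV-1a hash.
sealed class DeterministicOrdinalComparer : IEqualityComparer<string>
{
    public static readonly DeterministicOrdinalComparer Instance = new DeterministicOrdinalComparer();

    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);

    public int GetHashCode(string s)
    {
        unchecked
        {
            uint hash = 2166136261;        // FNV-1a offset basis
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619;          // FNV-1a prime
            }
            return (int)hash;
        }
    }
}

class Example
{
    static void Main()
    {
        // Only this dictionary opts out of randomized hashing; its keys must be
        // trusted, since an attacker who controls them could force collisions.
        var map = new Dictionary<string, int>(DeterministicOrdinalComparer.Instance)
        {
            ["PowerShell6HasSlowerHashCode"] = 1
        };
        Console.WriteLine(map["PowerShell6HasSlowerHashCode"]);
    }
}
```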
@stephentoub The randomized seed should only introduce a small initial setup cost; I don’t think it can impact GetHashCode() this much. There were, however, a few performance regressions in the Marvin hash that were attributed to the removal of a couple of AggressiveInlining attributes, and I don’t recall whether all of those regressions were found and fixed.
@vancem
After that “Thus .NET Core time is 1.23X the Desktop time (23% slower)”, I believe that developers will postpone migration to .NET Core - this is too noticeable a slowdown. I expect that the .NET Core team will mobilize and make a significant investment to find bottlenecks and remedy them. Just this issue shows a two-fold slowdown in GetHashCode(). Perhaps this is not a key issue for PowerShell, but it is definitely a problem for .NET Core as a whole.
@vancem
Many thanks! This issue is for tracking GetHashCode() - I suggest moving the PowerShell discussion to https://github.com/PowerShell/PowerShell/issues/7112
If the code (I mean the cmdlets) has not changed but now runs twice as slowly, then it is all the more necessary to look into .NET Core.
@danmosemsft This issue is talking about GetHashCode() being slower on .NET Core vs .NET Framework, the PowerShell issue has shed some light onto it but I think it is independent of this. It may very well be that the PowerShell issue is a set of issues but I would think we want to improve GetHashCode on .NET Core, both by improving the code quality and in my opinion also providing the ability to quirk it off if needed.
In that sense I think closing this issue is premature.
In the original issue opened by @iSazonov, he noticed that
Import-CSV was about 2x slower for a large (>100MB) csv file. However, we are talking about ~3.7s to ~6s in absolute terms. GetHashCode() is used quite a bit in the PowerShell code base. However, I can certainly see the security benefit. @iSazonov are you ok with the perf degradation for your specific use case? If we find more common scenarios with significant impact, we should revisit this.