runtime: String.GetHashCode() consistently slower on .NET Core than .NET Framework

Using this PowerShell script on Windows 10 build 17713:

$totalms = 0
1..10 | ForEach-Object { $totalms += (Measure-Command { foreach ($i in 1..10000000) { $null = "PowerShell6HasSlowerHashCode".GetHashCode() } }).TotalMilliseconds }
$totalms / 10

Computes the hash for the string 10M times and averages the time over 10 runs.

On .NET Framework 4.7.2 it takes 12455 ms; on the same machine with PowerShell Core 6.1 on .NET Core 2.1.301 it takes 17579 ms.

This is causing a perf regression in PowerShell due to its heavy use of GetHashCode() calls.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 49 (45 by maintainers)

Most upvoted comments

new Dictionary<string, ...>() now has the optimization that optimistically uses the “fast” hash code calculation routine, falling back to the randomized routine only when collisions are encountered. We also improved the performance of Marvin32 in the .NET Core 3.0 timeframe (see https://github.com/dotnet/coreclr/pull/22816).

Additionally, per https://github.com/PowerShell/PowerShell/issues/7112, it looks like the original scenario which precipitated the opening of this issue has since been resolved in PowerShell Core.

Since it looks like there’s nothing left to do on the runtime side of things I’m closing the issue for now. Please reopen this issue if needed. Thanks!

My understanding is that randomized hash codes are there to protect against a denial-of-service attack where you don’t trust your user. While there are instances where this is very important (a service open to all), in many cases it is just not interesting because you trust the user: if your users already ‘own’ the machine (e.g. this is an ‘ordinary’ app), then you don’t care, because they would only be creating a DoS against themselves.

Another scenario is simply compat (“my software took a deep dependency, and I need something!”). This is probably where @mjsabby is coming from. Here you are simply willing to trade off security, often because it is no worse than what it is replacing.

So I think there are useful scenarios for an opt-out. (This came up in the context of PowerShell, and it is an interesting question whether they should opt out; one could argue that this is what they do de facto on the desktop runtime, which is where most of their user base is today.)

So far, it seems that there are better places to look for perf, and it can be mitigated in important cases, as @benaadams did. Until someone actually does the PR for the opt-out, the issue won’t be forced…

(security hat on)

I don’t like the idea of adding an application-wide opt-out switch for the randomized hash code calculation. There’s a lot of code in the framework and in third-party libraries that either explicitly or implicitly depends on this behavior for its security guarantees. Consider Dictionary<Uri, ...> or some other TKey where an untrusted string instance is somehow worked into the hash code calculation. (In fact, we now make this very easy to do with the introduction of the HashCode type!)
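
As a concrete illustration of that concern (a hypothetical type, not one from the thread): any key type that folds a possibly untrusted string into its hash code via HashCode silently inherits the string hashing policy, so an application-wide opt-out would weaken it as well.

using System;

// Illustration only (hypothetical type): a composite dictionary key that mixes an
// attacker-influenced string into its hash code via System.HashCode. Its collision
// resistance depends on string hashing staying randomized, which is why an
// application-wide opt-out switch would be risky.
readonly struct TenantKey : IEquatable<TenantKey>
{
    public TenantKey(string tenantName, int region)
    {
        TenantName = tenantName;
        Region = region;
    }

    public string TenantName { get; }   // may come from untrusted input (headers, URLs, ...)
    public int Region { get; }

    public bool Equals(TenantKey other) =>
        Region == other.Region &&
        string.Equals(TenantName, other.TenantName, StringComparison.Ordinal);

    public override bool Equals(object obj) => obj is TenantKey other && Equals(other);

    // HashCode.Combine hashes the string component, so this key is only as
    // collision-resistant as String.GetHashCode itself.
    public override int GetHashCode() => HashCode.Combine(TenantName, Region);
}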

Making it per-collection (such as via an explicit StringComparer instance passed in to the collection’s ctor) has the benefit that the developer is saying “I know this particular usage of non-randomized hash codes is safe,” and this statement is made at collection construction time. Contrast this with an application-wide switch, where the developer making that statement probably hasn’t taken the time to analyze their full dependency graph and validate that each dependency is safe in the face of predictable hash codes.

Having collections optimistically use a non-randomizing hash (for TKey = string) and fall back to a randomizing hash in the face of collisions is also a viable option, as it’s again a per-collection decision rather than an application-wide setting. It means that scenarios like TKey = System.Uri aren’t optimized and continue to use randomizing hash code calculations, but this makes them safe by default, which is goodness.
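
A minimal sketch of both approaches, using made-up names rather than the runtime’s actual implementation: a deterministic comparer is opted into explicitly, per collection, at construction time, and an optimistic helper falls back to the default randomized comparer once too many hash collisions are observed.

using System;
using System.Collections.Generic;

// Sketch only: made-up names, not the runtime's implementation. A deterministic
// (non-randomized) string comparer is opted into explicitly, per collection, at
// construction time; a simple collision check falls back to the default randomized
// comparer if the deterministic hash turns out to be under attack or just unlucky.
sealed class NonRandomizedOrdinalComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);

    public int GetHashCode(string s)
    {
        // FNV-1a: cheap and stable across runs, but predictable by design.
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in s)
                hash = (hash ^ c) * 16777619;
            return (int)hash;
        }
    }
}

static class Demo
{
    // Per-collection opt-in: the developer asserts this particular table only ever
    // holds trusted keys, so the predictable-but-fast hash is acceptable here.
    static readonly Dictionary<string, int> TrustedLookup =
        new Dictionary<string, int>(new NonRandomizedOrdinalComparer());

    // Optimistic variant: start with the fast comparer, then rebuild the table with
    // the default (randomized) comparer if too many distinct keys share a hash code.
    public static Dictionary<string, int> BuildOptimistically(IEnumerable<string> keys, int collisionLimit = 100)
    {
        var fast = new NonRandomizedOrdinalComparer();
        var table = new Dictionary<string, int>(fast);
        var seenHashes = new HashSet<int>();
        bool randomized = false;
        int collisions = 0, next = 0;

        foreach (string key in keys)
        {
            if (!randomized)
            {
                // A new key whose hash was already seen is a genuine collision.
                if (!table.ContainsKey(key) && !seenHashes.Add(fast.GetHashCode(key)))
                    collisions++;

                if (collisions > collisionLimit)
                {
                    // Fall back: rehash everything with the randomized default comparer.
                    table = new Dictionary<string, int>(table, EqualityComparer<string>.Default);
                    randomized = true;
                }
            }

            table[key] = next++;
        }

        return table;
    }
}

The key property in both cases is that the decision is local to a single collection whose keys the developer has reasoned about, rather than a process-wide switch.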

I want to summarize my understanding of where we are.

  1. We understand that .NET Core has a different String.GetHashCode, and that it is modestly (23%) slower in a microbenchmark.
  2. There is a PowerShell scenario (CSV processing) that is reported to be 40% slower = 1.4X (17579ms / 12455ms); however, looking at the profile shows that String.GetHashCode CANNOT be the issue (it only consumes 1% of the total CPU time).
  3. Thus this issue is at best poorly named, since it is about String.GetHashCode. We are likely to close it. We are leaving it open right now just to track the investigation of the 40% slowdown, but shortly we are likely to transition to another issue one way or the other.
  4. The .NET Runtime team wants to make sure that .NET Core is not noticeably slower (preferably faster) in all interesting scenarios. This is why we are looking at the regression. However, this is almost certainly not going to force a change to String.GetHashCode (we may yet do that, but we want to see real scenarios that are impacted by the known 23% regression in GetHashCode first).

The next step is for me to look at the traces. I will do this later today…

But when I run the same code on .NET Core and see that it is noticeably slower and the GC behaves worse, what conclusion should I draw? Wait ten years for a .NET Core version 4.7? I expect a .NET Core app to work better than on .NET Framework, and if it is worse, then by no more than a few percent.

First, it is naive to think that everything can be made faster. Many key perf benefits are trade-offs: some scenarios get faster at the cost of other scenarios. Of course, we try hard to focus on making the most important scenarios faster while not making other important scenarios (much) slower. Obviously, sometimes perf improvements can bring unexpected perf regressions in other scenarios. Sometimes we may misjudge the importance of certain scenarios. And sometimes products just use weird, unique patterns which we don’t consider important but which hurt those specific products (in which case we often recommend that the products migrate to better and more performant APIs). Again, not everything can be made better at all times. Expecting that everything is just better, no matter what you use and how, is unrealistic in any large, mature project or platform.

It turned out that this is not acceptable, because the difference from Windows PowerShell was many hours, even though it was the same code! If someone asks about porting a heavily loaded web application to .NET Core, we now have to say: first of all, buy additional resources (CPU and memory), 23% more at best and 100% more if you are unlucky. For such applications, that can be a lot of money.

With the previous paragraph in mind, the best next step is to investigate the perf difference and find the root cause. Is PS using some weird pattern which can be replaced with a better one? Or did we truly, accidentally hurt the perf of a scenario that matters to more than just PS? (BTW: so far only PS has reported it.)

I am very glad that @vancem agreed to look into this problem deeply. Then I think we will know exactly why PowerShell Core consumes so much CPU and memory, and what we need to fix in PowerShell and/or .NET Core.

Agreed that it is good. However, I also want to clarify expectations in general: we often do not have the capacity to jump on every single report of allegedly slower perf or a functional regression, especially when there is lots of in-the-middle code involved (like PS), mainly because such investigations often require an understanding of the higher-level framework code base to be done efficiently and to figure out where the perf fix could/should be (which may NOT be in the .NET Core layer). That’s why we often ask for repros without in-the-middle frameworks (like PS), or we ask users/higher-level frameworks to do the first-level analysis, ideally leading to an isolated repro without the higher-level framework involved at all; the original report in this issue is a great example of that. Obviously we are more attentive to problems that have larger impact and show signs of hitting multiple customers across multiple higher-level frameworks; those are likely to be unintended regressions we caused by accident or by mistake in our layer.
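
For reference, a first-level isolated repro for the microbenchmark at the top of this issue could drop PowerShell entirely and time String.GetHashCode directly. This is only a sketch; for serious measurements a harness such as BenchmarkDotNet is the better tool.

using System;
using System.Diagnostics;

// Sketch of an isolated repro with no PowerShell in the middle: time 10 million
// String.GetHashCode() calls on the same literal the original script uses.
class Program
{
    static void Main()
    {
        string s = "PowerShell6HasSlowerHashCode";
        int sink = 0;

        // Warm up so JIT compilation is not part of the measurement.
        for (int i = 0; i < 1_000; i++) sink ^= s.GetHashCode();

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10_000_000; i++) sink ^= s.GetHashCode();
        sw.Stop();

        Console.WriteLine($"{sw.ElapsedMilliseconds} ms (sink={sink})");
    }
}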

I still believe that this method is the root cause and is worth making faster.

It would be best to support that with data showing how much it could help end-to-end scenarios (e.g. PS ones). That will help us prioritize such improvements against other bugs and enhancements. It is unfortunate that this issue became a mix-up of multiple problems: the fact that GetHashCode became slower (partially by design, with potential room for improvement via vectorization) vs. the fact that the end-to-end PS scenario is significantly slower (without a clear root cause at this point).

Given the mix-up on this issue, I would recommend closing it and filing a separate issue for the potential GetHashCode improvement via vectorization (with data on how much it contributes to end-to-end scenarios like PS). We can continue root-causing the “general perf regression” either in the PS issue or in a new one in CoreFX and see what it turns into.

I expect that the .NET Core team will mobilize and make a significant investment to find bottlenecks and remedy them.

We have done that already. There are no major bottlenecks that we are aware of. We believe that the current implementation of GetHashCode is the right choice for .NET Core even though it is slower to compute. Note that it produces much higher-quality hash codes, and it is not hard to come up with examples where it makes the end-to-end execution faster because of reduced hash collisions.
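
To make the collision point concrete, here is a sketch (assumptions: the names are made up, the weak hash below is a classic 31-based polynomial rather than the legacy .NET Framework algorithm, and the key set is constructed specifically to collide under it). Inserting those keys into a dictionary driven by the weak hash degrades toward quadratic behavior, while the default randomized comparer is unaffected.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Sketch: "Aa" and "BB" hash identically under a 31-based polynomial hash, so every
// string built from those two blocks collides. A weak deterministic hash lets an input
// like this push dictionary inserts toward O(n^2); the randomized default does not.
sealed class Weak31Comparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);

    public int GetHashCode(string s)
    {
        unchecked
        {
            int h = 0;
            foreach (char c in s) h = h * 31 + c;
            return h;
        }
    }
}

class CollisionDemo
{
    static void Main()
    {
        // 2^14 distinct 28-character strings, all with the same Weak31Comparer hash.
        string[] keys = Enumerable.Range(0, 1 << 14)
            .Select(i => string.Concat(Enumerable.Range(0, 14)
                .Select(bit => ((i >> bit) & 1) == 0 ? "Aa" : "BB")))
            .ToArray();

        Console.WriteLine($"weak deterministic hash: {Time(new Weak31Comparer(), keys)} ms");
        Console.WriteLine($"default randomized hash: {Time(EqualityComparer<string>.Default, keys)} ms");
    }

    static long Time(IEqualityComparer<string> comparer, string[] keys)
    {
        var sw = Stopwatch.StartNew();
        var table = new Dictionary<string, int>(keys.Length, comparer);
        foreach (string key in keys) table[key] = 0;
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}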

If the .NET Core string hash function does not work well in some cases, our recommendation is to replace it with a custom hash code function specifically designed for the given case. For example, Roslyn has its own implementation of a string hash function that works well for names used as C# identifiers.

@stephentoub The randomized seed should only introduce a small initial setup cost; I don’t think it can impact GetHashCode() this much. There were, however, a few performance regressions in the Marvin hash that were attributed to the removal of a couple of AggressiveInlining attributes, and I don’t recall whether all of those regressions were found and fixed.

@vancem

I actually think we can close this issue (because the regression in GetHashCode is really by design).

After reading “Thus .NET Core’s time is 1.23X the Desktop time (23% slower)”, I believe that developers will postpone migration to .NET Core; this is too noticeable a slowdown. I expect that the .NET Core team will mobilize and make a significant investment to find bottlenecks and remedy them. This issue alone shows a two-fold slowdown in GetHashCode(). Perhaps this is not a key issue for PowerShell, but it is definitely a problem for .NET Core as a whole.

@vancem

I will take a look.

Many thanks! This issue is for tracking GetHashCode(); I suggest moving to https://github.com/PowerShell/PowerShell/issues/7112

If the code (I mean the cmdlets) has not changed but runs twice as slowly, then it is all the more necessary to look into .NET Core.

@danmosemsft This issue is about GetHashCode() being slower on .NET Core vs .NET Framework; the PowerShell issue has shed some light on it, but I think it is independent of this. It may very well be that the PowerShell issue is really a set of issues, but I would think we want to improve GetHashCode on .NET Core, both by improving the code quality and, in my opinion, by also providing the ability to quirk it off if needed.

In that sense I think closing this issue is premature.

In the original issue opened by @iSazonov, he noticed that Import-Csv was about 2x slower for a large (>100MB) CSV file. However, we are talking about ~3.7s to ~6s in absolute terms. GetHashCode() is used quite a bit in the PowerShell code base. However, I can certainly see the security benefit. @iSazonov are you ok with the perf degradation for your specific use case? If we find more common scenarios with significant impact, we should revisit this.