runtime: FirstOrDefault after OrderBy invokes predicate for every element (breaking change)

We have a bit of code to reserve an account from an available pool, which looks like this:

var account = accounts.OrderBy(x => x.UsageCount).FirstOrDefault(x => x.TryReserve(token));

After porting our code from .NET Framework to .NET Core, this now invokes the predicate method for every item in the list. In practice, this code now reserves ALL accounts and then returns the first.

The problem is with the OrderedEnumerable.TryGetFirst method, which is invoked from the FirstOrDefault extension (presumably as a performance optimization). It orders items lazily and skips elements that do not match the predicate, which has the side effect of invoking the predicate once per element.
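To make the reported behavior concrete, here is a minimal, self-contained sketch. The account pool is simulated with plain arrays, and the always-true reserving lambda stands in for the hypothetical TryReserve call; none of these names come from the original code.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the account pool: a usage count and a
// "reserved" flag per account; the lambda below simulates TryReserve.
int[] usage = { 4, 0, 3, 1, 2 };
bool[] reserved = new bool[usage.Length];

// On .NET Framework the predicate runs until the first match (one
// reservation); on affected .NET Core versions the merged
// OrderBy+FirstOrDefault path invokes it once per element.
int picked = Enumerable.Range(0, usage.Length)
    .OrderBy(i => usage[i])
    .FirstOrDefault(i => { reserved[i] = true; return true; });

Console.WriteLine($"picked account {picked}; reserved {reserved.Count(r => r)} of {usage.Length}");
```

The returned element is the same on both runtimes; what differs is how many accounts end up reserved as a side effect.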

Given how ubiquitous .OrderBy(...).FirstOrDefault(...) is in code bases everywhere, I am surprised that this change in behavior was acceptable. While it’s true that you should be careful with predicates with side effects, coalescing the two operations into one significantly changes what the code does (it is not ordering followed by filtering, as the code reads, but an amalgam of the two). Additionally, it’s subtle and may go undetected because the code compiles just fine - the worst kind of breaking change.

@stephentoub

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 40
  • Comments: 81 (51 by maintainers)

Most upvoted comments

I’m hesitant to change this.

I understand the concern, but having to maintain the exact number of times a predicate is invoked in something like LINQ is incredibly limiting. Consider OrderBy itself: saying we must invoke its comparer function as it was in the past means we couldn’t make any changes to the sort algorithm at all ever. I realize that’s not the exact case being discussed here, but it’s similar in nature: the concern is that we’re invoking a predicate differently than we did in the past.

The change has also been around for years, since the beginning of .NET Core, and even with the extensive use cited as motivation, this is the only complaint I’ve heard about it.

On top of all that, the perf implications are significant. This change effectively made something that was O(N log N) or O(N^2) into something that’s O(N). Reverting it now could absolutely negatively impact code.

I’d strongly recommend not using side-effecting functions with LINQ.

Two things are merged into a single operation, for no other reason than performance. This is a slippery slope and not a good thing IMHO.

Say I have a .netstandard 2.0 library with code like:

var x = someEnumerable.OrderBy(a => a.Field)
        .FirstOrDefault(b => SomeClass.SomeVeryExpensivePredicateMethod(b));

Say someEnumerable has 1000 elements and ordering it takes 1 ms on netfx (I know that’s dead slow; it’s for simplicity). The SomeClass.SomeVeryExpensivePredicateMethod(b) call takes 100 ms. On netfx this whole operation takes 1 ms + 100 ms = 101 ms.

Running my .netstandard library in an app on .NET Core, however, will now take 1000 × 100 ms + 1 ms ≈ 100 s.

No side-effecting method is used: just a read-only predicate that does nothing with the argument passed in. Same IL. This is the problem. You merged two distinct methods for an optimization, and that merge has an observable effect. The optimization is understandable, and for a long time no one filed an issue about it. (There might have been people running into this; remember: for every person filing an issue, there are numerous others who ran into the same problem but didn’t file one.)
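The invocation-count difference this example describes can be observed directly with an instrumented predicate. This is a sketch: the counter stands in for the hypothetical SomeVeryExpensivePredicateMethod, and the data is illustrative.

```csharp
using System;
using System.Linq;

// Instrumented always-true predicate; the counter simulates the cost of
// the expensive method from the comment above.
int calls = 0;
bool ExpensivePredicate(int x) { calls++; return true; }

int[] data = Enumerable.Range(0, 1000).Reverse().ToArray();

int result = data.OrderBy(x => x).FirstOrDefault(x => ExpensivePredicate(x));

// .NET Framework: 1 call (sort first, then probe until the first match).
// Affected .NET Core versions: up to data.Length calls, one per element.
Console.WriteLine($"result={result}, predicate calls={calls}");
```

With a 100 ms predicate, that call-count difference is exactly the 101 ms vs 100 s gap described above.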

Of course there are workarounds, but that’s not the point. LINQ to Objects has a clear design: enumerable.ExtensionMethodA().ExtensionMethodB()....ExtensionMethodZ();, where each extension method works on the sequence passed in. Here, you merged two of these into one: not in the public API, but under the hood, for performance. While understandable, this isn’t right, as in my example where the same IL works fine on netfx but performs terribly on .NET Core due to this optimization.

As a library author of .netstandard libraries I need to be able to rely on APIs having the same net effect, like Linq to objects’ extension methods. Mind you, I target netstandard APIs here, and the behavior changes based on the platform I run on. That’s a red flag.

The only way to keep the optimization is to push it into Roslyn: if the argument of FirstOrDefault isn’t a method call, call the optimized method; otherwise keep the ‘old’ behavior as-is. Only then can you avoid behavior changes and terrible performance dips (like the one I showed above) in e.g. netstandard-targeting libraries.

(What surprises me a bit is that this is still a debate. Two distinct things were merged for performance (while good API design dictates that every method should do one thing and one thing only), and the merge has observable effects. The right thing to do is to un-merge them to avoid those effects.)

All that said, while I’m personally less motivated by the side-effecting case, it is probably worth looking at it again from a performance perspective. FirstOrDefault with no predicate is hard to argue with from a perf perspective. FirstOrDefault with a predicate now ends up invoking the predicate N times whereas previously it was likely to invoke it a small number of times, and if that predicate performed a non-trivial computation, it could easily outweigh the gains from a small-to-medium size input being sorted.

By the way, thanks to everyone at Microsoft for listening here. I understand that it can be frustrating when optimizations are undone by people writing code that can be viewed as “misbehaving”, especially when the changes have been in effect for several versions now. At the end of the day I think the right choice was made to keep a behavior that was at least consistent and free of “pockets of different behavior” that people need to pay attention to.

I have enormous respect for @stephentoub for what he has accomplished over the years, from TPL, CDS and TPL Dataflow to all manner of multi-threaded and concurrent library work and now a general eye on performance across all of the BCL, and I understand that it can’t have been very pleasant to have people descend on you in the way this thread unfolded - you have my deepest gratitude for all your work over the years and .NET would be a much less hospitable environment without technologies you have been involved with and been one of the public faces for. Thanks also to @terrajobst who posted the thread, which is how I found this.

I agree that the optimization is surprising. But I would also be surprised by an optimization to .OrderBy(selector).Where(predicate) in which the predicate is executed first. As previously suggested, if we can’t know whether the sorting or the filtering is more expensive, we should respect the decision of the developer to sort first.

Would .Where(predicate).OrderBy(selector).FirstOrDefault() effectively allow us to “opt in” to the .OrderBy(selector).FirstOrDefault(predicate) optimization? Perhaps an analyzer could propose the former when it sees the latter?
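As a sketch of that suggested rewrite (values and predicate are illustrative), filtering before ordering pins the predicate to exactly one call per element on every runtime, because OrderBy drains its source before yielding:

```csharp
using System;
using System.Linq;

// Counting predicate over illustrative data.
int calls = 0;
bool IsEven(int x) { calls++; return x % 2 == 0; }

int[] data = { 5, 2, 8, 3, 4 };

// Where runs the predicate exactly once per element on every runtime,
// and OrderBy then sorts only the matching subset.
int first = data.Where(x => IsEven(x)).OrderBy(x => x).FirstOrDefault();

Console.WriteLine($"first={first} after {calls} predicate calls");
```

This form also sorts fewer elements when the predicate is selective, so it can be a win independent of the behavioral question.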

@jzabroski I remember code contracts, and well understand the difference between an object and a function 😉

I also keep seeing the notion that developers are using LINQ “wrong”. This is, IMO, nonsense. People are using it like this, and it has worked in a certain way for a long time. Years. There’s no protection from the compiler, nothing. Imagine some delegate in there that deleted the first customer in a table from the yielded sequence and then returned true. It worked on .NET Framework. Run it on Core: boom. And you won’t even know it until that code is run - maybe weeks or even months after the migration. You might not even know after the code goes wrong, until some report is generated showing customers have gone missing. MS: can you imagine the cost, effort and angst required for one of your customers to realise what has happened?

I think the fundamental issue here is this: “Is it worth getting better performance for most users, at the cost of some users getting completely unexpected - potentially destructive - behaviour?”

I don’t have any skin in the game - I don’t even use LINQ any more - but I remember happily using it for years and this change seems like a high risk one to me.

I’m really on the fence with this. On one side, I really like performance improvements and gladly accept a few oddities with them. On the other hand, as some examples here show, these performance improvements really depend on your use case: if the OrderBy is expensive, then this change makes it perform better; but if the FirstOrDefault predicate is expensive, then that simply isn’t true.

What bothers me most is that it conflicts with the very common concept that is used to explain LINQ: Every LINQ method operates independently in a pipeline where one method passes items to the next. In that world, FirstOrDefault() could be written like this (simplified):

public static T FirstOrDefault<T>(this IEnumerable<T> source, Func<T, bool> predicate)
{
    foreach (var item in source)
        if (predicate(item))
            return item;
    return default(T);
}

In fact, writing such simplified implementations yourself is a very common way to learn LINQ and is often used to teach the idea that each method operates on its own.

But this optimization now breaks that assumption: the combination OrderBy().FirstOrDefault() works differently, in a way that is difficult to comprehend and not intuitive compared to the other parts. I agree with others here that the expectation for IQueryable<> is that the provider will happily combine the query so that it can work with it most efficiently. After all, I’m writing a declarative query there. But I don’t think the same is necessarily true for IEnumerable<>. I feel like this optimization should rather be a new OrderedFirstOrDefault method that makes it clear that this is a combined operation.

As for why nobody ever said anything about this in four years: I think it’s simply because you have to actively look for it to find out about this behavior. Unless you actually hit a performance issue or have avoidable side effects, you simply won’t see the changed behavior. Yes, in those cases it likely doesn’t matter either way, but if you do hit this issue, then I think the current implementation is not really intuitive.

saying we must invoke its function as it was in the past means we couldn’t make any changes to the sort algorithm at all ever

I’m hesitant to object here because I know you’re a lot smarter than me, but I don’t see why the predicate passed to First(OrDefault) should have any influence on the ability to change the sort algorithm in OrderBy.

the concern is that we’re invoking a predicate differently than we did in the past.

My personal concern is about readability. LINQ used to be a sequential method chain, where you could reliably consider each step separately. With this change, you can no longer decipher what will happen purely by reading the code - you must know the implementation as well.

The change has also been around for years, since the beginning of .NET Core, and even with the extensive use cited as motivation, this is the only complaint I’ve heard about it.

The danger of silently breaking existing behavior is that people may not have discovered the problem, or won’t discover it until after it has caused them trouble.

On top of all that, the perf implications are significant. This change effectively made something that was O(N^2) into something that’s O(N). Reverting it now could absolutely negatively impact code.

Even if there is some gain to be made by combining the two (as the current implementation does), it should rightly be a new method (e.g. FirstWithOrderBy) instead of a replacement (because the behavior is different, breaking and non-intuitive).

I’d strongly recommend not using side-effecting functions with LINQ.

This was always true but the code is succinct and readable, which to a lot of folks matters more than purity. I expect there are thousands of custom ForEach implementations for the exact same reason.

Please reconsider.

Just as an aside to this - I created a repository here and did some basic property-based tests against a few methods (Select, Where, FirstOrDefault, LastOrDefault, SingleOrDefault) to compare old and new implementations (I basically ripped out the old implementations from here).

Interestingly, whilst FirstOrDefault exhibits different behaviour on an ordered enumerable, LastOrDefault exhibits different behaviour on unordered enumerables: the netcore implementation is quicker in that it appears to start from the end of the list and stops when it finds the first match, whereas the old implementation simply traversed the entire collection.

In this case, people might not mind because there’s an efficiency gain, but it’s still a change in behaviour in terms of calling the supplied higher-order function.

Results below along with minimal test cases (although the predicates that generated the results are not shown).

Running tests!
[17:03:00 INF] EXPECTO? Running tests... <Expecto>
[17:03:00 DBG] All.Deterministic tests..NET Core.FirstOrDefault is deterministic starting... <Expecto>
[17:03:00 DBG] All.Deterministic tests..NET Core.LastOrDefault is deterministic starting... <Expecto>
[17:03:00 DBG] All.Deterministic tests..NET Core.SingleOrDefault is deterministic starting... <Expecto>
[17:03:00 DBG] All.Deterministic tests..NET Core.LastOrDefault is deterministic passed in 00:00:00.3020000. <Expecto>
[17:03:00 DBG] All.Deterministic tests..NET Core.FirstOrDefault is deterministic passed in 00:00:00.3360000. <Expecto>
[17:03:00 DBG] All.Deterministic tests..NET Core.Select is deterministic starting... <Expecto>
[17:03:00 DBG] All.Deterministic tests..NET Core.SingleOrDefault is deterministic passed in 00:00:00.4200000. <Expecto>
[17:03:00 DBG] All.Deterministic tests..NET Core.Where is deterministic starting... <Expecto>
[17:03:00 DBG] All.Deterministic tests..NET Framework.FirstOrDefault is deterministic starting... <Expecto>
[17:03:00 DBG] All.Deterministic tests..NET Framework.LastOrDefault is deterministic starting... <Expecto>
[17:03:00 DBG] All.Deterministic tests..NET Framework.SingleOrDefault is deterministic starting... <Expecto>
[17:03:01 DBG] All.Deterministic tests..NET Framework.Select is deterministic starting... <Expecto>
[17:03:01 DBG] All.Deterministic tests..NET Core.Where is deterministic passed in 00:00:00.7540000. <Expecto>
[17:03:01 DBG] All.Deterministic tests..NET Framework.FirstOrDefault is deterministic passed in 00:00:00.3030000. <Expecto>
[17:03:01 DBG] All.Deterministic tests..NET Framework.LastOrDefault is deterministic passed in 00:00:00.7490000. <Expecto>
[17:03:01 DBG] All.Deterministic tests..NET Framework.SingleOrDefault is deterministic passed in 00:00:00.7800000. <Expecto>
[17:03:01 DBG] All.Deterministic tests..NET Framework.Where is deterministic starting... <Expecto>
[17:03:01 DBG] All.Deterministic tests..NET Core.Select is deterministic passed in 00:00:01.1690000. <Expecto>
[17:03:01 DBG] All.Comparing .NET Core to .NET Framework.Single method.FirstOrDefault has not changed starting... <Expecto>
[17:03:01 DBG] All.Comparing .NET Core to .NET Framework.Single method.LastOrDefault has not changed starting... <Expecto>
[17:03:01 DBG] All.Comparing .NET Core to .NET Framework.Single method.SingleOrDefault has not changed starting... <Expecto>
[17:03:01 DBG] All.Deterministic tests..NET Framework.Select is deterministic passed in 00:00:00.6910000. <Expecto>
[17:03:01 DBG] All.Deterministic tests..NET Framework.Where is deterministic passed in 00:00:00.3960000. <Expecto>
[17:03:01 DBG] All.Comparing .NET Core to .NET Framework.Single method.Select has not changed starting... <Expecto>
[17:03:02 DBG] All.Comparing .NET Core to .NET Framework.Single method.Where has not changed starting... <Expecto>
[17:03:02 DBG] All.Comparing .NET Core to .NET Framework.Single method.FirstOrDefault has not changed passed in 00:00:00.0770000. <Expecto>
[17:03:02 DBG] All.Comparing .NET Core to .NET Framework.OrderedEnumerable.FirstOrDefault has not changed on OrderedEnumerable starting... <Expecto>
[17:03:02 ERR] All.Comparing .NET Core to .NET Framework.Single method.LastOrDefault has not changed failed in 00:00:00.3620000. 
Failed after 7 tests. Parameters:
        [|null; '#'; ""; "d
                           :"|] <fun:Invoke@2712>
Shrunk 4 times to:
        [|true; true|] <fun:Invoke@2712>
Result:
        Exception
  Expecto.AssertException: Both results should be the same.
expected:
Ok { CallCount = 2
     Result = true }
  actual:
Ok { CallCount = 1
     Result = true }
   at Expecto.Expect.equalWithDiffPrinter@339-15.Invoke(String msg)
   at Expecto.Expect.equalWithDiffPrinter$cont@321[a](FSharpFunc`2 diffPrinter, a actual, a expected, String message, Object e, Object a, Unit unitVar)
   at Program.quickCompare@29TTTTTTD.Invoke(FSharpFunc`2 higherOrderFunction) in C:\Users\Isaac\Source\Repos\testlinq\Tests\Program.fs:line 36
   at FsCheck.Testable.evaluate[a,b](FSharpFunc`2 body, a a)
Focus on error:
        testPropertyWithConfigStdGen (0, 0) "LastOrDefault has not changed" <Expecto>
[17:03:02 DBG] All.Comparing .NET Core to .NET Framework.Single method.SingleOrDefault has not changed passed in 00:00:00.2350000. <Expecto>
[17:03:02 DBG] All.Comparing .NET Core to .NET Framework.Single method.Select has not changed passed in 00:00:00.5420000. <Expecto>
[17:03:02 DBG] All.Comparing .NET Core to .NET Framework.Single method.Where has not changed passed in 00:00:00.3300000. <Expecto>
[17:03:02 DBG] All.Comparing .NET Core to .NET Framework.OrderedEnumerable.Select has not changed on OrderedEnumerable starting... <Expecto>
[17:03:02 DBG] All.Comparing .NET Core to .NET Framework.OrderedEnumerable.LastOrDefault has not changed on OrderedEnumerable starting... <Expecto>
[17:03:02 DBG] All.Comparing .NET Core to .NET Framework.OrderedEnumerable.SingleOrDefault has not changed on OrderedEnumerable starting... <Expecto>
[17:03:02 ERR] All.Comparing .NET Core to .NET Framework.OrderedEnumerable.FirstOrDefault has not changed on OrderedEnumerable failed in 00:00:00.0640000.
Failed after 7 tests. Parameters:
        [|0; -1; 0; 2|] <fun:Invoke@2712>
Shrunk 3 times to:
        [|-1; 0|] <fun:Invoke@2712>
Result:
        Exception
  Expecto.AssertException: Both results should be the same.
expected:
Ok { CallCount = 1
     Result = -1 }
  actual:
Ok { CallCount = 2
     Result = -1 }
   at Expecto.Expect.equalWithDiffPrinter@339-15.Invoke(String msg)
   at Expecto.Expect.equalWithDiffPrinter$cont@321[a](FSharpFunc`2 diffPrinter, a actual, a expected, String message, Object e, Object a, Unit unitVar)
   at Program.quickCompare@29TTTTTTD.Invoke(FSharpFunc`2 higherOrderFunction) in C:\Users\Isaac\Source\Repos\testlinq\Tests\Program.fs:line 36
   at FsCheck.Testable.evaluate[a,b](FSharpFunc`2 body, a a)
Focus on error:
        testPropertyWithConfigStdGen (0, 0) "FirstOrDefault has not changed on OrderedEnumerable" <Expecto>
[17:03:03 DBG] All.Comparing .NET Core to .NET Framework.OrderedEnumerable.Select has not changed on OrderedEnumerable passed in 00:00:00.2120000. <Expecto>
[17:03:03 DBG] All.Comparing .NET Core to .NET Framework.OrderedEnumerable.LastOrDefault has not changed on OrderedEnumerable passed in 00:00:00.0540000. <Expecto>
[17:03:03 DBG] All.Comparing .NET Core to .NET Framework.OrderedEnumerable.SingleOrDefault has not changed on OrderedEnumerable passed in 00:00:00.0710000. <Expecto>
[17:03:03 DBG] All.Comparing .NET Core to .NET Framework.OrderedEnumerable.Where has not changed on OrderedEnumerable starting... <Expecto>
[17:03:04 DBG] All.Comparing .NET Core to .NET Framework.OrderedEnumerable.Where has not changed on OrderedEnumerable passed in 00:00:00.0350000. <Expecto>
[17:03:04 INF] EXPECTO! 20 tests run in 00:00:04.2636672 for All - 18 passed, 0 ignored, 2 failed, 0 errored.  <Expecto>

  • The ask is now to change a behavior that has existed since .NET Core 1.0, which would impact people going from an older version of .NET Core to a newer one.

I don’t believe this is correct. .NET Core developers cannot use lambdas with side-effects precisely because there is no guarantee how many times they will be invoked. Adding back that guarantee will thus not break existing code.

My point is that code often relies on side effects by accident. People test their code in the context of a given framework, and that assumption might break later. Within .NET Framework we wouldn’t have made this change in either direction, because it’s likely breaking someone. On .NET Core I can see us making that change in either direction, which is why I said we might convince ourselves that it’s an acceptable breaking change to undo the optimization. Breaking changes need to be properly motivated and significantly improving perf is a good justification but so is breaking less code.

It seems your arguments are:

  1. It broke my code so it’s likely affecting other people’s code too. Given the current behavior shipped four years ago, the number can’t be that high or we would have heard about it more often.

  2. It’s not intuitive. I can buy that the behavior can be surprising, but we need to consider the larger picture here. You’re asking us to change LINQ so that the predicate passed to FirstOrDefault() is only executed once. What about all the other methods that accept lambdas? Why is FirstOrDefault() special? If we want a consistent set of rules that customers can reason about then we’d have to document each and every method and disallow any optimization that would shuffle operators around. I don’t think we’ll want that. If anything, we’d want more flexibility. Other people have asked for the compiler to “see through” lambdas and create more efficient code by lowering things into for loops and inline lambdas. Looking at the other examples of LINQ (EF, or more generally query plans in SQL engines) the norm is that queries aren’t an implementation spec but a declarative description. Preserving side effects seems to me to be too limiting to be a good design choice here.

If I’m reading it right, we made an optimization for ordered enumerables that leads to the predicate passed to FirstOrDefault being run over more than just the first few items (all of them, by the sounds of it?). I can absolutely understand both arguments here.

I worry about using the argument that everything is a breaking change to justify a breaking change. I think we should be practical in our application of data and risk to decide which breaking changes add value to the ecosystem, and which don’t.

We can get a proxy for the risk of breaking folks here by looking at the number of NuGet packages that reference the predicate form of FirstOrDefault and examining the breakdown. It’s not a perfect proxy, since we can’t detect side-effecting predicates, or how they stack with ordering (without a bunch of work). Unfortunately my data source isn’t yet public, so I can’t provide queries, but I’ve taken a stab at breaking out usage across the 3 populations who experience this behavior differently.

There are 5535 package ids that publish a netcoreapp assembly that references this api. There are 11908 package ids with a netstandard asset, and 30049 package ids with a “legacy framework” asset (note that some may carry all 3).

30k is a pretty large number. I tend to empathize with .NET Framework developers a lot in this situation. As .NET Core (particularly .NET 5) seeks to attract such .NET Framework code, I think we need to mitigate this tripping hazard, as it leads to potentially hard-to-diagnose reliability issues (perhaps on both sides of the break). I also hypothesize that we have more LINQ-style code in the .NET Framework group, since LINQ was rolled out and hyped in the 3.5 timeframe, but I don’t yet have specific evidence of this.

However, the fact that there are quite a few netstandard assets suggests that maybe not too many folks are tripping over this (or aren’t really testing NS against all runtimes, or don’t have the susceptible pattern).

I think there’s also a performance angle here. We’ve identified scenarios in which the “performance optimization” seems to be pretty damaging to performance (if I understand correctly), so I think we need to think about that as well when identifying a mitigation.

It would be my opinion that we should match .NET Framework behavior unless there was clear value in not doing it. I haven’t internalized the value in the original change yet. Do we have perf numbers on real workloads (or proxy benchmarks)? Intuitively, running the predicate more seems less desirable.

Two things are merged into a single operation, for no other reason than performance. This is a slippery slope and not a good thing IMHO.

Things like this have been done since the beginning of LINQ. Look, for example, at the reference source implementation of Where and Select; it optimizes Where.Where and Where.Select by merging their implementations, with a subsequent operation detecting the previous one and causing the combined operation to be implemented differently.
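The fusion idea can be sketched with plain delegate composition. This is not the actual Enumerable source (which uses dedicated iterator classes), just the shape of the optimization: two stacked Where stages collapse into one stage whose predicate is the conjunction.

```csharp
using System;
using System.Linq;

// Sketch of operator fusion: instead of stacking two Where stages,
// run a single stage with the conjunction of the two predicates.
Func<int, bool> isEven = x => x % 2 == 0;
Func<int, bool> isLarge = x => x > 2;
Func<int, bool> fusedPredicate = x => isEven(x) && isLarge(x);

int[] source = { 1, 2, 3, 4, 5, 6 };
int[] stacked = source.Where(isEven).Where(isLarge).ToArray();
int[] fused = source.Where(fusedPredicate).ToArray();

// Observable results are identical; only the pipeline shape differs,
// which is why this particular merge is considered safe.
Console.WriteLine(string.Join(",", fused));      // 4,6
Console.WriteLine(stacked.SequenceEqual(fused)); // True
```

The Where.Where merge preserves both the result and the number of predicate invocations, which is exactly the property the OrderBy+FirstOrDefault merge does not have.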

I very much disagree that such merging shouldn’t be done. The real question is when such optimizations impact observable behavior. Note that there have been such optimizations since the beginning of LINQ, where the behavior is observably different from the naive foreach-over-the-source behavior. Consider:

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        IEnumerable<int> list = new MyList() { 1 };
        foreach (int result in list.Select(i => i))
        {
            Console.WriteLine(result);
        }
    }
}

class MyList : List<int>, IEnumerable<int>
{
    IEnumerator<int> IEnumerable<int>.GetEnumerator()
    {
        Console.WriteLine("did this print?");
        yield return 42;
    }
}

My custom enumerable implementation provides its own GetEnumerator implementation, so a foreach-over-the-source implementation would end up printing “did this print?” and “42”. But try it and you’ll see it print “1”, because Select special-cases List<T> and ignores its IEnumerable<T> implementation.

My point is, in all of these cases, there’s often a lot more ambiguity and gray area than the strict purist view cited provides, and we collectively need to make a judgement call about the tradeoffs.

.FirstOrDefault(b=>SomeClass.SomeVeryExpensivePredicateMethod(b));

This is what I raised earlier as a valid concern: https://github.com/dotnet/runtime/issues/31554#issuecomment-628169717

At the same time, there’s nothing in the documentation that says “FirstOrDefault” is only going to execute the delegate until it finds the first. It says that it returns “default(TSource) if source is empty or if no element passes the test specified by predicate; otherwise, the first element in source that passes the test specified by predicate.” So forget OrderBy for a moment; a perfectly conforming implementation of that method alone could execute the delegate for every value and then return the first that passed. Is that a good idea from a performance perspective? Maybe, maybe not.

There might have been people running into this, remember: for every person filing an issue, there are numerous others who ran into the same problem but didn’t file an issue

And for every person filing an issue, there are many more not filing an issue because they’re not facing one and benefiting from the improved performance the change enabled for them.

Mind you, I target netstandard APIs here, and the behavior changes based on the platform I run on. That’s a red flag.

This is an argument against ever taking a bug fix, ever making a performance improvement, ever evolving .NET Core. We do not maintain bug-for-bug compatibility with .NET Framework nor with previous releases of .NET Core. We do want to document places where behavior changes meaningfully so as to help developers migrate and upgrade, but that doesn’t mean we don’t make such changes; it means we’re thoughtful about them, the risks, and the rewards.

So, in case side-effecting/counting behavior is expected from FirstOrDefault, is there a simple workaround?

OrderBy(…).ToArray().FirstOrDefault(…)

?
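A sketch (illustrative values, counting predicate) of what that materializing workaround would do: ToArray severs the connection to the OrderBy fast path, so FirstOrDefault runs over a plain array and stops at the first match on every runtime, at the cost of a full sort plus an allocation.

```csharp
using System;
using System.Linq;

// Counting predicate over illustrative data.
int calls = 0;
bool Pred(int x) { calls++; return true; }

int[] data = { 3, 1, 2 };

// ToArray materializes the sorted sequence; FirstOrDefault over the
// resulting array probes elements in order and stops at the first match.
int result = data.OrderBy(x => x).ToArray().FirstOrDefault(x => Pred(x));

Console.WriteLine($"result={result}, predicate calls={calls}"); // result=1, predicate calls=1
```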

this is just hyperbole

It is not hyperbole, and suggesting that I’m exaggerating or making false claims just makes me want to stop participating in an otherwise healthy discussion. So I’m going to pause on this for now.

Hmm. So performance trumps the safety of being able to depend on an API doing what it should, because otherwise the platform can’t move forward? Because in the end, that’s the net result.

As a developer of netstandard-targeting libraries I’m seriously disappointed in this. For bugfixes, platform limitations/aspects… you do what you can, and these things happen. Here, the optimization is very clever, and if there are no side effects, great. But there are side effects. What I expected to see here was the .NET team realizing they had overlooked a situation (hey, we’re all human) and looking for a way to mitigate the side effect and at least make user code work as expected.

Instead, reluctance and excuses are given. I don’t see how that is helping anyone.

@stephentoub That’s all fine and good; the thing at hand here is that code which reads as-is behaves differently based on the runtime it’s run on, and not because of platform limitations or missing APIs, but for the sole reason of performance shortcuts. A .netstandard 2.0 library behaves differently on different platforms while the code is using default implementations. Not custom IEnumerables, but default extension methods.

Optimizations are great, but if you change behavior because of them, that’s a bridge too far. It’s like cutting a corner in the hope no one will notice, and when someone does, a series of excuses is started. That’s what struck me the most in this thread: that it’s highly unlikely this behavior change is going to be reverted. What other behavior-changing optimizations don’t we know about that might hurt us big time in some situation? You glossed over it, but it’s key that I can rely on defined behavior not changing because of some shortcut taken on some platform.

This is an argument against ever taking a bug fix, ever making a performance improvement, ever evolving .NET Core. We do not maintain bug-for-bug compatibility with .NET Framework nor with previous releases of .NET Core. We do want to document places where behavior changes meaningfully so as to help developers migrate and upgrade, but that doesn’t mean we don’t make such changes; it means we’re thoughtful about them, the risks, and the rewards.

I’m sorry, but this is just hyperbole. This isn’t about a bugfix; this is about a behavioral change in a .netstandard API depending on the runtime it runs on. If behavior needs changing, then APIs need changing; isn’t that the usual approach? Yes, netfx is old cruft and you might see it as limiting, but sorry to break it to you: there are many people out there who still have to support that framework. .netstandard helps a great deal with that.

Until behavior changes like this pop up, that is, making the release of a single codebase compiled to .netstandard a shaky business: when a user of that library runs into this issue, they’re not going to notify you. They’ll notify the library author.

I very much disagree that such merging shouldn’t be done. The real question is when such optimizations impact observable behavior. Note that there have been such optimizations since the beginning of LINQ, where the behavior is observably different from the naive foreach-over-the-source behavior.

This is fair and the observable behavior is the important part. When .Any() checks if it’s a list and whether its Count is larger than zero, that’s a great optimization because it does the obvious thing without impacting the user’s expectations or the method’s contract. Same with various classes to merge where+select pipeline stages, and so on.

I’m personally far from a purist because the purist would say “this is a declarative function that’s only well-defined for non-side-effecting functions, and if you have side-effecting functions you deserve whatever madness comes your way”.

The bit that worries me isn’t that an optimization was made but that I can’t find any case where the post-optimization behavior is what anyone who uses a side-effecting function would ever expect. In other words: I think it’s hard to find code that uses side effects and safeguards against this change, and very easy to find uses where the stop-immediately behavior is assumed. Is that part of the method’s contract? Maybe not explicitly. The iteration variable in C# foreach statements not being part of the loop’s scope was correct according to the language specification too, but it broke people’s brains and indirectly caused people to write bugs in a similar way.

Consider also that Roslyn now has code fixes to turn loops into LINQ and LINQ into loops. That strengthens the already strong assumption that LINQ’s observable behavior is “like loops”, that xyz.OrderBy(...).FirstOrDefault(predicate) is equal to foreach (var z in xyz.OrderBy(...)) { if (predicate(z)) { result = z; break; } }.

(As additional appeal to this being “in the water”, if maybe not strictly “on the books”, Jon Skeet’s EduLINQ reimplementation literally has a test checking for this.)
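To make the disputed behavior concrete, here is a small self-contained sketch (the input values and the counting probe are mine, not from the thread) that counts predicate invocations:

```csharp
using System;
using System.Linq;

class PredicateCountDemo
{
    static void Main()
    {
        var items = new[] { 5, 3, 8, 1, 9, 2 };
        int calls = 0;

        var first = items.OrderBy(x => x)
                         .FirstOrDefault(x => { calls++; return x >= 3; });

        // Sorted order is 1, 2, 3, 5, 8, 9, so the result is 3 either way.
        // Under "sort, then stop at the first match" semantics the predicate
        // runs 3 times; under the fused optimization reported in this issue
        // it can run once per element, i.e. 6 times.
        Console.WriteLine($"result={first}, calls={calls}");
    }
}
```

The returned value is identical under both behaviors; only the number of predicate invocations, and hence any side effects, differs.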

The primary cost here to me is that these semantics will likely be surprising to 99%+ of developers. I just hope decision makers here take this into account if they decide to value the potential performance gain higher.

I agree with @stephentoub and said this earlier:

Looks like that change was included since .NET Core 1.0, changing that now would have issues for folks already on .NET Core.

Given that behavior is in the product for so long and this is the first time I’ve heard about it, I’m not sure it’s that impactful to be honest.

On top of the arguments that were made I’d like to add this:

  • This current behavior impacts people when moving from .NET Framework to .NET Core.
  • The ask is now to change a behavior that exists since .NET Core 1.0, which would impact people going from an older version of .NET Core to a newer version.

Generally speaking, moving between different frameworks is always a breaking change, while we try hard to make moving to a later version of the same framework a smooth experience with as few breaks as possible, ideally none.

And the argument why that is a breaking change in .NET Core is for the same reason: if the predicate has side effects it can alter the behavior of your program. However, even if we were to accept that as a less impactful breaking change it would still result in a perf regression, which can be significant. Like @stephentoub I find myself not eager to accept that fallout.

On top, there is the question what guarantees we want to provide for LINQ. It seems indeed too limiting to say that order of operations and number of calls of provided callbacks can’t change.

@tibel

A query describes what you want and not how to get there.

There is an immense amount of value in being able to look at a piece of code and intuitively understand what is happening - or at the very least, not be unpleasantly surprised by it.

First and FirstOrDefault strongly communicate that they are NOT going to iterate the full sequence, and by implication, that any delegate passed will not be invoked for every element.

Same should be true for LINQ. The runtime is allowed to change the execution plan as long as the result is the same.

There is no ask here to prevent other optimizations. Merely that methods that imply they are not needing the full sequence also don’t visit the full sequence (unless required).

Using expressions that mutate state in LINQ are a mistake in my opinion.

And you’d usually be right, but there are cases where it makes for succinct and declarative code.

And as others have pointed out, if the cost of the delegate is non-trivial you could end up with code that was both slower and surprising in behavior. An optimization that is not a clear performance win, introduces pitfalls and reduces readability should not replace the default behavior, but instead be a new method that interested parties can opt-in to.
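One way to make such an opt-in explicit today is a hand-rolled extension method (the name FirstOrDefaultSequential is hypothetical, not an existing API) that guarantees in-order enumeration and early termination regardless of what the underlying LINQ implementation fuses:

```csharp
using System;
using System.Collections.Generic;

public static class SequentialEnumerableExtensions
{
    // Enumerates strictly in sequence order and stops at the first element
    // matching the predicate. Because it only uses the enumerator, no
    // OrderBy/First fusion inside the LINQ implementation can change how
    // often the predicate runs.
    public static T FirstOrDefaultSequential<T>(
        this IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (var item in source)
        {
            if (predicate(item))
                return item;
        }
        return default;
    }
}

// Usage (mirroring the original report):
//   accounts.OrderBy(x => x.UsageCount)
//           .FirstOrDefaultSequential(x => x.TryReserve(token));
```

OrderBy still sorts the full source before yielding, but the predicate is only invoked until the first match.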

Just throwing a hat in the ring here. I’ve been hit by this change in the past, trying to do something very similar to what @mnmr originally reported (from a list of workers ordered by calculated priority, lock/reserve the highest-priority worker and return it). Using linq made for succinct, readable code that unfortunately didn’t work. Following @stephentoub’s suggestion of adding ToArray/ToList would certainly work. However, no one who doesn’t know about this specific edge case is going to know why that call to ToArray/ToList is there, and is likely to remove it thinking they’re saving some cycles. I could also just avoid using linq, but being able to use linq to express a solution like this is what makes me love c#.

Just doing a quick poll around my own network, only two devs could tell what the actual execution count of the predicate was going to be. One was sitting with me when I got hit the first time, and the other only because they’d already seen @terrajobst twitter thread.

This feels like a hidden sharp edge in linq that may be solving a problem that primarily exists in benchmarks, though of course that’s just my opinion. I don’t think I’ve ever seen too many places where in memory sorting wasn’t immediately preceded by filtering the list so only necessary items were sorted in the first place.

On .NET Core I can see us making that change in either direction, which is why I said we might convince ourselves that it’s an acceptable breaking change to undo the optimization.

What particular implementation detail could .NET Core developers rely on here that would make reverting a breaking change?

The only thing I can think of is if you have code that relied on the lambda passed to First being called at least once for every item, despite only returning the first element. But if I had code like that I would want you to break it for me.

Breaking changes need to be properly motivated and significantly improving perf is a good justification but so is breaking less code.

As Stephen said: the additional performance gains from special-case handling of OrderBy(…).First(…) could very well end up being lost if the lambda passed to First has any non-trivial code in it. So the net gain in performance is questionable. I have no objections to optimizations made purely to the OrderBy implementation.

Why is FirstOrDefault() special?

But that’s my point exactly. Why have special-case handling for First-and-friends that behaves differently from (a) what it used to be and (b) what is intuitive behavior from looking at the code.

Injecting ToList/ToArray between the two operations, as the current workaround, should really be a no-op (except for the expected performance hit).

If we want a consistent set of rules that customers can reason about then we’d have to document each and every method and disallow any optimization that would shuffle operators around.

The ask here is only that terminating expressions that do not visit all items “by nature” (so pretty much only First and FirstOrDefault) also behave as such.

For IQueryable there is an implicit understanding that you are writing LINQ expressions that are then translated into something else, e.g. SQL. I don’t think it makes sense to consider them under the same umbrella as plain LINQ, where there is no expression tree translation involved (nor do users expect one to be).

Agree, I aspire to be as good an engineer as @stephentoub .

The ask here is only that terminating expressions that do not visit all items “by nature” (so pretty much only First and FirstOrDefault) also behave as such.

Now we’re talking. That answers my question of why you think FirstOrDefault is being special. I actually think that’s an interesting point.

I fully agree with this point. It’s reasonable to infer, from a lazily evaluated sequence, that something which grabs the first element stops when it has the first element and this behavior has been the actual behavior of most or all LINQ operators of this kind since LINQ was introduced. This ticket is frightening because it’s as if the && operator sometimes stopped being short-circuiting due to certain optimizations. FirstOrDefault and its ilk are even called LINQ operators, and this turns into “spooky action at a distance” when things jump from one place to another in the pipeline.

If it was AsParallel()..., you would have opted into a different behavior (PLINQ) which undoes those expectations. Here, no such behavior is desired, and there’s a long history of the behavior since .NET 3.5. If this behavior is kept there really needs to be a readily available analyzer flagging this - this is a landmine, and I think the compatibility concerns should be reconsidered due to being so close to a bug (since the change in behavior wasn’t announced), due to Microsoft now heavily pushing everyone to move to .NET Core/.NET 5 and due to compatibility concerns therefore going at least as strongly the other way. To say nothing of the major confusion .NET Standard libraries will be in.

I don’t see why the predicate passed to First(OrDefault) should have any influence on the ability to change the sort algorithm in OrderBy.

I was referring to the predicate passed to OrderBy itself. It shouldn’t be any more or less special than other predicates in LINQ.

LINQ used to be a sequential method chain, where you could reliably consider each step separately.

All sorts of LINQ implementations do optimizations that involve rearranging the query, e.g. Entity Framework, and more generally most any LINQ implementation based on queryables.

The danger of silently breaking existing behavior is that people may not have discovered the problem, or won’t discover it until after it has caused them trouble.

This is true for practically any bug fix, functional, performance, or otherwise. Developers can take dependencies on all sorts of things that weren’t intended, weren’t documented, were intended as an implementation detail… something behaved one way and they took a dependency on it behaving that way.

I expect there are thousands of custom ForEach implementations for the exact same reason.

It’s also why ForEach has been proposed numerous times and has never been added to Enumerable: Enumerable isn’t meant to be used with side-effecting callbacks.

Please reconsider.

That’s what we’re doing by having this discussion.

The sort operation specified with OrderBy is evaluated at enumeration of the sequence, not when the method is called, so no sorting takes place in the situation of e.OrderBy(keySelector).FirstOrDefault(predicate) until FirstOrDefault is called!

This lazy evaluation is prevalent throughout most of LINQ and LINQ would be very different without it. FirstOrDefault/First/SingleOrDefault/Single/ToArray/ToList/ToSomething return definite values and therefore trigger enumeration of the sequence to some degree, and most things that don’t just add another step for the enumeration to perform. Adding LINQ operators like GroupBy, Where, OrderBy, Join and so on all wrap the current “query object” into a new “query object”, and enumerating means asking the previous step to enumerate as many items as are required (if it’s Where, you can stream just one item at a time, if it’s OrderBy or GroupBy which need most or all of the information, it has to enumerate more things) and then do the thing the current step is responsible for.

The prevalence of lazy evaluation is one of the great things about LINQ - you can build up a pipeline that runs as little as possible. And since you do it in a declarative way, just by switching to another type, the query can be translated (IQueryable), or you can opt into PLINQ with AsParallel and get a different behavior. Even if it wasn’t language-integrated it’d still be a very well designed API.
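A minimal illustration of that deferral (the example data and the flag are mine): building the query runs no user code; only enumeration does.

```csharp
using System;
using System.Linq;

class DeferredExecutionDemo
{
    static void Main()
    {
        var data = new[] { 3, 1, 2 };
        bool keySelectorRan = false;

        // OrderBy only builds a "query object"; the key selector has not run.
        var ordered = data.OrderBy(x => { keySelectorRan = true; return x; });
        Console.WriteLine(keySelectorRan); // False

        // First() triggers enumeration, which performs the sort.
        var first = ordered.First();
        Console.WriteLine($"{keySelectorRan} {first}"); // True 1
    }
}
```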

While I wouldn’t be worried about a breaking change, I agree that .FirstOrDefault strongly implies that the operator is going to iterate the collection only until it finds the right element. This feeling is strengthened by my previous experience with Linq. I tend to picture Linq queries as two parts: a filtering part (OrderBy, Where, Select, …), in which I fully expect the operations to be reordered/swapped to perform as well as possible, and a projection part (First, FirstOrDefault, ToList, …) that I expect to run after the filtering. That’s the reason this optimization bothers me.

Basically, I picture the following query source.OrderBy(i => i.Id).Where(i => IsApproved(i)).Select(i => i.Date).FirstOrDefault(i => IsLeapYear(i)) as “get the date of all the approved items, sorted by id, THEN take the first one that is a leap year”. I wouldn’t ever take a dependency on how often “IsApproved” is called, but I have strong expectations about how often “IsLeapYear” would be called.

var query = source.OrderBy(i => i.Id).Where(i => IsApproved(i)).Select(i => i.Date); // Filtering
var result = query.FirstOrDefault(i => IsLeapYear(i)); // Projection

Interestingly, it means that, even though those two queries are equivalent:

source.OrderBy(i => i.Id).Where(i => IsApproved(i)).First()
source.OrderBy(i => i.Id).First(i => IsApproved(i))

I wouldn’t mind if IsApproved were to be called an arbitrary number of times on the first one, but it comes as a surprise that it would be called for each element in the second one.

I don’t know if it’s helping or making things even more confusing. I surprised myself by having different expectations about Where(...).First() and First(...), so I thought it’d be interesting to deconstruct the reasoning.

In other words, if you want to disable the optimization for now, just use Aggregate. - Much clearer 😃

I’d rather stick a ToList in there than having to decipher Aggregate later on. Code readability matters more than performance in most places, I’d argue.

The question that needs to be resolved here is whether a (likely but uncertain) performance improvement should come at the cost of users’ expectations from looking at the code. It’s not just about whether to introduce a breaking change to Core, but rather a fundamental platform design choice that may guide future decisions as well.

There is a breaking change list – e.g., this one is for .NET Framework -> .NET Core.

The process to add to it is to open an issue with this template: https://github.com/dotnet/docs/issues/new?template=dotnet-breaking-change.md

@danmosemsft Thank you for linking this process. I see that although this change went in years ago, it hasn’t made it to that list yet.

Question: Where does the responsibility lie for submitting breaking changes for that list?

I ask because I’m wondering both:

  1. if I should follow this process to submit this breaking change to that list myself.
  2. how confident I should feel that other breaking changes like this aren’t also missing from that list.

this is just hyperbole

It is not hyperbole, and suggesting that I’m exaggerating or making false claims just makes me want to stop participating in an otherwise healthy discussion. So I’m going to pause on this for now.

There is a breaking change list – e.g., this one is for .NET Framework -> .NET Core.

The process to add to it is to open an issue with this template: https://github.com/dotnet/docs/issues/new?template=dotnet-breaking-change.md

Reading through that discussion is a little terrifying. This significant breaking change is glossed over as simply being not “a ‘pure’ optimization”.

Is there any comprehensive list of breaking changes that includes this one? I’d like our department to move to Core, but it now seems like an unwise move if changes like these have been made silently.

Any chance of opening back up the original discussion that led to this change? @stephentoub @VSadov

An important characteristic of the standard query operators is that they are implemented as pure functions.

It implies therefore that the overloads that accept external predicates should also be pure, and teaches you how to think in terms of pure functions.

I don’t think that follows. I think it just means that they themselves work in terms of pure functions. Builders, which look like LINQ operators when they’re written in a “fluent” style, often don’t, and just mask setting flags or fields on some underlying object, but the LINQ operator call itself should be pure in that it returns a new object, and that if you have saved an old object, it should not be changed. I.e., var x = seq.Where(d => d % 2 == 0); var y = x.Select(d => d * d); should allow using x afterwards without the Select having made destructive changes to a fused WhereSelect optimization operator, because you have no idea where the result of the Where operator went. For all you know it’s reused in hundreds of programmatically generated queries. (This does not apply to this issue’s contested change, just to the quote above and what it means in general.)

I would be completely onboard with a restriction to pure functions in predicates and selectors (input functions to LINQ operators) as long as it could be documented and verified by the compiler and the type system. That would allow more optimizations safely, and it would let us know what could and what couldn’t be done. If there’d been error messages from the start, this would never have been an issue. But with such a restriction not being in place, it’s dangerous for the implementation to suddenly change the assumptions about what the code does now that code has been written for 12 years with a different set of assumptions.

The two-phase solution listed in an answer before assumes that you can split it into two operations. Maybe that’s impossible because both steps need to be taken atomically, or because they’re handled by code you don’t control.

At the root of this, it comes down to:

var x = [...].FirstOrDefault(impure function);

There’s 12 years of programmers writing that instead of:

foreach ([...]) {
    if ([impure function]) {
        break;
    }
}

When both solutions are available, people often choose the shorter one. If the first one had been ruled out by a more capable system, this would never have been possible. (Also, people really like the shorter code: GitHub and NuGet are both full of projects that implement .ForEach(...) extension methods for IEnumerable to avoid using foreach directly. I can’t say I agree with that personally, but that’s where it comes from.)

@Nirmal4G such code can simply be written using a foreach loop:

foreach (var account in accounts.OrderBy(x => x.UsageCount))
{
  if (account.TryReserve(token))
    return account;
}

Due to the side effect this is probably how I would prefer the code to look anyway.

Looks like that change was included since .NET Core 1.0, changing that now would have issues for folks already on .NET Core.

The ask is now to change a behavior that exists since .NET Core 1.0, which would impact people going from an older version of .NET Core to a newer version.

With all my due respect to you both, I thought we already learned the lesson that the mantra “fixing this bug would break the code that depends on current [incorrect] behavior” is considered to be incorrect. No?

This current behavior impacts people when moving from .NET Framework to .NET Core.

With .NET 5 there will be no more moving between frameworks. It’ll be moving between version 4 and version 5. Why is “backward compatibility” so often (once is already too often) an argument for not fixing what ought to be fixed? Backward compatibility is for preserving good stuff, not the other way around.

In this case it sounds more like “we broke it such a long time ago [and it went unnoticed] that fixing it now is kind of too late”. No, it’s never too late. Fix it in .NET 5 then: keep the current behavior in .NET Core, and make the behavior of .NET 5 the same as in .NET 4. Problem solved.

As a quick thought: While Mark looked at nuget.org (great analysis!), what impact would rolling back these changes have on ASP.NET’s performance in the TechEmpower benchmarks?

If you want to do well in benchmarks of that sort you need to be writing allocation-free code (among other things), and are thus not using LINQ at all.

Reverting the optimization would also only affect LINQ to Objects (aka in-memory queries). Unless you are querying over large collections, or explicitly writing benchmarks to measure the performance difference, you are unlikely to be able to tell which implementation is used.

@tibel

A query describes what you want and not how to get there. You cannot read a query as imperative statements that follows OOP (and SOLID) principles as a query is truly functional.

This is true, except: IEnumerable interfaces discussed here return IEnumerables, not IQueryables. They are not “queries” in the declarative sense, they’re the dual with identical names. As I mentioned in my earlier comment, when logic programs try to do more than just be logic programs, they run into problems where developers are forced to think operationally, and end up being surprised when those operations interact with updates, persistence, events, delays. I listed some operational considerations in my previous reply. For my taste, LINQ is just a marketing phrase and the fact IEnumerable lives in the LINQ namespace has more to do with Erik Meijer and other functional programming researchers seeking to create the “universal collections api”.

Reasoning from first principles, Erik and Gavin note:

Looking up a value given its address or key in a key-value store can involve a computation, which in a truly open world has potential latency and may fail. For example, in the C# language getters and setters, known as properties, can invoke arbitrary code. Perhaps an even better example of a computation-driven key-value store with long latency and high probability of failure (always able to handle a 404) is the Web, with URIs as keys, “resources” as values, and the HTTP verbs as a primitive query and data-manipulation language. On the other hand, in a C-like key-value memory model, we usually make the simplifying assumption that a key lookup in memory takes constant time and does not fail.

Traversing a relationship in the closed world of the relational model involves comparing two values for equality, which is guaranteed to succeed because of referential integrity; and vice versa, referential consistency dictates that relationships are value-based. Otherwise, we could never be sure that referential consistency actually holds.

I find the dual open world/closed world useful in thinking about what is okay to optimize. In an open world, compilers can implement escape analysis and other analysis to convert open world problems into locally closed world problems, and then it would be okay to optimize.

In addition, in my understanding, functional programmers have greater control here because they use a trick called SuperFold which allows the OrderBy to seamlessly integrate a higher order function (FirstOrDefault). But I don’t really know what peephole tricks ML compilers do on fold operations. I would expect an FP standard library not to perform tricks, and would expect that the developer implement a SuperFold call instead, but don’t know if that expectation is widespread.

That said, the holy grail of computer science is to “separate specification from execution”.

@isaacabraham I saw a reply by you via email but it vanished. In .NET, predicates are not guaranteed to be pure and stateless, and predicates are objects, and objects are fundamentally higher-order things with possibly “true hidden state”. There was an attempt to add “code contracts” in .NET framework, but they were largely removed in .NET Core as part of the MS Researcher leading the efforts moving on to Facebook and general problems scaling code contracts with concurrent execution.

2. It’s not intuitive. I can buy that the behavior can be surprising, but we need to consider the larger picture here.

Hmm… coming from a logic programming background, I was always taught as an engineer that Ordering == Enumerating == Counting == Synchronization. From a Datalog perspective, that is 100% true… until they allowed extensions like updates / persistence. Then came Daedalus to help:

Quote from research paper / language design rationale paper abstract (emphasis mine):

Recent research has explored using Datalog-based languages to express a distributed system as a set of logical invariants. Two properties of distributed systems proved difficult to model in Datalog. First, the state of any such system evolves with its execution. Second, deductions in these systems may be arbitrarily delayed, dropped, or reordered by the unreliable network links they must traverse. Previous efforts addressed the former by extending Datalog to include updates, key constraints, persistence and events, and the latter by assuming ordered and reliable delivery while ignoring delay. These details have a semantics outside Datalog, which increases the complexity of the language and its interpretation, and forces programmers to think operationally. We argue that the missing component from these previous languages is a notion of time.

I guess engineers just repeat the circle of life when it comes to designing things. I won’t comment on good/bad, but I do agree with the sentiment of the Daedalus authors about forcing programmers to think operationally. I’ll leave it as an opinion to the reader as to whether that’s good or bad, and whether it’s possible to innovate as Daedalus did in the Datalog community. I tend to prefer designs that allow explicit behavior when necessary, e.g. explicating time. And I’m not just trying to sound smart/theoretical. I have made the same practical points in the Entity Framework community for years regarding first principles of how to do “call-by-intention” compilers “intuitively”. I even recall when Erik Meijer was working on Volta , I e-mailed him and said Volta wasn’t intuitive because its call-by-intention semantics do not match real-world events in the web browser. I e-mailed Lex Spoon at Google on pretty much the same principles back when GWT was a hot thing. I mention these examples, because, to me, if you squint, mobile, distributed code architecture is very much like debating over whether you have ex ante rights to optimize a logic program that may have destructive updates / external events / delays.

I have some additional questions/observations, but I need to look through my own code where I might be doing this pattern to weigh in further with some pragmatic data points (e.g., “welp, nobody has reported this in four years so it mustn’t be a big bug”). Thanks for sharing on twitter, @terrajobst ! I hope I’m not embarrassing myself by chiming in; I am a mostly self-taught computer engineer and don’t often engage in discussions with big brains like the types Microsoft employs.

Cheers.

The change has also been around for years, since the beginning of .NET Core, and even with the extensive use cited as motivation, this is the only complaint I’ve heard about it.

Many companies haven’t switched over to .NET Core yet (mine still hasn’t). I expect as more people transition over there will be more complaints and bugs silently introduced.

@stephentoub Do you think it would be reasonable to provide a list of breaking changes people should expect (such as this one) to aid in their decision for when and how to switch to .NET Core?

Another way of opting out of the optimization without full-on ToList or ToArray is to create a cloaking enumerable wrapper which just passes through all enumeration (at the cost of one allocation):

using System.Collections;
using System.Collections.Generic;

public class CloakedEnumerable<T> : IEnumerable<T> {
    private readonly IEnumerable<T> _inner;

    public CloakedEnumerable(IEnumerable<T> inner) {
        _inner = inner;
    }

    public IEnumerator<T> GetEnumerator() {
        return _inner.GetEnumerator();
    }

    System.Collections.IEnumerator IEnumerable.GetEnumerator() {
        return _inner.GetEnumerator();
    }
}

public static class CloakedEnumerableExtensions {
    public static IEnumerable<T> AsCloakedEnumerable<T>(this IEnumerable<T> sequence) {
        return new CloakedEnumerable<T>(sequence);
    }
}

// later...
sequence.OrderBy(x => x).FirstOrDefault(x => x % 42 == 0) // does not opt out
sequence.OrderBy(x => x).AsCloakedEnumerable().FirstOrDefault(x => x % 42 == 0) // opts out

Note that this needs to be done after all subsequent ThenBy ordering is done since it doesn’t implement IOrderedEnumerable.

AsEnumerable() is a tempting alternative but just casts to IEnumerable<T>, and doesn’t prevent finding out the identity of the class that the optimization is based on.

@juliusfriedman Are you asking about portability across netstandard2.1/netstandard2.2 libraries where the end-user might be on .netcoreapp2.1/.netcoreapp3.1 LTS? The simplest workaround is to call ToList() or ToArray() between the OrderBy and FirstOrDefault. The drawback to the workaround is extra intermediary allocations. And yes, the reason it works is because you break the interface to OrderedEnumerable. Effectively, you go from “linq land” to “object land”.
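As a sketch of that workaround (the Account type here is a toy stand-in for the one in the original report): the ToList() call materializes the sorted sequence, so FirstOrDefault sees a plain List<T> rather than an OrderedEnumerable and stops at the first successful reservation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Account
{
    public int UsageCount { get; set; }
    public bool Reserved { get; private set; }

    public bool TryReserve()
    {
        if (Reserved) return false;
        Reserved = true;
        return true;
    }
}

class Workaround
{
    static void Main()
    {
        var accounts = new List<Account>
        {
            new Account { UsageCount = 2 },
            new Account { UsageCount = 0 },
            new Account { UsageCount = 1 },
        };

        // ToList() breaks the OrderedEnumerable fast path, at the cost of
        // one intermediate allocation, so only the least-used account gets
        // reserved.
        var account = accounts
            .OrderBy(x => x.UsageCount)
            .ToList()
            .FirstOrDefault(x => x.TryReserve());

        Console.WriteLine(account?.UsageCount);              // 0
        Console.WriteLine(accounts.Count(a => a.Reserved));  // 1
    }
}
```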

Does LastOrDefault(predicate) exhibit this same issue?

FirstOrDefault(predicate) in .NET Framework iterated from the beginning of the enumerable and stopped the first time the predicate passed. That’s not the case for LastOrDefault, so I don’t know what the question is that you’re asking.

Does LastOrDefault() or FirstOrDefault() exhibit this same issue?

The concern with this issue was the invocations of the predicate. I don’t know what issue you’re referring to when discussing the overloads of FirstOrDefault and LastOrDefault that don’t take a predicate.

Does First() or Last() exhibit this same issue?

Same as above.

Does First(predicate) or Last(predicate) exhibit this same issue?

First(predicate), yes. Last(predicate), see my first answer above.

I assume OrderByDescending is also effected in the same way.

Yes.

@jzabroski

What I am trying to arrive at is “algebraic laws” for LINQ, or PEMDAS for LINQ. It seems most of us are voting that FirstOrDefault should not have a distributive property, and that OrderBy.FirstOrDefault disallows composing FirstOrDefault with OrderBy by distributing the FirstOrDefault aggregate over the OrderBy operator. But then, what about other aggregate operators?

Realistically, whilst that sounds a really worthy goal to me, I think the time to do this has long gone - probably around 2006/7 when LINQ was being worked on - leaving aside the fact that C#'s type system has no real way of detecting pure or impure functions.

The priority, I feel, should simply be ensuring non-breaking changes to users that migrate from Framework to Core so that they fall into the pit of success. IMHO. I’ll stop here though.

@JesperTreetop Yeah I knew I should have known this: I’ve written a full IQueryable linq provider myself but apparently my Mondays suffering brain didn’t realize this morning, Enumerable uses the same thing of course and due to that it can be used to optimize work that has been specified earlier as it wraps the sequence + arguments in a new specialized object. 😄

For the few people (like myself) who might be wondering how this optimization is implemented, considering that in e.A().B() A is executed before B, so how can B’s logic influence how A performs its logic: the sort operation specified with OrderBy is evaluated at enumeration of the sequence, not when the method is called, so no sorting takes place in e.OrderBy(keySelector).FirstOrDefault(predicate) until FirstOrDefault is called!

Very clever 😃 Nice trick to keep in the toolbox.
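The deferred-evaluation trick described above can be modeled outside of .NET. Below is a minimal Python sketch - not the actual Enumerable implementation; `OrderedSequence`, `order_by`, and `first_or_default` are made-up names - showing how a wrapper that defers sorting lets a terminal operator inspect it and pick a cheaper plan:

```python
# A Python sketch of deferred ordering: order_by returns a wrapper that
# records the key but does no sorting; sorting happens only when the
# result is iterated, which lets a terminal operator specialize.

class OrderedSequence:
    def __init__(self, source, key):
        self.source = source
        self.key = key

    def __iter__(self):  # sorting is deferred until enumeration
        return iter(sorted(self.source, key=self.key))

def order_by(source, key):
    return OrderedSequence(source, key)

def first_or_default(seq, pred, default=None):
    # A terminal operator can recognize the wrapper and choose a cheaper
    # plan instead of triggering the full sort. Note the cost: pred now
    # runs once per element.
    if isinstance(seq, OrderedSequence):
        matches = [x for x in seq.source if pred(x)]
        return min(matches, key=seq.key, default=default)
    return next((x for x in seq if pred(x)), default)

print(first_or_default(order_by([4, 2, 9], key=lambda x: x),
                       pred=lambda x: x > 2))  # → 4, without a full sort
```

This is the same shape as `OrderedEnumerable.TryGetFirst`: the specialized path avoids the O(N log N) sort, at the price of evaluating the predicate for every element.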

I wasn’t trying to make any new points or arguments in my comment. I was simply answering the questions Mark asked.

@stephentoub

We’ve identified scenarios in which the “performance optimization” seems to be pretty damaging to performance (if I understand correctly)

It’s about a) how many elements are being sorted, b) how expensive the predicate passed to FirstOrDefault is, and c) how likely it is that the element it’s searching for is found closer to the beginning of the results than later. An O(N log N) / O(N^2) algorithm became O(N), so if N is very large, the rest of the costs almost don’t matter. But as N drops in size, the cost of executing the predicate N times vs <= N times becomes more relevant.
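The trade-off described above can be made concrete with a small Python model (the function names are invented for illustration): the old behavior sorts everything first but stops invoking the predicate at the first match, while the single-pass optimization invokes the predicate once per element.

```python
# A Python sketch (not the actual .NET implementation) contrasting the two
# strategies for OrderBy(key).FirstOrDefault(pred). Each returns the result
# plus the number of predicate invocations.

def first_sort_then_scan(items, key, pred):
    """Old behavior: fully sort (O(N log N)), then invoke pred
    only until the first match."""
    calls = 0
    for item in sorted(items, key=key):
        calls += 1
        if pred(item):
            return item, calls
    return None, calls

def first_single_pass(items, key, pred):
    """New behavior: one O(N) pass tracking the smallest key among
    matching elements; pred runs once per element."""
    calls = 0
    best = None
    for item in items:
        calls += 1
        if pred(item) and (best is None or key(item) < key(best)):
            best = item
    return best, calls

items = [5, 3, 8, 1, 7]
pred = lambda x: x >= 3  # smallest matching element is 3

print(first_sort_then_scan(items, key=lambda x: x, pred=pred))  # (3, 2)
print(first_single_pass(items, key=lambda x: x, pred=pred))     # (3, 5)
```

Both return the same element, but the single-pass version calls the predicate N times - which is exactly the side-effect hazard this issue is about, and also why the single pass wins for large N when the predicate is cheap.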

Ok, though wouldn’t you agree that this combined behavior (and thus the room for optimization) is only expected if one thinks of enumerable.ExtensionMethodA().ExtensionMethodB()...ExtensionMethodZ() as a single query which is to be optimized by a runtime?

What I mean by that is that if someone writes:

var r = someEnumerable.OrderBy(x=>x.SomeField).ThenBy(y=>y.SomeOtherField).FirstOrDefault();

one can reasonably read this as: someEnumerable is enumerated and sorted first by OrderBy(), then the result of that is enumerated and sorted by ThenBy(), and then the first element, if any, is taken from that sequence - not as a single query like SELECT TOP 1 * FROM Table ORDER BY SomeField ASC, SomeOtherField ASC, which is then optimized by the RDBMS’ own optimizer.

So if something is optimized under the hood, that’s a free bonus, but if not, that’s fine too, simply because that’s how the design is, right? Like we expect the same from:

var x = someEnumerable.CustomMethodA().CustomMethodB().CustomMethodC();

I’d be surprised if someone would see this as ‘oh, that’s a single query on someEnumerable, where the 3 operations in the 3 methods are merged to do the work in the most optimal form’. They likely think: CustomMethodC() works on the result of CustomMethodB() (and thus will run after it) which will work on the results of CustomMethodA(). Why should a Linq extension method be an exception to that perception how behavior is going to be? (not that it shouldn’t, I agree 100% with your analysis regarding that it can be faster).

So, if someone wants a higher-performing ‘ordering + take the first’ operation, isn’t the reasonable thing to do to simply hand-roll that faster code manually? Only then can you combine operations from multiple method calls into a single iteration (like in your example). An alternative could be pushing this into a compiler optimization phase, where it’s known which things can be optimized (e.g. enumerable.OrderBy(predicate).Distinct().FirstOrDefault(predicate2) can be optimized to enumerable.OrderBy(predicate).FirstOrDefault(predicate2), which can in turn be optimized to what we have today if predicate2 doesn’t contain a method call).

Of course, the API could offer a public way to write this in LINQ without dropping down to a hand-rolled loop, by offering a specialized method for this (as ‘taking the max/min value from a set’ is a common operation) or a similar construct, like with .join().select() where the projection in the select() method can in some situations be moved into the join, avoiding a double projection.
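Such a specialized operator might look like the following Python sketch. The name `min_by_where` is hypothetical - it is not an existing LINQ method - but it makes the single-pass semantics explicit instead of hiding them inside OrderBy(...).FirstOrDefault(...):

```python
# A hedged sketch of a hypothetical dedicated operator: return the element
# with the smallest key among those matching the predicate, in one pass.

def min_by_where(items, key, pred, default=None):
    best = default
    best_key = None
    for item in items:
        if pred(item):           # predicate visibly runs once per element
            k = key(item)
            if best_key is None or k < best_key:
                best, best_key = item, k
    return best

# Example: pick the account with the lowest usage count among those
# we are allowed to reserve (tuples of (name, usage_count)).
accounts = [("a", 3), ("b", 1), ("c", 2)]
print(min_by_where(accounts,
                   key=lambda a: a[1],
                   pred=lambda a: a[0] != "b"))  # → ('c', 2)
```

Because the "once per element" cost is stated in the operator's contract rather than smuggled into a composition of two other operators, a side-effecting predicate is no longer a surprise.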

As in the end it’s what one ‘thinks’ the code will do that is most important for maintainable software: that perception should match reality as that leads to less bugs.

@msloan

Question: Where does the responsibility lie for submitting breaking changes for that list?

Anyone can open such an issue – if we become aware of something requiring an issue we would open one, but stuff gets missed or overlooked and we would welcome other submissions. In the case of this issue, the process did not exist at the time the change was made.

What I expected to see here was the .net team realizing they had overlooked a situation (hey, we’re all human) and looking for a way to mitigate the side effect and at least make user code work like expected.

I think that is jumping to conclusions. I have immense respect for the minds considering this and overall trust the team to consider all angles before deciding what is right (or rather, most important).

I also value performance enhancements, but do not personally think that the implications of this one was fully considered before shipping it. I’m happy to see that there seems to be strong support for reverting the change, but I’ll accept the outcome regardless of which way it swings.

It would be useful to write a Roslyn analyzer that can detect side-effecting operations in LINQ. But I don’t believe that is possible now, with the limited information available from the compiler.

What particular implementation detail could .NET Core developers rely on here that would make reverting a breaking change?

...OrderBy(...).FirstOrDefault(func) will currently call the lambda for all elements. It’s possible that evaluating this lambda has side effects on the elements it’s executed on, side effects that code happens to rely on. When that no longer happens, code might, for example, null-reference because it didn’t properly check the underlying field for null. Before, the bug wasn’t observed because the lambda happened to trigger the initialization.

Am I making stuff up? Yes. But have I seen bug reports like that? For sure. My point isn’t that this is such a big deal that we wouldn’t do it. I’m just pointing out that it has very real potential of breaking customer code too. The problem usually isn’t “will correct customer code continue to work”. It’s “will an app that worked before now no longer work”. In .NET Framework this bar prevented us from doing lots of things. In .NET Core we’re more willing to accept breaks, especially when they can’t be observed in correctly written code.

But the last part is the important piece here: what amounts to correct code is often in the eye of the beholder. I think what we’re saying is that relying on side effects in lambdas passed to LINQ is incorrect. And to a certain extent that’s your argument as well (when you asked that question above).
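The hypothetical breakage described a few paragraphs up can be sketched in Python (the scenario and all names here are invented, as the comment itself admits):

```python
# A contrived sketch of code that accidentally depends on the predicate
# running for every element, so that a side effect initializes state.

class Account:
    def __init__(self, name, usage):
        self.name, self.usage = name, usage
        self.field = None  # only set as a side effect of try_reserve

    def try_reserve(self):
        self.field = f"init-{self.name}"  # side effect the app relies on
        return True

def first_min_single_pass(items, key, pred):
    """Models the current behavior: pred runs once per element."""
    matches = [x for x in items if pred(x)]
    return min(matches, key=key, default=None)

accounts = [Account("a", 2), Account("b", 1)]
first_min_single_pass(accounts,
                      key=lambda a: a.usage,
                      pred=lambda a: a.try_reserve())

# Because the predicate ran for every element, every account's field was
# initialized as a side effect. Code that later reads .field without a
# null check would start failing if the predicate ran lazily again.
assert all(a.field is not None for a in accounts)
```

In other words, reverting the optimization is itself a behavioral change that some (arguably incorrect) code in the wild may now depend on.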

But that’s my point exactly. Why have special-case handling for First-and-friends that behaves differently from (a) what it used to be and (b) what the intuitive behavior looks like from reading the code?

Plain and simple: perf. Most perf optimizations are achieved by dissolving abstraction boundaries in order to shortcut. This optimization is no different from that.

The ask here is only that terminating expressions that do not visit all items “by nature” (so pretty much only First and FirstOrDefault) also behave as such.

Now we’re talking. That answers my question of why you think FirstOrDefault is being special. I actually think that’s an interesting point.