runtime: Bug with System.Linq.Enumerable.Average(this IEnumerable<long> source) method

It seems there’s something wrong with the System.Linq.Enumerable.Average(this IEnumerable<long> source) method.

For example, if I have a collection with only two large numbers (each larger than 2^62) and I call this method, it throws a System.OverflowException saying “Arithmetic operation resulted in an overflow”.

I used the following code:

var numbers = new[] { 8000000000000000002L, 7999999999999999998L };
Console.WriteLine(numbers.Average());

I also had a look at the source code in .NET Framework 4.8, and it seems the problem is there, too. Is this a bug (I think it is), or are there other considerations?
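For reference, the overflow comes from the accumulator rather than from the average itself: the overload sums the values into a checked Int64 before dividing. A simplified sketch of that shape (not the actual framework source):

static double AverageSketch(IEnumerable<long> source)
{
    long sum = 0;
    long count = 0;
    checked
    {
        foreach (long value in source)
        {
            sum += value; // throws once the running sum exceeds Int64.MaxValue
            count++;
        }
    }
    if (count == 0)
        throw new InvalidOperationException("Sequence contains no elements");
    return (double)sum / count;
}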

Anyway, fixing this problem seems quite straightforward, so I think maybe we should get rid of it.

Thanks.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 28 (18 by maintainers)

Most upvoted comments

We definitely shouldn’t be switching to double. It is lossy for integers above 2^53. decimal should be fine up to 29 digits, but will come at a perf cost.
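As a quick illustration of the double issue (my own snippet, not from the thread): above 2^53, adjacent Int64 values collapse to the same double, while decimal still distinguishes them.

long x = 1L << 53; // 9007199254740992
Console.WriteLine((double)x == (double)(x + 1));   // True:  double can no longer tell adjacent integers apart
Console.WriteLine((decimal)x == (decimal)(x + 1)); // False: decimal still can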

Exposing a new API that takes BigInteger would probably work as well.
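A rough sketch of what BigInteger-based accumulation could look like (hypothetical helper, not a real framework API):

using System.Numerics;

static double AverageViaBigInteger(IEnumerable<long> source)
{
    BigInteger sum = BigInteger.Zero; // the sum itself can no longer overflow
    long count = 0;
    foreach (long value in source)
    {
        sum += value;
        count++;
    }
    if (count == 0)
        throw new InvalidOperationException("Sequence contains no elements");
    return (double)sum / count;
}

Note that the final (double)sum / count conversion still has the rounding issue discussed further down in this thread; only the overflow goes away.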

@stephentoub @Clockwork-Muse @tannergooding @eiriktsarpalis

[Updated] I realize that this works only because all the numbers in my case are positive; I did not find a simple general solution, sigh…

I just want to share a naive implementation here that should be able to calculate the average of large numbers (in this case it’s Int64, but the same approach can be applied to other types as well: Int128, Decimal, etc.):

Int64 ComputeAverage(IEnumerable<Int64> data)
{
    if (data == null)
        throw new ArgumentNullException(nameof(data));

    if (!data.Any()) return 0L;

    // Invariant: the sum of the items seen so far equals avg * cnt + carry,
    // with carry normalized into [0, cnt) at the end of each iteration.
    Int64 cnt = 1;
    Int64 avg = data.First(), carry = 0;

    foreach (var item in data.Skip(1))
    {
        ++cnt;
        carry += item - avg; // can still overflow for extreme inputs, see the caveat below

        // Fold whole multiples of cnt from the carry back into the average.
        if (carry <= -cnt || cnt <= carry)
        {
            var temp = carry / cnt;
            avg += temp;
            carry -= temp * cnt;
        }
        // Normalize so that 0 <= carry < cnt.
        if (carry < 0)
        {
            --avg;
            carry += cnt;
        }
    }

    return avg; // + ((Double)carry / cnt);
}

It returns Int64, but the commented-out code on the last line can be used to make it return a double that includes the fractional part.
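For example, with the numbers from the original report, a quick check (my own snippet) gives the expected result instead of throwing:

var numbers = new[] { 8000000000000000002L, 7999999999999999998L };
Console.WriteLine(ComputeAverage(numbers)); // 8000000000000000000, no OverflowException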

I just did some simple tests, and the performance seems to be almost the same as the Decimal version. Below is the result of averaging the 100000001 largest Int64 values; both versions took about 2 seconds:

Number set: RangeInt64(Int64.MaxValue - 100000000, 100000001)

2020-11-28T19:01:51.6389464+08:00
This Version: 9223372036804775807
2020-11-28T19:01:53.6810632+08:00
Decimal Version: 9223372036804775807
2020-11-28T19:01:55.7211799+08:00

I also tested this algorithm with small numbers, and it is about 2.5x slower than the current average algorithm.

But this is really a naive implementation and needs to be improved, as it overflows for {Int64.MinValue, Int64.MaxValue}. Really frustrating not to have a COMPLETE solution. 😦

Thanks for your time.

I’m just trying to ensure that the tradeoffs of fixing the bug, for the given API, are well understood. The maximum error today is limited by the fact that you can’t have a base sum over long.MaxValue. Extending this will increase the maximum error possible for the result returned from the API, and will likely just better expose an existing issue with the API that people will have to deal with instead.

For example, let’s assume you are summing long.MaxValue - 10 through long.MaxValue; this equals 101457092405402533822 and the average is 9223372036854775802.

However, the nearest representable double to the sum is 101457092405402533888 (off by 66) and (double)sum / count (to compute the nearest representable average) then ends up being 9223372036854775808 (off by 6).

This rounding error will only increase as larger sums become allowed for the given algorithm, and it will quickly become a problem for anyone dealing with large long values (e.g. summing long.MaxValue - 1000 through long.MaxValue ends up being off by over 500).
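Those numbers can be reproduced with a quick check (my own snippet, accumulating exactly in BigInteger):

using System.Numerics;

// Exact sum of long.MaxValue - 10 through long.MaxValue (11 values).
BigInteger exactSum = BigInteger.Zero;
for (int i = 10; i >= 0; i--)
    exactSum += long.MaxValue - i;

Console.WriteLine(exactSum);                    // 101457092405402533822
Console.WriteLine(exactSum / 11);               // 9223372036854775802 (exact average)

double dblSum = (double)exactSum;               // rounds to the nearest representable double
Console.WriteLine(new BigInteger(dblSum));      // 101457092405402533888 (off by 66)
Console.WriteLine(new BigInteger(dblSum / 11)); // 9223372036854775808 (off by 6)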

decimal should be fine up to 29 digits, but will come at a perf cost.

Just switching to decimal sum incurs a 4x perf hit on my machine. https://github.com/stephentoub/corefx/commit/5c33656afafc445d467b43c91c21ecbef229a0e7, which switches to decimal on overflow, incurs an ~15% perf hit on my machine.
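The general idea behind that commit (my paraphrase, not its actual code) is to keep the fast checked long sum and only fall back to a decimal accumulator once it overflows:

static double AverageWithDecimalFallback(IEnumerable<long> source)
{
    long sum = 0;
    long count = 0;
    using (var e = source.GetEnumerator())
    {
        while (e.MoveNext())
        {
            count++;
            try
            {
                checked { sum += e.Current; }
            }
            catch (OverflowException)
            {
                // Continue in decimal from the element that overflowed.
                decimal decSum = (decimal)sum + e.Current;
                while (e.MoveNext())
                {
                    count++;
                    decSum += e.Current;
                }
                return (double)(decSum / count);
            }
        }
    }
    if (count == 0)
        throw new InvalidOperationException("Sequence contains no elements");
    return (double)sum / count;
}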

I think the convert-to-decimal approach is good enough here.

4x slower is a huge regression to take for a scenario that, to my knowledge, has inconvenienced very few developers.

One approach we could take is https://github.com/stephentoub/corefx/commit/5c33656afafc445d467b43c91c21ecbef229a0e7, which on my machine incurs up to a 15% hit. Much better than 4x, but I’m not sure it’s worth it.

@cston, @pentp, @tannergooding, any opinions?