runtime: Bug with System.Linq.Enumerable.Average(this IEnumerable&lt;long&gt; source) method
It seems there's something wrong with the `System.Linq.Enumerable.Average<long>(this IEnumerable<long> source)` method.
Say I have a collection containing only two large numbers (each larger than 2^62): calling this method throws a System.OverflowException saying "Arithmetic operation resulted in an overflow".
I used the following code:

```csharp
using System;
using System.Linq;

var numbers = new[] { 8000000000000000002L, 7999999999999999998L };
Console.WriteLine(numbers.Average()); // throws OverflowException
```
I also had a look at the source code in .NET Framework 4.8, and it seems the problem is there, too. Is this a bug (I think it is), or are there other considerations?
Anyway, fixing this problem is quite straightforward, so I think we should get rid of it.
Thanks.
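The exception comes from the internal running sum, not the division: the `Average` overload for `IEnumerable<long>` accumulates into a `long` in a checked context, so the sum of the two values above overflows before the average is ever computed. A minimal sketch of that failing step:

```csharp
using System;

// The two values from the repro above; their exact sum (16000000000000000000)
// exceeds long.MaxValue (9223372036854775807), so a checked add throws.
long a = 8000000000000000002L;
long b = 7999999999999999998L;

bool overflowed = false;
try
{
    long sum = checked(a + b); // throws OverflowException
    Console.WriteLine(sum);    // never reached
}
catch (OverflowException)
{
    overflowed = true;
    Console.WriteLine("Arithmetic operation resulted in an overflow.");
}
```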
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 28 (18 by maintainers)
We definitely shouldn't be switching to `double`. It is lossy for integers above 2^53. `decimal` should be fine up to 29 digits, but will come at a perf cost. Exposing a new API that takes `BigInteger` would probably work as well.

@stephentoub @Clockwork-Muse @tannergooding @eiriktsarpalis
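A quick illustration of why a `double` accumulator is lossy: 2^53 is the largest power of two below which every integer has an exact `double` representation; above it, the spacing between consecutive doubles is 2 or more, so distinct `long` values collapse to the same `double`. A small sketch:

```csharp
using System;

long limit = 1L << 53; // 9007199254740992

// Below the limit, adjacent longs map to distinct doubles...
bool exactBelow = (double)(limit - 1) != (double)limit;

// ...but above it they collide: 2^53 + 1 rounds back to 2^53.
bool lossyAbove = (double)limit == (double)(limit + 1);

Console.WriteLine(exactBelow); // True
Console.WriteLine(lossyAbove); // True
```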
[Updated] I realize that this works because all the numbers are positive in my case; I did not find a simple solution, sigh…
I just want to share a naive implementation here that will probably be able to calculate the average of large numbers (in this case it's Int64, but the same approach can be used with other types: Int128, Decimal, etc.).
It returns Int64, but the commented-out code on the last line can make it return double.
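The commenter's snippet is not reproduced in this extract. One common way to realize the idea described (averaging without ever forming the full sum) is to accumulate per-element quotients and remainders separately, folding full counts of remainder back into the quotient so no partial sum overflows; this sketch is an assumption about the approach, not the commenter's exact code:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: avg = sum(x / n) + sum(x % n) / n, with the remainder
// sum folded back into the quotient sum each step so it stays bounded by n.
// This is not the commenter's exact code.
static long AverageLarge(IReadOnlyList<long> source)
{
    long count = source.Count;
    long quotientSum = 0;  // running sum of x / count
    long remainderSum = 0; // kept within (-count, count) by the fold below
    foreach (long x in source)
    {
        quotientSum += x / count;
        remainderSum += x % count;
        quotientSum += remainderSum / count; // fold full counts back in
        remainderSum %= count;
    }
    return quotientSum;
    // return quotientSum + (double)remainderSum / count; // double-returning variant
}

var numbers = new long[] { 8000000000000000002L, 7999999999999999998L };
Console.WriteLine(AverageLarge(numbers)); // 8000000000000000000
```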
I just did some simple tests, and the performance seems almost the same as the Decimal version. Below is the result of averaging the 100000001 largest Int64 values; both versions took about 2 seconds:
```text
Number set: RangeInt64(Int64.MaxValue - 100000000, 100000001)
2020-11-28T19:01:51.6389464+08:00
This Version:    9223372036804775807
2020-11-28T19:01:53.6810632+08:00
Decimal Version: 9223372036804775807
2020-11-28T19:01:55.7211799+08:00
```
I also tested this algorithm with small numbers, and it is about 2.5x slower than the current average algorithm.
But this is really a naive implementation and needs to be improved, as it overflows for `{Int64.MinValue, Int64.MaxValue}`. Really frustrating not to have it complete. 😦 Thanks for your time.
I'm just trying to ensure the tradeoffs of fixing the bug, for the given API, are well understood. The maximum error today is limited by the fact that you can't have a base sum over `long.MaxValue`; extending this will increase the maximum error possible for the result returned from the API, and will likely just better expose an existing issue with the API that people will have to deal with instead.

For example, let's assume you are summing `long.MaxValue - 10` through `long.MaxValue`; this equals `101457092405402533822`, and the average is `9223372036854775802`.

However, the nearest representable `double` to the sum is `101457092405402533888` (off by 66), and `(double)sum / count` (to compute the nearest representable average) then ends up being `9223372036854775808` (off by 6).

This rounding error will only increase as larger sums become allowed for the given algorithm, and will quickly become a problem for anyone dealing with large `long` values (e.g. summing `long.MaxValue - 1000` through `long.MaxValue` ends up being off by over 500).

Just switching to a `decimal` sum incurs a 4x perf hit on my machine. https://github.com/stephentoub/corefx/commit/5c33656afafc445d467b43c91c21ecbef229a0e7, which switches to `decimal` on overflow, incurs an ~15% perf hit on my machine. 4x slower is a huge regression to take for a scenario that, to my knowledge, has inconvenienced very few developers.
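The rounding error described above for the `long.MaxValue - 10` example is easy to verify directly; a small sketch using `BigInteger` for the exact arithmetic:

```csharp
using System;
using System.Numerics;

// Exact sum of the 11 values long.MaxValue - 10 .. long.MaxValue.
BigInteger exactSum = 0;
for (int i = 0; i <= 10; i++)
    exactSum += (BigInteger)long.MaxValue - i;

BigInteger exactAverage = exactSum / 11;  // 9223372036854775802 (divides evenly)

double roundedSum = (double)exactSum;     // nearest double: 101457092405402533888
double roundedAverage = roundedSum / 11;  // exactly 2^63 = 9223372036854775808

Console.WriteLine(exactSum);                                  // 101457092405402533822
Console.WriteLine((BigInteger)roundedSum - exactSum);         // 66
Console.WriteLine((BigInteger)roundedAverage - exactAverage); // 6
```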
One approach we could take is https://github.com/stephentoub/corefx/commit/5c33656afafc445d467b43c91c21ecbef229a0e7, which on my machine incurs up to a 15% hit. Much better than 4x, but I’m not sure it’s worth it.
@cston, @pentp, @tannergooding, any opinions?
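For reference, a minimal sketch of the overflow-fallback idea: stay on fast checked `long` accumulation, and only switch the accumulator to `decimal` once an overflow actually occurs. The names and structure below are illustrative assumptions, not the code from the linked commit:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of "switch to decimal on overflow"; not the linked
// commit's actual code. The fast path uses plain long adds, so sequences
// that never overflow pay almost nothing extra.
static double AverageWithFallback(IEnumerable<long> source)
{
    using IEnumerator<long> e = source.GetEnumerator();
    if (!e.MoveNext())
        throw new InvalidOperationException("Sequence contains no elements");

    long sum = e.Current;
    long count = 1;
    while (e.MoveNext())
    {
        long value = e.Current;
        count++;
        try
        {
            sum = checked(sum + value); // fast path: long accumulation
        }
        catch (OverflowException)
        {
            // Slow path: continue the sum in decimal, which carries ~28-29
            // significant digits — ample headroom for realistic long sums.
            decimal decimalSum = (decimal)sum + value;
            while (e.MoveNext())
            {
                decimalSum += e.Current;
                count++;
            }
            return (double)(decimalSum / count);
        }
    }
    return (double)sum / count;
}

var numbers = new long[] { 8000000000000000002L, 7999999999999999998L };
Console.WriteLine(AverageWithFallback(numbers)); // 8E+18
```

Note the real cost question is where the overflow check lives: a `checked` add in the hot loop is essentially free, and the `catch` executes at most once per sequence.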