rails: ActiveRecord distinct with sum produces unexpected results
Steps to reproduce
class Foo < ActiveRecord::Base
(has "price" field)
end
Foo.create(price: 5)
Foo.create(price: 12)
Foo.create(price: 5)
total = Foo.distinct.sum(:price)
Expected behavior
Returned value should be the sum of all prices, as there are three distinct foo
records. (total == 22
)
Actual behavior
Returned value is the sum of all distinct
prices. (total == 17
)
System configuration
Rails version: 5.2.0
Ruby version: 2.5.1
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 7
- Comments: 48 (29 by maintainers)
I feel it’s only ambiguous in theory as one expects to be able to read a statement from left to right. Almost all ruby code I can think of adheres to that.
To me, this feels like common sense:
relation.distinct
=> distinct relation by idrelation.distinct.sum(:credit_limit)
=> sum of credit_limit of the rows given by the statement aboverelation.distinct(:credit_limit).sum(:credit_limit)
=> sum of distinct credit limitsIf you would rethink this from scratch, wouldn’t you want this kind of api? Why would someone think the second statement will return the sum of distinct credit_limits? I suppose that would only be the case for people familiar with this issue.
Another solution may be to require the distinct call to have a column name param. That way every call is explicit (and would, just like the examples above, allow to be read from left to right).
Just ran into this issue today while upgrading from Rails 4.2.
For what it’s worth, I’m with @CodingAnarchy. I do understand the logic behind the conversation in this thread. I would still like to add my point of view as well: the current way is counter intuitive to the point that even an error would need to be raised. Or … the following reasoning is wrong and I should change several parts of our codebase and try to think about these relations and my life choices 😃 in a different way:
company.line_items
is supposed to return the line_items of the company. If needed, a distinct is added to the association to make sure the statement returns unique line_items. The distinct part is hidden as an implementation detail and does not leak to other parts of the application to make surecompany.line_items
always returns the intuitive records.company.line_items.sum(:qty)
is supposed to sum the qty’s of the line_items that would otherwise be returned bycompany.line_items
. Thuscompany.line_items.sum(:qty)
andcompany.line_items.sum(&:qty)
should return the same amount.company.line_items.distinct.sum(:qty)
would be the sum of the qties of the results returned bycompany.line_items.distinct
.If this isn’t possible, I’m almost akin to state that an error should be raised.
(Calculating the sum of distinct qties is a more exotic requirement. Intuitively I would suggest:
company.line_items.sum("DISTINCT qty")
.)@darwin67 would you build this another way?
In the real-world scenario, this was the result:
This should not be a “gotcha” as it is now, and I hope the AR team comes around to resolving it to change the SQL generated to combine these more appropriately.
I think it broke a lot of people’s use cases, and the intuitive behavior of the methodology. In this particular case, if the
DISTINCT SUM([field])
was the wrong behavior, then it should have explicitly been redirected to do aGROUP BY id
to get the concept across. Or even automatically implementing a subquery would be preferable.I can see why the SQL only happened to work in my case, or at least appeared to, and was otherwise broken, but it isn’t “fixed” either, in the sense that the intuitive result of
foos.distinct.sum(:bar)
is not the sum of allbars
on distinctfoos
.I disagree. I would find this to be a trivial thing that we should be able to understand.
foos.distinct.sum(:bar)
should give me the sum of thatbar
field on all the distinct records in the collection.has_many :through
can often return multiple join records. Joins are commonplace, and there are numerous locations in multiple apps that I have used adistinct
to get singular records for an item. Particularly common is thefoos.join(:bars).distinct
pattern to get “all foos with attached bars”.See:
https://stackoverflow.com/questions/5846638/how-to-display-unique-records-from-a-has-many-through-relationship
https://www.lugolabs.com/articles/distinct-active-records-in-has_many-through-associations
The example you gave is not necessarily the requirement of the query builder. If the query builder can identify that I have used both distinct and sum, it could build a query that actually gives me the sum of the distinct records. Even if it does it as a
SELECT SUM(bar) FROM ([previously built DISTINCT query])
.It’s not merely that it doesn’t fit my needs. Let me explain in a hypothetical scenario. Let’s say I have some
has_many :bars
association on aFoo
class. This works just fine. Now I want to add ahas_many :bazes, through: :bars
. Great. But when I getfoo.bazes
, I could get duplicate records that show up on my use of the collection. That’s not good, but it’s easy to solve - I put-> { distinct }
in the association. Now it will give me exactly one instance of eachbaz
, even if it is associated with multiplebars
. But now, suddenly, my DB sums (and counts, and averages, etc.) are giving me completely different answers!I recognize that there are some tradeoffs to be made, but in this case, there is essentially no use case where I would want to combine a
distinct
on some association or join withsum
in this manner. I further see no reason whyfoos.distinct.sum(:bar)
should not give me the “sum over the distinct records”. It seems further trivial to do afoos.sum("DISTINCT foos.bar")
if I actually wanted the other for some reason, and no real reason why a nicer syntax could not be added to the query builder for that - something likefoos.sum(distinct: :bar)
orfoos.sum(:bar, distinct: true)
.This would entail a lot of work to remove the change and make it work the way I think it should, and if I can’t even get y’all here to see the problem, it’s not something I would deem worth doing just to get it shot down.
If you did it without pluck (say
select(:id)
instead), it would just generate a subquery. But this is besides the point. I shouldn’t need a workaround to avoid using a framework method because that method can suddenly change to do wildly different behavior depending on whether it is chained with another method that might not even be in the same part of the code.Essentially, as a result of this issue, I am concluding (a) never use
distinct
(usegroup_by
instead?), (b) never use the DBsum
, © forget Rails and the DB and do everything in Ruby if at all possible. And you’re telling me that “well, no framework is perfect”. To me, these are serious shortfalls introduced by this change.And it’s not merely that
sum
is unintuitive, but that it drastically changes behavior based on other code around it.Just ran into this as well … very surprising and annoying.
What in the world is even the use case for summing only distinct fields?
The use case for summing a field over a distinct set of records is pretty clear and is what I would expect from reading
Model.distinct.sum(field)
I think others have already summarized well. I just wanted to add another take and see if it works to convince the maintainers. I also wouldn’t want this marked as stale or anything.
I’m unsure why it is up for debate that calling distinct on the relation is expected to do anything but return a distinct set of that relation and then calling any aggregate calculation on that set would perform the calculation on the distinct set.
I think the above follows set theory, as well as functional theory (piping results).
To call distinct on the calculated element of the new set to break with reasonable expectations and any math and functional theory I’ve seen.
I would expect people that want to actually perform a distinct on the calculation to do something like
relation.sum("DISTINCT element")
I’m sure others have worked around this themselves already but we have resorted to using
pluck
e.g.scope.distinct.pluck("SUM(...)").first
you may need to unscope some things (typically weunscope(:order)
).Anyway, as always this is all free so no anger or nothing, it just seems like a strange choice to dig in on keeping the current behaviour. The current behaviour so clearly seems like a bug.
I would actually have expected that
relation.distinct.sum(:credit_limit)
would produce something like:with that then optimized for the particular database you are actually working with.
I can only say that the SQL query in 4.2.x did what I expected (at least on Postgres) and did indeed give me the sum of distinct records in my table.
Furthermore, I would expect that the chaining of scopes would not be affected by sum: i.e., that
location.charges.sum(:amount)
should be equal tolocation.charges.sum(&:amount)
.I get that you view this as a fix, but I don’t see it that way at all. To me, this has introduced an essentially random behavior that easily can lead to surprise values. My
sum
can now change in many places because I added adistinct
on a join scope somewhere else, which has to be common forhas_many :through
.From my perspective, this “fix” makes the SQL
sum
unpredictable and therefore useless. I am also not satisfied with being told I have to use a more inefficient sum in Ruby or raw SQL - not having to resort to those two options is why I use Rails in the first place.In my opinion, this fix merely cross-contaminates the different aspects of the problem.
Just ran into the same issue. Took me a while to understand that
.distinct.sum(:attr)
will change the way.distinct
works usually.still seems to be an issue here, it this being reopened?
@kamipo thoughts on re-opening this?
Environment
ruby: 3.1.2p20 rails: 7.0.7.2
Steps to reproduce
objects: { id: 0, value: 10.0 }, { id: 1. value: 10.0 }
Object.where(id: [0 ,1]).distinct.sum(:value)
Expected outcome
20.0
Actual outcome
10.0
SQL query:
I didn’t expect that the duplicate value gets eliminated, took me quite some time to figure it out.
I’m sorry, I didn’t really post to respond to you specifically, @ozzyaaron , I should’ve been clear about that.
I agree with the piping principle, but I don’t see
distinct
changing due to backwards compatibility and since there isn’t consensus.So, would it be better to define something new rather than changing distinct?
I would help build, but I’d prefer getting shot down as early as possible 😄
Hmm but in the case that I am proposing, there is an explicit
select
while the currentrelation.distinct.sum(:credit_limit)
is ambiguous since we don’t know what column the distinct is applied to. The current behavior is thatdistinct
is applied to:credit_limit
but people are actually expecting:id
.I’ve yet to look how hard this will be to implement but
relation.distinct(:id).sum(:credit_limit)
feels like more intuitive to me. It is clear which columndistinct
is called on and which column is being sum.why? why this issue actual but is already closed?
In this issue report, I see it is expected that
distinct
is applied to:id
.But I don’t think no one expects
relation.distinct.sum(:credit_limit)
is applied to:credit_limit
which the behavior that existed for a long time.What I said in the latter sentence is that changing behavior to support that is considered as a breaking change for users who expect the current behavior.
Maybe we need a new way like
sum(:credit_limit, distinct: true)
to avoid the ambiguous and deprecate the current behavior.We had some methods like:
To get around this problem (in my view still definitely a bug) we transformed the method without a need for
pluck
:@CodingAnarchy Thank you, I can verify that it works in
v4.2
now. Just one more thing to clarify, do you expect the following to work?That’s what I see in the script you posted initially. Assuming you want to be distinct by both people and charges?
@CodingAnarchy
There is nothing wrong in your example,
When you have three prices [5,12,5] actually you have 2 distinct values [5,12] and sum of them will be 17 exactly.
Maybe there is some misunderstanding about distinct.