rails: ActiveRecord distinct with sum produces unexpected results

Steps to reproduce

class Foo < ActiveRecord::Base
  (has "price" field)
end

Foo.create(price: 5)
Foo.create(price: 12)
Foo.create(price: 5)

total = Foo.distinct.sum(:price)

Expected behavior

Returned value should be the sum of all prices, as there are three distinct foo records. (total == 22)

Actual behavior

Returned value is the sum of all distinct prices. (total == 17)

System configuration

Rails version: 5.2.0

Ruby version: 2.5.1

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 7
  • Comments: 48 (29 by maintainers)

Most upvoted comments

I feel it’s only ambiguous in theory as one expects to be able to read a statement from left to right. Almost all ruby code I can think of adheres to that.

To me, this feels like common sense:

relation.distinct => distinct relation by id relation.distinct.sum(:credit_limit) => sum of credit_limit of the rows given by the statement above relation.distinct(:credit_limit).sum(:credit_limit) => sum of distinct credit limits

If you would rethink this from scratch, wouldn’t you want this kind of api? Why would someone think the second statement will return the sum of distinct credit_limits? I suppose that would only be the case for people familiar with this issue.

Another solution may be to require the distinct call to have a column name param. That way every call is explicit (and would, just like the examples above, allow to be read from left to right).

Just ran into this issue today while upgrading from Rails 4.2.

For what it’s worth, I’m with @CodingAnarchy. I do understand the logic behind the conversation in this thread. I would still like to add my point of view as well: the current way is counter intuitive to the point that even an error would need to be raised. Or … the following reasoning is wrong and I should change several parts of our codebase and try to think about these relations and my life choices 😃 in a different way:

  • company.line_items is supposed to return the line_items of the company. If needed, a distinct is added to the association to make sure the statement returns unique line_items. The distinct part is hidden as an implementation detail and does not leak to other parts of the application to make sure company.line_items always returns the intuitive records.
  • company.line_items.sum(:qty) is supposed to sum the qty’s of the line_items that would otherwise be returned by company.line_items. Thus company.line_items.sum(:qty) and company.line_items.sum(&:qty) should return the same amount.
  • Even without the hidden distinct call the intuitive meaning of company.line_items.distinct.sum(:qty) would be the sum of the qties of the results returned by company.line_items.distinct.

If this isn’t possible, I’m almost akin to state that an error should be raised.

(Calculating the sum of distinct qties is a more exotic requirement. Intuitively I would suggest: company.line_items.sum("DISTINCT qty").)

@darwin67 would you build this another way?

In the real-world scenario, this was the result:

company.line_items.sum(:amount)
=> 4419.9
company.line_items.distinct(false).sum(:amount)
=> 143424.3
company.line_items.sum(&:amount)
=> 56762.3

This should not be a “gotcha” as it is now, and I hope the AR team comes around to resolving it to change the SQL generated to combine these more appropriately.

you were lucky an unintended behavior worked for you, and it’s unfortunate that the correct behavior broke your use case

I think it broke a lot of people’s use cases, and the intuitive behavior of the methodology. In this particular case, if the DISTINCT SUM([field]) was the wrong behavior, then it should have explicitly been redirected to do a GROUP BY id to get the concept across. Or even automatically implementing a subquery would be preferable.

I can see why the SQL only happened to work in my case, or at least appeared to, and was otherwise broken, but it isn’t “fixed” either, in the sense that the intuitive result of foos.distinct.sum(:bar) is not the sum of all bars on distinct foos.

In general it shouldn’t, but in the case of combining a COUNT or SUM with DISTINCT, it should since there’s no way the query builder can tell what you are trying to do with the values.

I disagree. I would find this to be a trivial thing that we should be able to understand.

foos.distinct.sum(:bar) should give me the sum of that bar field on all the distinct records in the collection.

No it isn’t (or shouldn’t) be a common thing to use DISTINCT in the first place even without nesting queries or doing joins.

has_many :through can often return multiple join records. Joins are commonplace, and there are numerous locations in multiple apps that I have used a distinct to get singular records for an item. Particularly common is the foos.join(:bars).distinct pattern to get “all foos with attached bars”.

See:

https://stackoverflow.com/questions/5846638/how-to-display-unique-records-from-a-has-many-through-relationship

https://www.lugolabs.com/articles/distinct-active-records-in-has_many-through-associations

Let’s say this was working like you wanted and ActiveRecord somehow figures that you want your return value to be distinct by id, the query below will likely be what it generates

The example you gave is not necessarily the requirement of the query builder. If the query builder can identify that I have used both distinct and sum, it could build a query that actually gives me the sum of the distinct records. Even if it does it as a SELECT SUM(bar) FROM ([previously built DISTINCT query]).

First, I’m sorry it doesn’t fully fit your needs, but an abstraction layer will always have that problem of not being able to scratch all of your itchy places.

It’s not merely that it doesn’t fit my needs. Let me explain in a hypothetical scenario. Let’s say I have some has_many :bars association on a Foo class. This works just fine. Now I want to add a has_many :bazes, through: :bars. Great. But when I get foo.bazes, I could get duplicate records that show up on my use of the collection. That’s not good, but it’s easy to solve - I put -> { distinct } in the association. Now it will give me exactly one instance of each baz, even if it is associated with multiple bars. But now, suddenly, my DB sums (and counts, and averages, etc.) are giving me completely different answers!

have to utilize the more inefficient solutions somewhere

I recognize that there are some tradeoffs to be made, but in this case, there is essentially no use case where I would want to combine a distinct on some association or join with sum in this manner. I further see no reason why foos.distinct.sum(:bar) should not give me the “sum over the distinct records”. It seems further trivial to do a foos.sum("DISTINCT foos.bar") if I actually wanted the other for some reason, and no real reason why a nicer syntax could not be added to the query builder for that - something like foos.sum(distinct: :bar) or foos.sum(:bar, distinct: true).

Second, Rails is maintained and improved by the public, so you can always submit a pull request for changes if you’re not satisfied.

This would entail a lot of work to remove the change and make it work the way I think it should, and if I can’t even get y’all here to see the problem, it’s not something I would deem worth doing just to get it shot down.

It will generate 2 different queries but databases should (hopefully) be smart enough to cache that value so it doesn’t end up querying it multiple times.

If you did it without pluck (say select(:id) instead), it would just generate a subquery. But this is besides the point. I shouldn’t need a workaround to avoid using a framework method because that method can suddenly change to do wildly different behavior depending on whether it is chained with another method that might not even be in the same part of the code.

Essentially, as a result of this issue, I am concluding (a) never use distinct (use group_by instead?), (b) never use the DB sum, © forget Rails and the DB and do everything in Ruby if at all possible. And you’re telling me that “well, no framework is perfect”. To me, these are serious shortfalls introduced by this change.

And it’s not merely that sum is unintuitive, but that it drastically changes behavior based on other code around it.

Just ran into this as well … very surprising and annoying.

What in the world is even the use case for summing only distinct fields?

The use case for summing a field over a distinct set of records is pretty clear and is what I would expect from reading Model.distinct.sum(field)

I think others have already summarized well. I just wanted to add another take and see if it works to convince the maintainers. I also wouldn’t want this marked as stale or anything.

I’m unsure why it is up for debate that calling distinct on the relation is expected to do anything but return a distinct set of that relation and then calling any aggregate calculation on that set would perform the calculation on the distinct set.

I think the above follows set theory, as well as functional theory (piping results).

To call distinct on the calculated element of the new set to break with reasonable expectations and any math and functional theory I’ve seen.

I would expect people that want to actually perform a distinct on the calculation to do something like relation.sum("DISTINCT element")

I’m sure others have worked around this themselves already but we have resorted to using pluck e.g. scope.distinct.pluck("SUM(...)").first you may need to unscope some things (typically we unscope(:order)).

Anyway, as always this is all free so no anger or nothing, it just seems like a strange choice to dig in on keeping the current behaviour. The current behaviour so clearly seems like a bug.

In this issue report, I see it is expected that distinct is applied to :id.

I would actually have expected that relation.distinct.sum(:credit_limit) would produce something like:

SELECT SUM(credit_limit) FROM (SELECT DISTINCT * FROM [relation]) as distinct_relation

with that then optimized for the particular database you are actually working with.

I can only say that the SQL query in 4.2.x did what I expected (at least on Postgres) and did indeed give me the sum of distinct records in my table.

Furthermore, I would expect that the chaining of scopes would not be affected by sum: i.e., that location.charges.sum(:amount) should be equal to location.charges.sum(&:amount).

I get that you view this as a fix, but I don’t see it that way at all. To me, this has introduced an essentially random behavior that easily can lead to surprise values. My sum can now change in many places because I added a distinct on a join scope somewhere else, which has to be common for has_many :through.

From my perspective, this “fix” makes the SQL sum unpredictable and therefore useless. I am also not satisfied with being told I have to use a more inefficient sum in Ruby or raw SQL - not having to resort to those two options is why I use Rails in the first place.

In my opinion, this fix merely cross-contaminates the different aspects of the problem.

Just ran into the same issue. Took me a while to understand that .distinct.sum(:attr) will change the way .distinct works usually.

still seems to be an issue here, it this being reopened?

@kamipo thoughts on re-opening this?

Environment

ruby: 3.1.2p20 rails: 7.0.7.2

Steps to reproduce

objects: { id: 0, value: 10.0 }, { id: 1. value: 10.0 }

Object.where(id: [0 ,1]).distinct.sum(:value)

Expected outcome

20.0

Actual outcome

10.0

SQL query:

SELECT SUM(DISTINCT "objects"."value") FROM "objects" WHERE "objects"."id" IN ($1, $2)  [["id", 0], ["id", 1]]

I didn’t expect that the duplicate value gets eliminated, took me quite some time to figure it out.

I’m sorry, I didn’t really post to respond to you specifically, @ozzyaaron , I should’ve been clear about that.

I agree with the piping principle, but I don’t see distinct changing due to backwards compatibility and since there isn’t consensus.

So, would it be better to define something new rather than changing distinct?

has_many_distinct ..?  # noone would not read this as distinct records.
has_many, -> { distinct_records }, ...

I would help build, but I’d prefer getting shot down as early as possible 😄

Hmm but in the case that I am proposing, there is an explicit select while the current relation.distinct.sum(:credit_limit) is ambiguous since we don’t know what column the distinct is applied to. The current behavior is that distinct is applied to :credit_limit but people are actually expecting :id.

I’ve yet to look how hard this will be to implement but relation.distinct(:id).sum(:credit_limit) feels like more intuitive to me. It is clear which column distinct is called on and which column is being sum.

why? why this issue actual but is already closed?

The current behavior is that distinct is applied to :credit_limit but people are actually expecting :id.

In this issue report, I see it is expected that distinct is applied to :id.

But I don’t think no one expects relation.distinct.sum(:credit_limit) is applied to :credit_limit which the behavior that existed for a long time.

But if the case of relation.distinct.sum(:credit_limit) (we don’t know if a relation has select values or not), a user may expect the current behavior, changing to make the subquery will be considered a breaking change.

What I said in the latter sentence is that changing behavior to support that is considered as a breaking change for users who expect the current behavior.

Maybe we need a new way like sum(:credit_limit, distinct: true) to avoid the ambiguous and deprecate the current behavior.

We had some methods like:

def self.total_qty
  sum(:qty)
end

To get around this problem (in my view still definitely a bug) we transformed the method without a need for pluck:

def self.total_qty
  unscoped.where(id: select(:id)).sum(:qty)
end

@CodingAnarchy Thank you, I can verify that it works in v4.2 now. Just one more thing to clarify, do you expect the following to work?

person_a = Person.create!(charges: [charge_a])
person_b = Person.create!(charges: [charge_a, charge_c])

# => amt * 2 + second_amt == location.charges.sum(:amount) should be true?
person_a = Person.create!(charges: [charge_a, charge_b])
person_b = Person.create!(charges: [charge_a, charge_c])

# => amt * 3 + second_amt == location.charges.sum(:amount) should be true?

That’s what I see in the script you posted initially. Assuming you want to be distinct by both people and charges?

@CodingAnarchy

There is nothing wrong in your example,

When you have three prices [5,12,5] actually you have 2 distinct values [5,12] and sum of them will be 17 exactly.

Maybe there is some misunderstanding about distinct.