rails: ActiveRecord::Relation .size regression with distinct select
This test case works as written in Rails 3.2, but fails on Rails 4.1.4 and master:
unless File.exist?('Gemfile')
File.write('Gemfile', <<-GEMFILE)
source 'https://rubygems.org'
gem 'rails', github: 'rails/rails'
gem 'arel', github: 'rails/arel'
gem 'rack', github: 'rack/rack'
gem 'sqlite3'
GEMFILE
system 'bundle'
end
require 'bundler'
Bundler.setup(:default)
require 'active_record'
require 'minitest/autorun'
require 'logger'
# This connection will do for database-independent bug reports.
ActiveRecord::Base.establish_connection(adapter: 'sqlite3', database: ':memory:')
ActiveRecord::Base.logger = Logger.new(STDOUT)
ActiveRecord::Schema.define do
create_table :posts do |t|
t.integer :foo
end
end
class Post < ActiveRecord::Base
end
class BugTest < Minitest::Test
def test_size_bug
Post.create! foo: 1
Post.create! foo: 2
Post.create! foo: 2
assert_equal [1, 2], Post.select(:foo).uniq.map(&:foo)
assert_equal 2, Post.select(:foo).uniq.length
assert_equal 2, Post.select(:foo).uniq.count
# Rails 4.1+ selects COUNT(DISTINCT id) instead of COUNT(DISTINCT foo) as above:
assert_equal 2, Post.select(:foo).uniq.size
end
end
About this issue
- Original URL
- State: closed
- Created 10 years ago
- Reactions: 1
- Comments: 16 (12 by maintainers)
That being said, I’d love to take a swing at implementing this change. If you think this makes sense, just give me the green light 😄
This is hairy. It took me a while to to understand what’s going on. It was broken by the fix of #13648 (in 968c581ea34b5236af14805e6a77913b1cb36238). It seems to me that there is a deeper problem in
count
. I will now share the results of my “investigation” and then propose a solution.Some data first
In order to illustrate the problem, let me show the behavior of different combinations of
select
,distinct
andcount
. I’m using this script to extract the results. Here’s the data:Basically, we’re doing sports for four weeks. We exercise on the week days and take a break in the weekend.
Here’s how the different combinations of
select
,distinct
andcount
behave. The row label is applied first, followed by the column label. That is, the second cell on the second row is the result ofActivity.select(:name).count
:count
count(:all)
size
to_a.size
select(:name)
select(:name).distinct
all
distinct
select(:name, :day)
select(:name, :day).distinct
Note the two errors – I’ll get back to them later. I’ve put emphasis on the ones related to this issue. Let me start with
select(:name).distinct
..size
vs..to_a.size
Note how
.size
and.to_a.size
return different results forselect(:name).distinct
.#size
returns 28 when the relation is not loaded and 4 when it is. This seems to break an important invariant –#size
should return the same result whether loaded or not. To be extra explicit, here’s a test case that will fail:I initially thought about modifying
#size
to do.count
instead of.count(:all)
for thedistinct
case only. Unfortunately, that won’t work because of what follows.The two meanings of
select(:name).distinct
The caller of
select(:name).distinct
might mean two different things:COUNT(DISTINCT name)
- the number of unique not-NULL
names performed (3 in this case)COUNT(*) FROM (SELECT DISTINCT name)
- the number of unique names, including theNULL
(4 in this case)Note that
select(:name).distinct.count
returns 3 (which is the first meaning), butselect(:name).distinct.to_a
return 4 different records (which is the second):I think both should return 4. That way
#count
will mean “number of records in the relation” and it will be consistent with.size
and.to_a.size
. The user can always access the old behavior withActivity.count('DISTINCT name')
.To implement this, we’ll need a nested query (the
AS
is required in PostgreSQL):It might seem uncool, but let’s get back to those errors.
select(:name, :day)
In this case, neither
.count
nor.distinct.count
produce valid queries:To be fair, the second is a valid query in MySQL, but not in the other databases. So if
select(:name, :day).distinct
is to be working, it needs to make a nested query:Note that this is the second meaning of
select(:name, day).distinct
as described previously – countingNULL
values. If we want to count only not-NULL
s, it will have to be something in the like this:But as mentioned earlier, I think it’s better to have
distinct
count theNULL
values.count
vs.count(:all)
Let’s put the distinct issue aside for a bit and take a look at
select.count
. Currently:select(:name).count
returns 20select(:name).to_a.size
returns 28select(:name).size
returns 28select(:name).count(:all)
return 28select(:name, :age).count
raises an errorselect(:name, :age).count(:all)
returns 28Probably not what one would expect.
I can think of only one reason to have that – allowing the users to do
Activity.select(:name).count
and mean “excludeNULL
values”. I don’t think this is a good idea, because (1) they can already do it with.count('name')
or.count('DISTINCT name'
) and (2) it can’t be generalized for multiple columns.I’m proposing to get rid of this functionality and have
count()
behave always ascount(:all)
. That way we can get rid ofcount(:all)
as a special case.In summary
Here’s a list of the proposed changes:
select(:name).distinct.count
countNULL
values.select(:name).distinct.count
be implemented as a sub-query..count()
to include not-NULL
values specified in.select()
.count('name')
or.count('DISTINCT name')
instead.count
andcount(:all)
.This will lead to the following table:
count
count(:all)
size
to_a.size
select(:name)
select(:name).distinct
all
distinct
select(:name, :day)
select(:name, :day).distinct
The values that will change are in bold. Note how each row has the same value, which I think is pretty neat.
It’s a breaking change, but I think it makes things consistent.
That being said, I’d love to take a swing at implementing this. If you think this makes sense, just give me the green light 😄
/cc @senny
Fixed by c6cd9a59f200863ccfe8ad1d9c5a8876c39b9c5c.