rails: ActiveRecord::Relation .size regression with distinct select
This test case works as written in Rails 3.2, but fails on Rails 4.1.4 and master:
unless File.exist?('Gemfile')
File.write('Gemfile', <<-GEMFILE)
source 'https://rubygems.org'
gem 'rails', github: 'rails/rails'
gem 'arel', github: 'rails/arel'
gem 'rack', github: 'rack/rack'
gem 'sqlite3'
GEMFILE
system 'bundle'
end
require 'bundler'
Bundler.setup(:default)
require 'active_record'
require 'minitest/autorun'
require 'logger'
# This connection will do for database-independent bug reports.
ActiveRecord::Base.establish_connection(adapter: 'sqlite3', database: ':memory:')
ActiveRecord::Base.logger = Logger.new(STDOUT)
ActiveRecord::Schema.define do
create_table :posts do |t|
t.integer :foo
end
end
class Post < ActiveRecord::Base
end
class BugTest < Minitest::Test
def test_size_bug
Post.create! foo: 1
Post.create! foo: 2
Post.create! foo: 2
assert_equal [1, 2], Post.select(:foo).uniq.map(&:foo)
assert_equal 2, Post.select(:foo).uniq.length
assert_equal 2, Post.select(:foo).uniq.count
# Rails 4.1+ selects COUNT(DISTINCT id) instead of COUNT(DISTINCT foo) as above:
assert_equal 2, Post.select(:foo).uniq.size
end
end
About this issue
- Original URL
- State: closed
- Created 10 years ago
- Reactions: 1
- Comments: 16 (12 by maintainers)
That being said, I’d love to take a swing at implementing this change. If you think this makes sense, just give me the green light 😄
This is hairy. It took me a while to to understand what’s going on. It was broken by the fix of #13648 (in 968c581ea34b5236af14805e6a77913b1cb36238). It seems to me that there is a deeper problem in
count. I will now share the results of my “investigation” and then propose a solution.Some data first
In order to illustrate the problem, let me show the behavior of different combinations of
select,distinctandcount. I’m using this script to extract the results. Here’s the data:Basically, we’re doing sports for four weeks. We exercise on the week days and take a break in the weekend.
Here’s how the different combinations of
select,distinctandcountbehave. The row label is applied first, followed by the column label. That is, the second cell on the second row is the result ofActivity.select(:name).count:countcount(:all)sizeto_a.sizeselect(:name)select(:name).distinctalldistinctselect(:name, :day)select(:name, :day).distinctNote the two errors – I’ll get back to them later. I’ve put emphasis on the ones related to this issue. Let me start with
select(:name).distinct..sizevs..to_a.sizeNote how
.sizeand.to_a.sizereturn different results forselect(:name).distinct.#sizereturns 28 when the relation is not loaded and 4 when it is. This seems to break an important invariant –#sizeshould return the same result whether loaded or not. To be extra explicit, here’s a test case that will fail:I initially thought about modifying
#sizeto do.countinstead of.count(:all)for thedistinctcase only. Unfortunately, that won’t work because of what follows.The two meanings of
select(:name).distinctThe caller of
select(:name).distinctmight mean two different things:COUNT(DISTINCT name)- the number of unique not-NULLnames performed (3 in this case)COUNT(*) FROM (SELECT DISTINCT name)- the number of unique names, including theNULL(4 in this case)Note that
select(:name).distinct.countreturns 3 (which is the first meaning), butselect(:name).distinct.to_areturn 4 different records (which is the second):I think both should return 4. That way
#countwill mean “number of records in the relation” and it will be consistent with.sizeand.to_a.size. The user can always access the old behavior withActivity.count('DISTINCT name').To implement this, we’ll need a nested query (the
ASis required in PostgreSQL):It might seem uncool, but let’s get back to those errors.
select(:name, :day)In this case, neither
.countnor.distinct.countproduce valid queries:To be fair, the second is a valid query in MySQL, but not in the other databases. So if
select(:name, :day).distinctis to be working, it needs to make a nested query:Note that this is the second meaning of
select(:name, day).distinctas described previously – countingNULLvalues. If we want to count only not-NULLs, it will have to be something in the like this:But as mentioned earlier, I think it’s better to have
distinctcount theNULLvalues.countvs.count(:all)Let’s put the distinct issue aside for a bit and take a look at
select.count. Currently:select(:name).countreturns 20select(:name).to_a.sizereturns 28select(:name).sizereturns 28select(:name).count(:all)return 28select(:name, :age).countraises an errorselect(:name, :age).count(:all)returns 28Probably not what one would expect.
I can think of only one reason to have that – allowing the users to do
Activity.select(:name).countand mean “excludeNULLvalues”. I don’t think this is a good idea, because (1) they can already do it with.count('name')or.count('DISTINCT name') and (2) it can’t be generalized for multiple columns.I’m proposing to get rid of this functionality and have
count()behave always ascount(:all). That way we can get rid ofcount(:all)as a special case.In summary
Here’s a list of the proposed changes:
select(:name).distinct.countcountNULLvalues.select(:name).distinct.countbe implemented as a sub-query..count()to include not-NULLvalues specified in.select().count('name')or.count('DISTINCT name')instead.countandcount(:all).This will lead to the following table:
countcount(:all)sizeto_a.sizeselect(:name)select(:name).distinctalldistinctselect(:name, :day)select(:name, :day).distinctThe values that will change are in bold. Note how each row has the same value, which I think is pretty neat.
It’s a breaking change, but I think it makes things consistent.
That being said, I’d love to take a swing at implementing this. If you think this makes sense, just give me the green light 😄
/cc @senny
Fixed by c6cd9a59f200863ccfe8ad1d9c5a8876c39b9c5c.