rails: ActiveRecord::Relation .size regression with distinct select

This test case works as written in Rails 3.2, but fails on Rails 4.1.4 and master:

unless File.exist?('Gemfile')
  File.write('Gemfile', <<-GEMFILE)
    source 'https://rubygems.org'
    gem 'rails', github: 'rails/rails'
    gem 'arel', github: 'rails/arel'
    gem 'rack', github: 'rack/rack'
    gem 'sqlite3'
  GEMFILE

  system 'bundle'
end

require 'bundler'
Bundler.setup(:default)

require 'active_record'
require 'minitest/autorun'
require 'logger'

# This connection will do for database-independent bug reports.
ActiveRecord::Base.establish_connection(adapter: 'sqlite3', database: ':memory:')
ActiveRecord::Base.logger = Logger.new(STDOUT)

ActiveRecord::Schema.define do
  create_table :posts do |t|
    t.integer :foo
  end
end

class Post < ActiveRecord::Base
end

class BugTest < Minitest::Test
  def test_size_bug
    Post.create! foo: 1
    Post.create! foo: 2
    Post.create! foo: 2

    assert_equal [1, 2], Post.select(:foo).uniq.map(&:foo)
    assert_equal 2, Post.select(:foo).uniq.length
    assert_equal 2, Post.select(:foo).uniq.count
    # Rails 4.1+ selects COUNT(DISTINCT id) instead of COUNT(DISTINCT foo) as above:
    assert_equal 2, Post.select(:foo).uniq.size
  end
end

About this issue

  • Original URL
  • State: closed
  • Created 10 years ago
  • Reactions: 1
  • Comments: 16 (12 by maintainers)

Most upvoted comments

That being said, I’d love to take a swing at implementing this change. If you think this makes sense, just give me the green light 😄

This is hairy. It took me a while to to understand what’s going on. It was broken by the fix of #13648 (in 968c581ea34b5236af14805e6a77913b1cb36238). It seems to me that there is a deeper problem in count. I will now share the results of my “investigation” and then propose a solution.

Some data first

In order to illustrate the problem, let me show the behavior of different combinations of select, distinct and count. I’m using this script to extract the results. Here’s the data:

{
  monday: 'running',
  tuesday: 'biking',
  wednesday: 'swimming',
  thursday: 'running',
  friday: 'biking',
  saturday: nil,
  sunday: nil,
}.each do |day, name|
  4.times do
    Activity.create! day: day, name: name
  end
end

Basically, we’re doing sports for four weeks. We exercise on the week days and take a break in the weekend.

Here’s how the different combinations of select, distinct and count behave. The row label is applied first, followed by the column label. That is, the second cell on the second row is the result of Activity.select(:name).count:

count count(:all) size to_a.size
select(:name) 20 28 28 28
select(:name).distinct 3 28 28 4
all 28 28 28 28
distinct 28 28 28 28
select(:name, :day) ERROR 28 28 28
select(:name, :day).distinct ERROR 28 28 7

Note the two errors – I’ll get back to them later. I’ve put emphasis on the ones related to this issue. Let me start with select(:name).distinct.

.size vs. .to_a.size

Note how .size and .to_a.size return different results for select(:name).distinct. #size returns 28 when the relation is not loaded and 4 when it is. This seems to break an important invariant – #size should return the same result whether loaded or not. To be extra explicit, here’s a test case that will fail:

  def test_size_with_select_distinct
    relation      = Activity.select(:name).distinct
    unloaded_size = relation.size
    loaded_size   = relation.load.size

    assert_equal unloaded_size, loaded_size
  end

I initially thought about modifying #size to do .count instead of .count(:all) for the distinct case only. Unfortunately, that won’t work because of what follows.

The two meanings of select(:name).distinct

The caller of select(:name).distinct might mean two different things:

  1. COUNT(DISTINCT name) - the number of unique not-NULL names performed (3 in this case)
  2. COUNT(*) FROM (SELECT DISTINCT name) - the number of unique names, including the NULL (4 in this case)

Note that select(:name).distinct.count returns 3 (which is the first meaning), but select(:name).distinct.to_a return 4 different records (which is the second):

irb> Activity.select(:name).distinct.count
(0.7ms)  SELECT DISTINCT COUNT(DISTINCT "activities"."name") FROM "activities"
=> 3

irb> Activity.select(:name).distinct.to_a
Activity Load (0.5ms)  SELECT DISTINCT "activities"."name" FROM "activities"
=> [
  #<Activity name: nil>,
  #<Activity name: "swimming">,
  #<Activity name: "running">,
  #<Activity name: "biking">
]

I think both should return 4. That way #count will mean “number of records in the relation” and it will be consistent with .size and .to_a.size. The user can always access the old behavior with Activity.count('DISTINCT name').

To implement this, we’ll need a nested query (the AS is required in PostgreSQL):

SELECT COUNT(*) FROM (SELECT DISTINCT name FROM activities) AS counting;

It might seem uncool, but let’s get back to those errors.

select(:name, :day)

In this case, neither .count nor .distinct.count produce valid queries:

irb>  Activity.select(:name, :day).count
(0.6ms)  SELECT COUNT(name, day) FROM "activities"
PG::UndefinedFunction: ERROR:  function count(character varying, character varying) does not exist
LINE 1: SELECT COUNT(name, day) FROM "activities"
               ^

irb> Activity.select(:name, :day).distinct.count
(0.5ms)  SELECT DISTINCT COUNT(DISTINCT name, day) FROM "activities"
PG::UndefinedFunction: ERROR:  function count(character varying, character varying) does not exist
LINE 1: SELECT DISTINCT COUNT(DISTINCT name, day) FROM "activities"
                           ^

To be fair, the second is a valid query in MySQL, but not in the other databases. So if select(:name, :day).distinct is to be working, it needs to make a nested query:

SELECT COUNT(*) FROM (SELECT DISTINCT name, day FROM "activities") AS counting;

Note that this is the second meaning of select(:name, day).distinct as described previously – counting NULL values. If we want to count only not-NULLs, it will have to be something in the like this:

SELECT COUNT(*)
FROM (SELECT DISTINCT name, day
      FROM "activities") AS counting
WHERE counting.name IS NULL
  AND counting.day IS NULL

But as mentioned earlier, I think it’s better to have distinct count the NULL values.

count vs. count(:all)

Let’s put the distinct issue aside for a bit and take a look at select.count. Currently:

  • select(:name).count returns 20
  • select(:name).to_a.size returns 28
  • select(:name).size returns 28
  • select(:name).count(:all) return 28
  • select(:name, :age).count raises an error
  • select(:name, :age).count(:all) returns 28

Probably not what one would expect.

I can think of only one reason to have that – allowing the users to do Activity.select(:name).count and mean “exclude NULL values”. I don’t think this is a good idea, because (1) they can already do it with .count('name') or .count('DISTINCT name') and (2) it can’t be generalized for multiple columns.

I’m proposing to get rid of this functionality and have count() behave always as count(:all). That way we can get rid of count(:all) as a special case.

In summary

Here’s a list of the proposed changes:

  • Have select(:name).distinct.count count NULL values.
  • Have select(:name).distinct.count be implemented as a sub-query.
  • Remove the functionality of .count() to include not-NULL values specified in .select()
  • Have users do .count('name') or .count('DISTINCT name') instead.
  • Get rid of the distinction between count and count(:all).

This will lead to the following table:

count count(:all) size to_a.size
select(:name) 28 28 28 28
select(:name).distinct 4 4 4 4
all 28 28 28 28
distinct 28 28 28 28
select(:name, :day) 28 28 28 28
select(:name, :day).distinct 7 7 7 7

The values that will change are in bold. Note how each row has the same value, which I think is pretty neat.

It’s a breaking change, but I think it makes things consistent.


That being said, I’d love to take a swing at implementing this. If you think this makes sense, just give me the green light 😄

/cc @senny

Fixed by c6cd9a59f200863ccfe8ad1d9c5a8876c39b9c5c.