rails: Running multiple migrations simultaneously can cause race conditions (on Postgres, probably on others)
Currently it seems that the rake db:migrate
task does not try to get any kind of exclusive ‘migrator’ lock on the database (at least for Postgres, I haven’t tested for other databases).
This results in a problem where running multiple migrations simultaneously from different threads/machines can cause undefined behavior or migration failures. This is especially a problem with Rails deployments to multi-instance clusters.
Here is a quick 'n dirty example case where we simply add a column:
class TestDummy < ActiveRecord::Migration
def change
add_index :review_requests, :url_code, using: :hash, name: 'hash_index_review_requests_on_url_code'
end
end
Now let’s run this in three simultaneous threads (simulating a deploy to three instances simultaneously) and see what happens:
=>$ for run in {1..3}
do
bundle exec rake db:migrate &
done
[2] 88935
[3] 88936
[4] 88937
=>$ == 20151027212412 TestDummy: migrating ========================================
-- add_index(:review_requests, :url_code, {:using=>:hash, :name=>"hash_index_review_requests_on_url_code"})
== 20151027212412 TestDummy: migrating ========================================
-- add_index(:review_requests, :url_code, {:using=>:hash, :name=>"hash_index_review_requests_on_url_code"})
== 20151027212412 TestDummy: migrating ========================================
-- add_index(:review_requests, :url_code, {:using=>:hash, :name=>"hash_index_review_requests_on_url_code"})
-> 1.1617s
== 20151027212412 TestDummy: migrated (1.1619s) ===============================
rake aborted!
StandardError: An error has occurred, this and all later migrations canceled:
PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "pg_class_relname_nsp_index"
DETAIL: Key (relname, relnamespace)=(hash_index_review_requests_on_url_code, 5605052) already exists.
: CREATE INDEX "hash_index_review_requests_on_url_code" ON "review_requests" USING hash ("url_code")/Users/sam/code/fosubo/fosubo-app/db/migrate/20151027212412_test_dummy.rb:3:in `change'
ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "pg_class_relname_nsp_index"
DETAIL: Key (relname, relnamespace)=(hash_index_review_requests_on_url_code, 5605052) already exists.
: CREATE INDEX "hash_index_review_requests_on_url_code" ON "review_requests" USING hash ("url_code")
rake aborted!
StandardError: An error has occurred, this and all later migrations canceled:
PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "pg_class_relname_nsp_index"
DETAIL: Key (relname, relnamespace)=(hash_index_review_requests_on_url_code, 5605052) already exists.
: CREATE INDEX "hash_index_review_requests_on_url_code" ON "review_requests" USING hash ("url_code")/Users/sam/code/fosubo/fosubo-app/db/migrate/20151027212412_test_dummy.rb:3:in `change'
PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "pg_class_relname_nsp_index"
DETAIL: Key (relname, relnamespace)=(hash_index_review_requests_on_url_code, 5605052) already exists.
/Users/sam/code/fosubo/fosubo-app/db/migrate/20151027212412_test_dummy.rb:3:in `change'
ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "pg_class_relname_nsp_index"
DETAIL: Key (relname, relnamespace)=(hash_index_review_requests_on_url_code, 5605052) already exists.
: CREATE INDEX "hash_index_review_requests_on_url_code" ON "review_requests" USING hash ("url_code")
/Users/sam/code/fosubo/fosubo-app/db/migrate/20151027212412_test_dummy.rb:3:in `change'
Tasks: TOP => db:migrate
(See full trace by running task with --trace)
/Users/sam/code/fosubo/fosubo-app/db/migrate/20151027212412_test_dummy.rb:3:in `change'
PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "pg_class_relname_nsp_index"
DETAIL: Key (relname, relnamespace)=(hash_index_review_requests_on_url_code, 5605052) already exists.
/Users/sam/code/fosubo/fosubo-app/db/migrate/20151027212412_test_dummy.rb:3:in `change'
Tasks: TOP => db:migrate
(See full trace by running task with --trace)
Result: disaster, two of our deploys failed.
In this particular case it wouldn’t cause much damage, just some exceptions. But we can imagine scenarios where running a migration more than once could cause serious issues, for example:
class SecondTestDummy < ActiveRecord::Migration
# John has too much money in his account because of the bug we recently fixed, make sure to reset it.
def change
john = User.find(1)
john.money -= 100
john.save!
end
end
The expectation for migrations is that they get run only once. However, if we ran repeated the same process with the migration above, it is possible that John could end up with -200 or -300 money, rather than -100 as intended.
Personally, I can’t imagine a scenario where you would EVER want to allow more than one migration to run at a time.
I propose a fix to this which would enforce that rule by wrapping each migration with exclusive table locking on the schema_migrations
table. If another migration is running, the process will have to wait until it can get a lock before running.
This would solve the above case because only one migrating process would get the lock first. It would then apply the migration and add the new schema version before releasing the lock. Now a new process takes the lock and sees that the migration has already been applied, so it does nothing instead of attempting to add an index that already exists.
If someone can confirm to me this is the right approach, I would be happy to write a patch.
About this issue
- Original URL
- State: closed
- Created 9 years ago
- Comments: 17 (7 by maintainers)
Commits related to this issue
- Add ACCESS EXCLUSIVE table locking for migrations - Addresses issue #22092 - Currently dependent on adapter, only postgresql supports it because it's the only one that supports DDL transactions — committed to samsondav/rails by deleted user 9 years ago
- Use advisory locks to prevent concurrent migrations - Addresses issue #22092 - Works on Postgres and MySQL - Uses advisory locks because of two important properties: 1. The can be obtained outside ... — committed to samsondav/rails by deleted user 9 years ago
Yes, they are running the same migrations against the same database.
And yes, the solution is to run it on only one machine, but there is another problem with doing this - if you run migrations on one machine and skip it on the others, the other instances are likely to restart their servers before the migration has completed resulting in an out of date schema cache.
The workaround is to do a rolling deploy, where you enable migrations on all instances, and fully complete the deploy including the migration on each instance before moving onto the next one. Actually I even wrote a gem to do this: https://github.com/fosubo/opworks_interactor
But this means that deploys are much slower than they need to be. And most people are not going to bother with this kind of solution (or even understand why it might be an issue).
Regardless, running the migration on all instances is the default mode for Rails apps deployed to Amazon Opsworks so I would imagine this is probably a fairly widespread problem.
The correct way to do migrations is to enforce isolation using database locks.
As developers we would like to have confidence that a migration will be run once and exactly once, regardless of the deployment method.
@rafaelfranca You are correct, in an ideal world the developer would do work to ensure only one migration process is run at the same time.
However in our case, this is is actually a common problem that we run into on a regular basis while deploying to Amazon Opsworks, where migrations are run by default on multiple machines simultaneously when deploying.
And in fact, anybody who deploys to multiple machines simultaneously and runs migrations naively risks running into this problem. It’s made all the more insidious because this will be an intermittent problem. It MAY work fine many times and then bite you at random due to timing differences on the machines.
I certainly think it would be good engineering to enforce a migration lock in any case just to be sure only one process can run it. We can add a timeout in case one of the other migrations is really taking a long time (sensible default would probably be something like 5 or 10 minutes).
@saurabhbhatia yes, we are using the default recipe. The default recipe runs migrations on all machines. You probably haven’t noticed issues because your migrations complete fairly quickly.
We have some migrations for large tables (500,000+ rows) that take a long time, and this is where the race is triggered.
@rafaelfranca PR is here: https://github.com/rails/rails/pull/22103
Is the multiple machines running the same set of migrations against the same database? If that is the case does not make sense to run the migrations in only one machine. That way you would never be in this case.