gh-ost: ERROR Timeout while waiting for events up to lock
We have a fairly large table that is busy at times, where we are trying to add column xxx int null. Everything looks fine until the cut-over phase: it hits an error, retries for a while, but never succeeds. I have checked other issues here that mention the same error but haven't found anything that helps.
The binlog is not stuck; it keeps applying and rotating.
This is on RDS Aurora with a standard setup. We haven't changed any parameters, but we are starting to suspect that some DB parameter might need to be changed; we're just not sure which one.
We are running the latest master and have tried various command flags, for example --cut-over-exponential-backoff, as well as increasing --cut-over-lock-timeout-seconds and --default-retries.
This is the full command right now:
gh-ost \
  --user=$user \
  --password=$password \
  --host=$host \
  --database=$database \
  --table=$table \
  --alter="$alter" \
  --max-load=Threads_running=25 \
  --critical-load=Threads_running=1000 \
  --chunk-size=1000 \
  --max-lag-millis=1500 \
  --verbose \
  --alter="engine=innodb" \
  --allow-master-master \
  --exact-rowcount \
  --concurrent-rowcount \
  --default-retries=120 \
  --allow-on-master \
  --initially-drop-ghost-table \
  --initially-drop-old-table \
  --cut-over-lock-timeout-seconds=10 \
  --assume-rbr \
  --cut-over-exponential-backoff \
  --execute
Copy: 79876601/79876601 100.0%; Applied: 16344; Backlog: 0/1000; Time: 3h42m10s(total), 2h20m24s(copy); streamer: mysql-bin-changelog.013410:1363370; Lag: 0.01s, State: migrating; ETA: due
2019-11-06 14:04:27 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:27 ERROR 2019-11-06 14:04:27 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:27 INFO Looking for magic cut-over table
2019-11-06 14:04:27 INFO Will now proceed to drop magic table and unlock tables
2019-11-06 14:04:27 INFO Dropping magic cut-over table
2019-11-06 14:04:27 INFO Dropping magic cut-over table
2019-11-06 14:04:27 INFO Dropping table `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Table dropped
2019-11-06 14:04:27 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Tables unlocked
2019-11-06 14:04:27 INFO Grabbing voluntary lock: gh-ost.428716346.lock
2019-11-06 14:04:27 INFO Setting LOCK timeout as 20 seconds
2019-11-06 14:04:27 INFO Looking for magic cut-over table
2019-11-06 14:04:27 INFO Creating magic cut-over table `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Magic cut-over table created
2019-11-06 14:04:27 INFO Locking `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Tables locked
2019-11-06 14:04:27 INFO Session locking original & magic tables is 428716346
2019-11-06 14:04:27 INFO Writing changelog state: AllEventsUpToLockProcessed:1573049067764852856
2019-11-06 14:04:27 INFO Waiting for events up to lock
Copy: 79876601/79876601 100.0%; Applied: 16344; Backlog: 0/1000; Time: 3h42m15s(total), 2h20m24s(copy); streamer: mysql-bin-changelog.013410:16394106; Lag: 0.01s, State: migrating; ETA: due
Copy: 79876601/79876601 100.0%; Applied: 16344; Backlog: 0/1000; Time: 3h42m20s(total), 2h20m24s(copy); streamer: mysql-bin-changelog.013410:32408062; Lag: 0.01s, State: migrating; ETA: due
2019-11-06 14:04:37 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:37 ERROR 2019-11-06 14:04:37 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:37 INFO Looking for magic cut-over table
2019-11-06 14:04:37 INFO Will now proceed to drop magic table and unlock tables
2019-11-06 14:04:37 INFO Dropping magic cut-over table
2019-11-06 14:04:37 INFO Dropping magic cut-over table
2019-11-06 14:04:37 INFO Dropping table `xxx`.`_xxx_del`
2019-11-06 14:04:37 INFO Table dropped
2019-11-06 14:04:37 INFO Removing socket file: /tmp/gh-ost.xxx.xxx.sock
2019-11-06 14:04:37 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-11-06 14:04:37 FATAL 2019-11-06 14:04:37 ERROR Timeout while waiting for events up to lock
My company ran into the exact same problem while using gh-ost to run migrations against our Aurora clusters. After a lot of deep diving, we finally put together this PR, which makes the problem visible and protects against cut-over failures.
Now when we run migrations against Aurora, we can see that the value of HeartbeatLag grows larger and larger during a long-running migration on a busy table. By the time we get to the cut-over stage, HeartbeatLag can be minutes. This means there is absolutely no chance that cut-over could succeed; instead, it will time out after 3 seconds with ERROR Timeout while waiting for events up to lock, and then retry up to 60 times.
In the PR we have submitted, we simply wait before cutOver for HeartbeatLag to come down to a safe level before continuing. Eventually, HeartbeatLag drops far enough to proceed, and cutOver succeeded on the first try.
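To make the approach concrete, here is a minimal Go sketch of the idea described above, not the code from the PR itself: poll a heartbeat-lag measurement and hold off the cut-over attempt until the lag falls below a threshold, or give up after a deadline. The waitForLowHeartbeatLag helper, the measure callback, and the thresholds are hypothetical stand-ins for however gh-ost tracks the age of the last replicated changelog heartbeat.

```go
package main

import (
	"fmt"
	"time"
)

// waitForLowHeartbeatLag blocks until the measured heartbeat lag drops below
// maxLag, polling every pollInterval, and gives up after timeout. The point is
// that attempting cut-over while the binlog stream is minutes behind can never
// finish inside the short cut-over lock window, so it is cheaper to wait first.
func waitForLowHeartbeatLag(measure func() time.Duration, maxLag, pollInterval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		lag := measure()
		if lag <= maxLag {
			return nil // lag is within bounds; safe to attempt cut-over
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("heartbeat lag still %v after %v; refusing cut-over", lag, timeout)
		}
		fmt.Printf("heartbeat lag %v exceeds %v; delaying cut-over\n", lag, maxLag)
		time.Sleep(pollInterval)
	}
}

func main() {
	// Simulated lag measurement that shrinks over time; in gh-ost this would be
	// derived from how old the last replicated changelog heartbeat is.
	simulated := 5 * time.Second
	measure := func() time.Duration {
		simulated -= time.Second
		return simulated
	}
	// Only attempt cut-over once lag is within the same bound used for row
	// copying (the migration above runs with --max-lag-millis=1500).
	if err := waitForLowHeartbeatLag(measure, 1500*time.Millisecond, time.Second, time.Minute); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("heartbeat lag is low; proceeding to cut-over")
}
```

The sketch only illustrates the reasoning in this comment: a cut-over attempt made while the lag is measured in minutes cannot succeed within a few-second lock timeout, so waiting up front beats burning through retries.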
Update: Because of the visibility into HeartbeatLag that this PR enabled, we were able to experiment with a number of Aurora parameters and eventually figure out why HeartbeatLag was growing so quickly.
On our clusters, the parameter aurora_binlog_replication_max_yield_seconds defaulted to 60 seconds. By changing this value to 0 (on a test cluster), we were able to completely remove the HeartbeatLag that we had been seeing while running migrations.
This parameter is described in Aurora's Optimizing binary log replication doc.
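For anyone who wants to confirm what their cluster is actually running with, here is a small example (not part of the PR) that reads the parameter over a normal connection using Go's database/sql with the go-sql-driver/mysql driver; it assumes Aurora MySQL exposes aurora_binlog_replication_max_yield_seconds as a readable global variable, and the DSN below is a placeholder.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver, also works against Aurora MySQL
)

func main() {
	// Placeholder DSN; point it at the Aurora writer endpoint.
	db, err := sql.Open("mysql", "user:password@tcp(cluster-endpoint:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var name, value string
	// Assumes the parameter is visible as a global variable on the cluster.
	err = db.QueryRow(
		"SHOW GLOBAL VARIABLES LIKE 'aurora_binlog_replication_max_yield_seconds'",
	).Scan(&name, &value)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s = %s\n", name, value)
}
```

Note that changing the value still has to go through the cluster or DB parameter group; reading it like this only confirms the effective setting.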
Warning: We have not changed this parameter in our production environment, as it may not be safe for us to do so. We are actually leaning towards keeping this value at its default of 60 seconds and instead relying on the safety added to gh-ost by this PR.
Ha ok!
Unfortunately, after setting aurora_enable_repl_bin_log_filtering: 0 we still get exactly the same logs and the same error 😦 Any more ideas? 😃