gh-ost: ERROR Timeout while waiting for events up to lock

We have a fairly large table that gets very busy at times, and we are trying to add column xxx int null. Everything looks fine until the cut-over phase, where gh-ost hits this error and then retries for a while but never succeeds. I have checked other issues here that report the same error but haven't found anything that applies. The binlog is not stuck; it keeps applying and rotating. This is on RDS Aurora with a standard setup, and we haven't changed any parameters, but I'm starting to think some DB parameter might need to be changed, though I'm not sure which one. We are running latest master and have tried several command flags, e.g. --cut-over-exponential-backoff and increasing --cut-over-lock-timeout-seconds and --default-retries. This is the full command right now:

gh-ost \
  --user=$user \
  --password=$password \
  --host=$host \
  --database=$database \
  --table=$table \
  --alter="$alter" \
  --max-load=Threads_running=25 \
  --critical-load=Threads_running=1000 \
  --chunk-size=1000 \
  --max-lag-millis=1500 \
  --verbose \
  --alter="engine=innodb" \
  --allow-master-master \
  --exact-rowcount \
  --concurrent-rowcount \
  --default-retries=120 \
  --allow-on-master \
  --initially-drop-ghost-table \
  --initially-drop-old-table \
  --cut-over-lock-timeout-seconds=10 \
  --assume-rbr \
  --cut-over-exponential-backoff \
  --execute
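
While it sits in the retry loop, the running gh-ost process can be asked for its current state over the interactive socket it creates (the path shows up later in the log as /tmp/gh-ost.xxx.xxx.sock). A small sketch, assuming a netcat build with unix-socket support; "sup" prints the brief status line and "status" the detailed one:

# Query the running gh-ost process over its unix socket while it retries cut-over.
echo status | nc -U /tmp/gh-ost.xxx.xxx.sock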

Copy: 79876601/79876601 100.0%; Applied: 16344; Backlog: 0/1000; Time: 3h42m10s(total), 2h20m24s(copy); streamer: mysql-bin-changelog.013410:1363370; Lag: 0.01s, State: migrating; ETA: due
2019-11-06 14:04:27 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:27 ERROR 2019-11-06 14:04:27 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:27 INFO Looking for magic cut-over table
2019-11-06 14:04:27 INFO Will now proceed to drop magic table and unlock tables
2019-11-06 14:04:27 INFO Dropping magic cut-over table
2019-11-06 14:04:27 INFO Dropping magic cut-over table
2019-11-06 14:04:27 INFO Dropping table `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Table dropped
2019-11-06 14:04:27 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Tables unlocked
2019-11-06 14:04:27 INFO Grabbing voluntary lock: gh-ost.428716346.lock
2019-11-06 14:04:27 INFO Setting LOCK timeout as 20 seconds
2019-11-06 14:04:27 INFO Looking for magic cut-over table
2019-11-06 14:04:27 INFO Creating magic cut-over table `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Magic cut-over table created
2019-11-06 14:04:27 INFO Locking `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Tables locked
2019-11-06 14:04:27 INFO Session locking original & magic tables is 428716346
2019-11-06 14:04:27 INFO Writing changelog state: AllEventsUpToLockProcessed:1573049067764852856
2019-11-06 14:04:27 INFO Waiting for events up to lock
Copy: 79876601/79876601 100.0%; Applied: 16344; Backlog: 0/1000; Time: 3h42m15s(total), 2h20m24s(copy); streamer: mysql-bin-changelog.013410:16394106; Lag: 0.01s, State: migrating; ETA: due
Copy: 79876601/79876601 100.0%; Applied: 16344; Backlog: 0/1000; Time: 3h42m20s(total), 2h20m24s(copy); streamer: mysql-bin-changelog.013410:32408062; Lag: 0.01s, State: migrating; ETA: due
2019-11-06 14:04:37 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:37 ERROR 2019-11-06 14:04:37 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:37 INFO Looking for magic cut-over table
2019-11-06 14:04:37 INFO Will now proceed to drop magic table and unlock tables
2019-11-06 14:04:37 INFO Dropping magic cut-over table
2019-11-06 14:04:37 INFO Dropping magic cut-over table
2019-11-06 14:04:37 INFO Dropping table `xxx`.`_xxx_del`
2019-11-06 14:04:37 INFO Table dropped
2019-11-06 14:04:37 INFO Removing socket file: /tmp/gh-ost.xxx.xxx.sock
2019-11-06 14:04:37 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-11-06 14:04:37 FATAL 2019-11-06 14:04:37 ERROR Timeout while waiting for events up to lock

About this issue

  • State: open
  • Created: November 2019
  • Reactions: 2
  • Comments: 28 (8 by maintainers)

Most upvoted comments

My company ran into the exact same problem while using gh-ost to run migrations against our Aurora clusters.

After a lot of deep diving, we finally put together this PR, which makes the problem visible and protects against cutOver failures.

Now when we run migrations against Aurora, we can see the value of HeartbeatLag grow larger and larger during a long-running migration on a busy table:

Copy: 23000/2172914 1.1%; Applied: 0; Backlog: 0/1000; Time: 1m10s(total), 3s(copy); streamer: mysql-bin-changelog.046700:117936636; Lag: 0.12s, HeartbeatLag: 0.12s, State: migrating; ETA: 4m41s 
Copy: 75000/2172914 3.5%; Applied: 0; Backlog: 0/1000; Time: 1m15s(total), 8s(copy); streamer: mysql-bin-changelog.046700:128606180; Lag: 0.12s, HeartbeatLag: 0.12s, State: migrating; ETA: 3m44s 
Copy: 128500/2172914 5.9%; Applied: 0; Backlog: 0/1000; Time: 1m20s(total), 13s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 2.52s, State: migrating; ETA: 3m26s 
Copy: 179000/2172914 8.2%; Applied: 0; Backlog: 0/1000; Time: 1m25s(total), 18s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 7.52s, State: migrating; ETA: 3m20s 
Copy: 228000/2172914 10.5%; Applied: 0; Backlog: 0/1000; Time: 1m30s(total), 23s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.12s, HeartbeatLag: 12.52s, State: migrating; ETA: 3m16s 
Copy: 279500/2172914 12.9%; Applied: 0; Backlog: 0/1000; Time: 1m35s(total), 28s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 17.52s, State: migrating; ETA: 3m9s 
Copy: 330000/2172914 15.2%; Applied: 0; Backlog: 0/1000; Time: 1m40s(total), 33s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 22.52s, State: migrating; ETA: 3m4s 
Copy: 378000/2172914 17.4%; Applied: 0; Backlog: 0/1000; Time: 1m45s(total), 38s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 27.52s, State: migrating; ETA: 3m0s 
Copy: 427000/2172914 19.7%; Applied: 0; Backlog: 0/1000; Time: 1m50s(total), 43s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 32.52s, State: migrating; ETA: 2m55s 
Copy: 477500/2172914 22.0%; Applied: 0; Backlog: 0/1000; Time: 1m55s(total), 48s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 37.52s, State: migrating; ETA: 2m50s 
Copy: 530000/2172914 24.4%; Applied: 0; Backlog: 0/1000; Time: 2m0s(total), 53s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 42.52s, State: migrating; ETA: 2m44s 

By the time we get to the cut-over stage, HeartbeatLag can be minutes. This means there is no chance that cut-over can succeed; instead, it will time out after 3 seconds (the default --cut-over-lock-timeout-seconds) with ERROR Timeout while waiting for events up to lock, and then retry up to 60 times (the default --default-retries).
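
For anyone without the PR build, a rough way to confirm that the binlog streamer (rather than replication) is what falls behind is to compare the frozen streamer: coordinates in gh-ost's status line with the coordinates the server is currently writing. A minimal sketch, assuming the migration account is allowed to run SHOW MASTER STATUS:

# Current binlog coordinates on the writer instance; if this keeps advancing
# while the "streamer:" position in gh-ost's status line stays put, the binlog
# reader is behind even though the reported replica Lag looks healthy.
mysql -h "$host" -u "$user" -p"$password" -e "SHOW MASTER STATUS\G"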

In the PR that we have submitted, we simply wait, before cutOver, for HeartbeatLag to drop to a safe level. This looks as follows:

Row copy complete
Waiting for heartbeat lag to be low enough to proceed
current HeartbeatLag (44.16s) is too high, it needs to be less than --max-lag-millis (1.50s) to continue

Eventually, HeartbeatLag drops enough to proceed, and cutOver succeeds on the first try.

Heartbeat lag is low enough, proceeding

Update: Because of the visibility into HeartbeatLag that this PR enabled, we were able to experiment with a number of Aurora parameters and eventually figure out why HeartbeatLag was growing so quickly.

On our clusters, the parameter aurora_binlog_replication_max_yield_seconds defaulted to 60 seconds. By changing this value to 0 (on a test cluster), we were able to completely remove the HeartbeatLag that we were seeing.

Instead, we started to see the following while running migrations:

Copy: 2342000/2256823 103.8%; Applied: 0; Backlog: 0/1000; Time: 4m0s(total), 3m51s(copy); streamer: mysql-bin-changelog.046761:69254547; Lag: 0.11s, HeartbeatLag: 0.11s, State: migrating; ETA: due
2021-02-03 13:57:36 INFO Row copy complete
Copy: 2401160/2401160 100.0%; Applied: 0; Backlog: 0/1000; Time: 4m5s(total), 3m56s(copy); streamer: mysql-bin-changelog.046761:81901819; Lag: 0.01s, HeartbeatLag: 0.07s, State: migrating; ETA: due 
2021-02-03 13:57:36 INFO Waiting for heartbeat lag to be low enough to proceed
2021-02-03 13:57:36 INFO Heartbeat lag is low enough, proceeding

This parameter is described in Aurora’s Optimizing binary log replication doc.
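
For reference, this is set through the parameter group rather than per session. A minimal sketch of checking and changing it, assuming the parameter lives in the DB cluster parameter group, with my-test-aurora-cluster-params as a placeholder group name and ApplyMethod=immediate assuming the parameter is dynamic on your engine version:

# What the cluster is currently running with.
mysql -h "$host" -u "$user" -p"$password" \
  -e "SHOW VARIABLES LIKE 'aurora_binlog_replication_max_yield_seconds'"

# Change it on a *test* cluster's parameter group (placeholder group name).
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name my-test-aurora-cluster-params \
  --parameters "ParameterName=aurora_binlog_replication_max_yield_seconds,ParameterValue=0,ApplyMethod=immediate"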

Warning: We have not changed this parameter in our production environment as it may not be safe for us to do so. We are actually leaning towards keeping this value at its default of 60 seconds and instead relying on the safety added to gh-ost by this PR.

Ha ok!

Unfortunately, after setting aurora_enable_repl_bin_log_filtering = 0 we still see exactly the same logs and error 😦 Any more ideas? 😃