vitess: Failed backup puts tablet into unhealthy state

I have a shard where the backup fails about 50% of the time. The shard in question has 225 GB of data. Once it fails, it never recovers until I delete the k8s pod and it restarts.

While the error itself is different, the outcome sounds the same as what @rnavarro reported in #2238.

vttablet logs

I0121 13:35:09.379272       1 mysqld.go:329] Mysqld.Shutdown
I0121 13:35:09.379276       1 mysqld.go:333] executing Mysqld.Shutdown() remotely via mysqlctld server: /vtdataroot/mysqlctl.sock
I0121 13:35:14.420687       1 backup.go:413] found 132 files to backup
E0121 13:58:54.605921       1 engine.go:185] periodic schema reload failed: could not get table list for reload: net.Dial(/vtdataroot/tabletdata/mysql.sock) to local server failed: dial unix /vtdataroot/tabletdata/mysql.sock: connect: no such file or directory (errno 2002) (sqlstate HY000)
I0121 13:58:54.606008       1 tabletserver.go:320] TabletServer state: NOT_SERVING (Not Serving) -> NOT_SERVING (Transitioning)
I0121 13:58:54.606070       1 tabletserver.go:320] TabletServer state: NOT_SERVING (Transitioning) -> NOT_SERVING (Not Serving)
I0121 13:58:54.606086       1 tabletserver.go:682] Check MySQL failed. Shutting down query service
I0121 13:58:54.606105       1 tabletserver.go:320] TabletServer state: NOT_SERVING (Not Serving) -> SHUTTING_DOWN (Shutting Down)
I0121 13:58:54.606119       1 tabletserver.go:583] Executing complete shutdown.
I0121 13:58:54.606385       1 tabletserver.go:590] Shutdown complete.
I0121 13:58:54.606409       1 tabletserver.go:320] TabletServer state: SHUTTING_DOWN (Shutting Down) -> NOT_SERVING (Not Connected)
E0121 14:35:09.220352       1 backup.go:496] failed to close file [20 context deadline exceeded]: %!v(MISSING)
I0121 14:35:09.220485       1 mysqld.go:177] executing Mysqld.Start() remotely via mysqlctld server: /vtdataroot/mysqlctl.sock
E0121 14:35:09.220545       1 backup.go:287] backup is not usable, aborting it: [can't restart mysqld: context deadline exceeded: <nil>]
E0121 14:35:09.220819       1 backup.go:295] failed to finish backup: [context deadline exceeded]: %!v(MISSING)
E0121 14:35:09.220935       1 rpc_backup.go:85] mysql backup command returned error: can't restart mysqld: context deadline exceeded: <nil>
I0121 14:35:09.220982       1 state_change.go:130] Executing post-action state refresh: after backup
W0121 14:35:09.221009       1 state_change.go:141] Failed rereading tablet after after backup - services may be inconsistent: deadline exceeded: tablets/aks-1588972600/Tablet
W0121 14:35:09.221070       1 rpc_server.go:91] TabletManager.Backup(concurrency:4 )(on aks-1588972600 from ) error: refreshTablet failed rereading tablet after after backup: deadline exceeded: tablets/aks-1588972600/Tablet

mysqlctld logs

I0121 13:35:09.393314       1 mysqld.go:329] Mysqld.Shutdown
I0121 13:35:09.394243       1 mysqld.go:367] No mysqld_shutdown hook, running mysqladmin directly
I0121 13:35:09.394887       1 mysqld.go:433] execCmd: /usr/bin/mysqladmin /usr/bin/mysqladmin [--defaults-extra-file=/tmp/example225411938 --shutdown-timeout=300 --connect-timeout=30 --wait=10 shutdown]
I0121 13:35:14.023813       1 mysqld.go:260] Mysqld.Start(1548059306) exit: <nil>
I0121 13:35:14.417685       1 mysqld.go:447] execCmd: /usr/bin/mysqladmin output: 
I0121 13:35:14.417728       1 mysqld.go:408] Mysqld.Shutdown: waiting for socket file (/vtdataroot/tabletdata/mysql.sock) and pid file (/vtdataroot/tabletdata/mysql.pid) to disappear

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 20 (20 by maintainers)

Most upvoted comments

@derekperkins You’re correct, it was a flag to the vtctlclient call:

vtctlclient -action_timeout 2h0m0s