mariadb-operator: [Bug] Cannot add a semi-sync replica to existing cluster
Describe the bug
I have a running semi-sync replication cluster with 1 master and 1 replica. Trying to add a new replica fails, since the master cannot find the binlogs required by the replica:
2023-06-19 9:47:12 12 [Note] Master connection name: 'mariadb-operator' Master_info_file: 'master-mariadb@002doperator.info' Relay_info_file: 'relay-log-mariadb@002doperator.info'
2023-06-19 9:47:12 12 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='mariadb-myns-0.mariadb-myns.myns.svc.cluster.local', master_port='3306', master_log_file='', master_log_pos='4'.
2023-06-19 9:47:12 12 [Note] Previous Using_Gtid=No. New Using_Gtid=Current_Pos
2023-06-19 9:47:12 13 [Note] Master 'mariadb-operator': Slave I/O thread: Start semi-sync replication to master 'repl@mariadb-myns-0.mariadb-myns.myns.svc.cluster.local:3306' in log '' at position 4
2023-06-19 9:47:12 14 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './mariadb-myns-relay-bin-mariadb@002doperator.000001' position: 4; GTID position ''
2023-06-19 9:47:12 13 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master 'repl@mariadb-myns-0.mariadb-myns.myns.svc.cluster.local:3306',replication starts at GTID position ''
2023-06-19 9:47:12 13 [ERROR] Master 'mariadb-operator': Error reading packet from server: Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged. (server_errno=1236)
2023-06-19 9:47:12 13 [ERROR] Master 'mariadb-operator': Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.', Internal MariaDB error code: 1236
2023-06-19 9:47:12 13 [Note] Master 'mariadb-operator': Slave I/O thread exiting, read up to log 'FIRST', position 4; GTID position , master mariadb-myns-0.mariadb-myns.myns.svc.cluster.local:3306
2023-06-19 9:47:12 13 [Note] Master 'mariadb-operator': Failed to gracefully kill our active semi-sync connection with primary. Silently closing the connection.
Expected behaviour
New replica initializes properly.
Steps to reproduce the bug
- Create cluster with semi-sync replication and 2 replicas
- Run the cluster until older binlog files are purged
- Try to scale replicas to 3 (one way to force the binlog purge and scale up is sketched after this list)
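For reference, a sketch of one way to force this on a throwaway cluster; the resource name `mariadb-myns`, the secret name/key, and the use of `spec.replicas` are assumptions based on the environment in this issue, and the `PURGE` statement deliberately throws binlogs away:

```bash
# Root password of the primary; secret and key names are assumptions, adjust to your setup.
MARIADB_ROOT_PASSWORD=$(kubectl get secret mariadb -o jsonpath='{.data.root-password}' | base64 -d)

# Rotate the binlog and purge everything before now on the primary (destructive for lagging replicas).
kubectl exec mariadb-myns-0 -- mariadb -uroot -p"$MARIADB_ROOT_PASSWORD" \
  -e "FLUSH BINARY LOGS; PURGE BINARY LOGS BEFORE NOW();"

# Scale the MariaDB resource up so the operator provisions a third replica.
kubectl patch mariadb mariadb-myns --type merge -p '{"spec":{"replicas":3}}'
```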
Environment details:
- Kubernetes version: 1.26.5
- mariadb-operator version: v0.0.15
- Install method: helm
Okay, my turn:
Disable mariadb-operator:
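For example, assuming the operator was installed via Helm as a Deployment named `mariadb-operator` in the `mariadb-operator` namespace (names may differ per install), it can be paused by scaling it to zero:

```bash
# Pause reconciliation by scaling the operator Deployment down (Deployment/namespace names are assumptions).
kubectl -n mariadb-operator scale deployment mariadb-operator --replicas=0
```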
Double check that `.spec.replication.primary.podIndex` is correct and points to the same pod specified in `.status.currentPrimaryPodIndex`. I had an issue where the operator was setting an empty `gtid_slave_pos` when promoting the master, so don't repeat my mistakes and check that it is not empty. Let's define variables to reuse them later:
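A sketch of what that could look like, using the pod names from this issue; the secret name and key holding the root password are assumptions and depend on your `MariaDB` spec:

```bash
# Pods of the MariaDB StatefulSet (current kubectl namespace assumed to be "myns").
MASTER=mariadb-myns-0      # should match .status.currentPrimaryPodIndex
REPLICA=mariadb-myns-2     # the new replica that fails to initialize

# Root password; secret and key names are assumptions, adjust to where your root password secret lives.
MARIADB_ROOT_PASSWORD=$(kubectl get secret mariadb -o jsonpath='{.data.root-password}' | base64 -d)

# Verify the GTID state on the current primary; gtid_slave_pos should not be unexpectedly empty.
kubectl exec "$MASTER" -- mariadb -uroot -p"$MARIADB_ROOT_PASSWORD" \
  -e "SELECT @@gtid_slave_pos, @@gtid_binlog_pos, @@gtid_current_pos"
```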
Copy a backup from `$MASTER` to `$REPLICA`. If mariabackup gets stuck on `Waiting for log copy thread to read lsn 568204542376`, restart the source pod and try again.
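One possible way to do the copy, assuming `mariabackup` and `mbstream` are available in the MariaDB image, there is enough free space under `/var/lib/mysql`, and the variables defined above are set:

```bash
# Stream a physical backup from the primary straight into a scratch directory on the new replica.
kubectl exec "$MASTER" -- \
  mariabackup --backup --stream=xbstream --target-dir=/tmp \
    --user=root --password="$MARIADB_ROOT_PASSWORD" \
  | kubectl exec -i "$REPLICA" -- \
      sh -c 'mkdir -p /var/lib/mysql/.restore && mbstream -x -C /var/lib/mysql/.restore'
```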
After the backup is created, start recovering it on the replica:
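Again only a sketch; the scratch directory is the assumption from the previous step, and replacing the datadir is left as a comment because it has to happen while the server is not running:

```bash
# Apply the redo log so the backup is consistent, then read the replication position it corresponds to.
kubectl exec "$REPLICA" -- mariabackup --prepare --target-dir=/var/lib/mysql/.restore
kubectl exec "$REPLICA" -- cat /var/lib/mysql/.restore/xtrabackup_binlog_info
# xtrabackup_binlog_info contains the binlog file, position and GTID of the backup;
# note the GTID down, it is the value used for gtid_slave_pos below.
# The prepared files then have to replace the replica's datadir (e.g. mariabackup --copy-back)
# while mariadbd is stopped.
```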
Now restart the pod:
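For example:

```bash
# Let the StatefulSet recreate the replica pod on top of the restored data directory.
kubectl delete pod "$REPLICA"
```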
Exec into it and set `gtid_slave_pos` from the output above, then enable the operator again:
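A sketch of both steps; the GTID below is a placeholder for the value read from `xtrabackup_binlog_info`, and the operator Deployment name is the same assumption as before:

```bash
# Set the replication position recorded by the backup (all replication connections must be stopped first).
kubectl exec "$REPLICA" -- mariadb -uroot -p"$MARIADB_ROOT_PASSWORD" -e "
  STOP ALL SLAVES;
  -- placeholder value, use the GTID from xtrabackup_binlog_info
  SET GLOBAL gtid_slave_pos = '0-1-123456';
  START ALL SLAVES;
"

# Re-enable the operator so it reconciles replication again.
kubectl -n mariadb-operator scale deployment mariadb-operator --replicas=1
```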
Wait for it to configure replication, and watch the status:
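For example, with the variables from above:

```bash
# Check the named replication connection; Seconds_Behind_Master should trend towards 0.
kubectl exec "$REPLICA" -- mariadb -uroot -p"$MARIADB_ROOT_PASSWORD" -e "SHOW ALL SLAVES STATUS\G" \
  | grep -E 'Connection_name|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Gtid_Slave_Pos'
```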
`Seconds_Behind_Master` should slowly go down.

I think this is not the correct way to get it, since `gtid_current_pos` on the master increases with each operation, and the replica needs to restart replication where it left off. Just to make my point clearer, I did the following:
- with `replicas: 2`, the node shutting down logs its current GTID
- `gtid_current_pos` on the master keeps increasing
- with `replicas: 3`, the replica restarts replication from `gtid_current_pos` = 0-10-41

So reading the value from the master does not work; the replica already knows its last `gtid_current_pos` on restart.

@fchiacchiaretta awesome investigation, thanks for the level of detail 💯 🥇
I’ve managed to reproduce the issue by:
- Creating a MariaDB instance with `max-binlog-size=4096` and `binlog_expire_logs_seconds=10`
- As a result, the binary logs got deleted
- At this point, upscaling to 3 results in the following logs in the new replica:
This happens with both MariaDB versions 10.11.3 and 10.6.13. Having a look now 👀

Hello @mmontes11, thank you for your answer. Here is my MariaDB resource:

I did no special operation to trigger binlog purging, but on the master pod I see that some files have been purged:
I checked for `expire_logs_days` and `binlog_expire_logs_seconds`, but it seems that neither of them has been set.

I'm trying to reproduce the issue on a test cluster too; I'll be back if I have some additional results.
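For completeness, a quick way to inspect those settings and the remaining binlog files on the primary (root credentials assumed to be at hand, as in the steps earlier in the thread):

```bash
# Show binlog retention settings and the binlog files currently kept by the primary.
# MARIADB_ROOT_PASSWORD as obtained earlier in the thread; adjust to your secret.
kubectl exec mariadb-myns-0 -- mariadb -uroot -p"$MARIADB_ROOT_PASSWORD" -e "
  SELECT @@expire_logs_days, @@binlog_expire_logs_seconds, @@max_binlog_size;
  SHOW BINARY LOGS;
"
```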