mariadb-operator: [Bug] Galera complete cluster failure recovery
Yesterday evening I deployed mariadb-operator for a development Galera cluster. I then started importing all table structures, which went fine. After that, I imported some big SQL data files in parallel over the course of the night, the biggest being a 13G file. All files were originally created using mysqldump from a production database. Unfortunately, when I woke up, all 3 Galera instances had crashed and the pods were in a CrashLoop.
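(For context, the imports were plain mysqldump files streamed back in with the client. A minimal sketch of that kind of parallel import; the pod name, container name, file names, and the MARIADB_ROOT_PASSWORD environment variable are assumptions, not confirmed from this cluster:)
# Sketch: stream each dump into one node in parallel. All names are illustrative.
for f in big-dump-1.sql big-dump-2.sql; do
  kubectl -n mariadb-operator exec -i mariadb-galera-dev-0 -c mariadb -- \
    sh -c 'mariadb -uroot -p"$MARIADB_ROOT_PASSWORD" mariadb' < "$f" &
done
wait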
Describe the bug
Unfortunately, this cluster does not have additional logging yet, so figuring out why the first crash happened is no longer possible. What is possible is to bootstrap the cluster and resume operations from the last known good state. Unfortunately, mariadb-operator does not seem to provide a way to do this. All attempts to start the servers end with the following logs:
2023-08-02 07:09:00+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.0.2+maria~ubu2204 started.
2023-08-02 07:09:00+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2023-08-02 07:09:00+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.0.2+maria~ubu2204 started.
2023-08-02 07:09:01+00:00 [Note] [Entrypoint]: MariaDB upgrade not required
2023-08-02 7:09:01 0 [Note] Starting MariaDB 11.0.2-MariaDB-1:11.0.2+maria~ubu2204-log source revision 0005f2f06c8e1aea4915887decad67885108a929 as process 1
2023-08-02 7:09:01 0 [Note] WSREP: Loading provider /usr/lib/galera/libgalera_smm.so initial position: 00000000-0000-0000-0000-000000000000:-1
2023-08-02 7:09:01 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
2023-08-02 7:09:01 0 [Note] WSREP: wsrep_load(): Galera 26.4.14(r06a0c285) by Codership Oy <info@codership.com> loaded successfully.
2023-08-02 7:09:01 0 [Note] WSREP: Initializing allowlist service v1
2023-08-02 7:09:01 0 [Note] WSREP: Initializing event service v1
2023-08-02 7:09:01 0 [Note] WSREP: CRC-32C: using 64-bit x86 acceleration.
2023-08-02 7:09:01 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootstrap: 0
2023-08-02 7:09:01 0 [Note] WSREP: GCache DEBUG: opened preamble:
Version: 2
UUID: 1fae98e4-30a7-11ee-b48b-1a696f898dba
Seqno: 4930 - 4963
Offset: 1280
Synced: 1
2023-08-02 7:09:01 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: 1fae98e4-30a7-11ee-b48b-1a696f898dba, offset: 1280
2023-08-02 7:09:01 0 [Note] WSREP: GCache::RingBuffer initial scan... 0.0% ( 0/134217752 bytes) complete.
2023-08-02 7:09:01 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (134217752/134217752 bytes) complete.
2023-08-02 7:09:01 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 4930-4963
2023-08-02 7:09:01 0 [Note] WSREP: GCache::RingBuffer unused buffers scan... 0.0% ( 0/56867952 bytes) complete.
2023-08-02 7:09:01 0 [Note] WSREP: Recovering GCache ring buffer: found 6/40 locked buffers
2023-08-02 7:09:01 0 [Note] WSREP: Recovering GCache ring buffer: free space: 79235968/134217728
2023-08-02 7:09:01 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (56867952/56867952 bytes) complete.
2023-08-02 7:09:01 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = mariadb-galera-dev-0.mariadb-galera-dev-internal.mariadb-operator.svc.cluster.local; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.keep_plaintext_size = 128M; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.fc_single_primary = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807;
2023-08-02 7:09:01 0 [Note] WSREP: Start replication
2023-08-02 7:09:01 0 [Note] WSREP: Connecting with bootstrap option: 0
2023-08-02 7:09:01 0 [Note] WSREP: Setting GCS initial position to 00000000-0000-0000-0000-000000000000:-1
2023-08-02 7:09:01 0 [Note] WSREP: Using CRC-32C for message checksums.
2023-08-02 7:09:01 0 [Note] WSREP: backend: asio
2023-08-02 7:09:01 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2023-08-02 7:09:01 0 [Note] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2023-08-02 7:09:01 0 [Note] WSREP: restore pc from disk failed
2023-08-02 7:09:01 0 [Note] WSREP: GMCast version 0
2023-08-02 7:09:01 0 [Note] WSREP: (750c83dc-803e, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2023-08-02 7:09:01 0 [Note] WSREP: (750c83dc-803e, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2023-08-02 7:09:01 0 [Note] WSREP: EVS version 1
2023-08-02 7:09:01 0 [Note] WSREP: gcomm: connecting to group 'mariadb-operator', peer 'mariadb-galera-dev-0.mariadb-galera-dev-internal.mariadb-operator.svc.cluster.local:,mariadb-galera-dev-1.mariadb-galera-dev-internal.mariadb-operator.svc.cluster.local:,mariadb-galera-dev-2.mariadb-galera-dev-internal.mariadb-operator.svc.cluster.local:'
2023-08-02 7:09:01 0 [Note] WSREP: (750c83dc-803e, 'tcp://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp://10.244.4.10:4567
2023-08-02 7:09:04 0 [Note] WSREP: EVS version upgrade 0 -> 1
2023-08-02 7:09:04 0 [Note] WSREP: PC protocol upgrade 0 -> 1
2023-08-02 7:09:04 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2023-08-02 7:09:04 0 [Note] WSREP: view(view_id(NON_PRIM,750c83dc-803e,1) memb {
750c83dc-803e,0
} joined {
} left {
} partitioned {
})
2023-08-02 7:09:04 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50275S), skipping check
2023-08-02 7:09:34 0 [Note] WSREP: PC protocol downgrade 1 -> 0
2023-08-02 7:09:34 0 [Note] WSREP: view((empty))
2023-08-02 7:09:34 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at ./gcomm/src/pc.cpp:connect():160
2023-08-02 7:09:34 0 [ERROR] WSREP: ./gcs/src/gcs_core.cpp:gcs_core_open():221: Failed to open backend connection: -110 (Connection timed out)
2023-08-02 7:09:35 0 [ERROR] WSREP: ./gcs/src/gcs.cpp:gcs_open():1669: Failed to open channel 'mariadb-operator' at 'gcomm://mariadb-galera-dev-0.mariadb-galera-dev-internal.mariadb-operator.svc.cluster.local,mariadb-galera-dev-1.mariadb-galera-dev-internal.mariadb-operator.svc.cluster.local,mariadb-galera-dev-2.mariadb-galera-dev-internal.mariadb-operator.svc.cluster.local': -110 (Connection timed out)
2023-08-02 7:09:35 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2023-08-02 7:09:35 0 [ERROR] WSREP: wsrep::connect(gcomm://mariadb-galera-dev-0.mariadb-galera-dev-internal.mariadb-operator.svc.cluster.local,mariadb-galera-dev-1.mariadb-galera-dev-internal.mariadb-operator.svc.cluster.local,mariadb-galera-dev-2.mariadb-galera-dev-internal.mariadb-operator.svc.cluster.local) failed: 7
2023-08-02 7:09:35 0 [ERROR] Aborting
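For context on the state above (saved state 00000000-…:-1, safe_to_bootstrap: 0 on every node): the usual manual Galera recovery starts by finding the most advanced node. Since the pods are crashlooping, one way is to mount each node's data PVC in a throwaway pod and inspect it there. A sketch for node 0; the PVC name, namespace, and the need to first stop the crashlooping pod (the volume is ReadWriteOnce) are assumptions to verify:
# Utility pod mounting node 0's data volume. The claim name is an
# assumption; check `kubectl -n mariadb-operator get pvc`.
kubectl -n mariadb-operator apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: galera-inspect-0
spec:
  restartPolicy: Never
  containers:
    - name: inspect
      image: mariadb:11.0.2
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: storage-mariadb-galera-dev-0
EOF

# grastate.dat holds the last known UUID/seqno and the safe_to_bootstrap flag.
kubectl -n mariadb-operator exec galera-inspect-0 -- cat /var/lib/mysql/grastate.dat

# If seqno is -1 (unclean shutdown, as in the logs above), recover the real
# position from InnoDB; it is logged as "WSREP: Recovered position: <uuid>:<seqno>".
kubectl -n mariadb-operator exec galera-inspect-0 -- mariadbd --user=mysql --wsrep-recover
Repeating this for nodes 1 and 2, the node with the highest recovered seqno is the bootstrap candidate.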
Expected behaviour
mariadb-operator could either:
- provide an automated recovery, or
- provide a manual recovery procedure
Steps to reproduce the bug
Freshly deploy mariadb-operator (v0.19.0) using the following yaml:
apiVersion: mariadb.mmontes.io/v1alpha1
kind: MariaDB
metadata:
name: mariadb-galera-dev
namespace: mariadb-operator
spec:
rootPasswordSecretKeyRef:
name: mariadb
key: root-password
database: mariadb
username: mariadb
passwordSecretKeyRef:
name: mariadb
key: password
connection:
secretName: mariadb-conn
secretTemplate:
key: dsn
image:
repository: mariadb
tag: "11.0.2"
pullPolicy: IfNotPresent
port: 3306
replicas: 3
podSecurityContext:
runAsUser: 0
galera:
enabled: true
sst: mariabackup
replicaThreads: 1
agent:
image:
repository: ghcr.io/mariadb-operator/agent
tag: "v0.0.2"
pullPolicy: IfNotPresent
port: 5555
kubernetesAuth:
enabled: true
authDelegatorRoleName: mariadb-galera-auth
gracefulShutdownTimeout: 5s
recovery:
enabled: true
clusterHealthyTimeout: 1m
clusterBootstrapTimeout: 5m
podRecoveryTimeout: 3m
podSyncTimeout: 3m
initContainer:
image:
repository: ghcr.io/mariadb-operator/init
tag: "v0.0.5"
pullPolicy: IfNotPresent
volumeClaimTemplate:
resources:
requests:
storage: 50Mi
accessModes:
- ReadWriteOnce
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- topologyKey: "kubernetes.io/hostname"
tolerations:
- key: "mariadb.mmontes.io/ha"
operator: "Exists"
effect: "NoSchedule"
podDisruptionBudget:
maxUnavailable: 66%
updateStrategy:
type: RollingUpdate
myCnf: |
[mysqld]
bind-address=0.0.0.0
default_storage_engine=InnoDB
binlog_format=row
innodb_autoinc_lock_mode=2
max_allowed_packet=512M
relay-log = relay-bin
log-slave-updates = 1
expire_logs_days = 7
max_binlog_size = 1000M
## Safety
##
skip_host_cache
skip_name_resolve
max_allowed_packet=16M
max_connect_errors=1000000
sql_mode=STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE,ONLY_FULL_GROUP_BY
sysdate_is_now=1
log_bin=mysql-bin
expire_logs_days=14
# Disabling for performance per http://severalnines.com/blog/9-tips-going-production-galera-cluster-mysql
sync_binlog=0
## Caches and Limits
tmp_table_size=32M
max_heap_table_size=32M
# Re-enabling as now works with Maria 10.1.2
query_cache_type=1
query_cache_limit=4M
query_cache_size=256M
max_connections=3000
thread_handling = pool-of-threads
thread_stack = 192K
thread_cache_size = 4
thread_pool_size = 8
thread_cache_size=50
thread_pool_oversubscribe = 3
open_files_limit=65535
table_definition_cache=4096
table_open_cache=4096
## InnoDB
#innodb=FORCE
innodb_strict_mode=1
innodb_autoinc_lock_mode=2
innodb_doublewrite=1
innodb_flush_method=O_DIRECT_NO_FSYNC
innodb_log_files_in_group=2
innodb_log_file_size=1G
innodb_flush_log_at_trx_commit=2
innodb_file_per_table=1
# 80% Memory is default reco.
# Need to re-evaluate when DB size grows
innodb_buffer_pool_size=12G
innodb_buffer_pool_instances=8
innodb_file_format=Barracuda
innodb_read_io_threads=8
innodb_write_io_threads=4
innodb_log_buffer_size = 512M
resources:
# requests:
# cpu: 4000m
# memory: 8Gi
limits:
memory: 32Gi
volumeClaimTemplate:
resources:
requests:
storage: 400Gi
storageClassName: rook-ceph-block
accessModes:
- ReadWriteOnce
service:
type: ClusterIP
Additional context
We're using Rook for storage provisioning. All disks report as healthy, and no crashes have been reported by Ceph (or Kubernetes) over the last 24 hours.
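(The health checks behind that statement can be run from the Rook toolbox; the rook-ceph namespace and rook-ceph-tools deployment name are the Rook defaults and assumed here:)
# Ceph health via the Rook toolbox (default names assumed).
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd status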
Environment details:
- Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:13:28Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"linux/amd64"}
- mariadb-operator version: 0.19.0
- Install method: helm
- Install flavour: recommended
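(For completeness, the helm install was roughly the following; the chart repo URL reflects the mariadb-operator docs and should be treated as an assumption:)
# Sketch of the operator install via helm.
helm repo add mariadb-operator https://mariadb-operator.github.io/mariadb-operator
helm repo update
helm install mariadb-operator mariadb-operator/mariadb-operator \
  -n mariadb-operator --create-namespace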
About this issue
- State: closed
- Created a year ago
- Comments: 18 (9 by maintainers)
Hey @tflabs-nl! Thanks a lot for your effort diagnosing this. This is very valuable, not only for your use case but also for the troubleshooting section I'm currently writing.
Right, but then this explains why the agent fails trying to recover them, returning the error:
These log lines might be pointing at the root cause:
After quickly googling them, I've seen other people reporting similar problems:
Will dig deeper and get back to you.
cc @grooverdan
Thanks a lot! 🙏🏻
Hi @mmontes11, thank you for checking this out!
I’ve set the operator log level to debug in the deployment and verified that it got properly applied. Also, I changed the recovery times to the following:
{"clusterBootstrapTimeout":"50m0s","clusterHealthyTimeout":"15m0s","enabled":true,"podRecoveryTimeout":"30m0s","podSyncTimeout":"30m0s"}Meanwhile I noticed this line in the logs:
Probably because of the unclean shutdown on (eventually) all nodes, no last commit/record UUID is stored. What is the best way to bootstrap a cluster in which all nodes failed using the operator?
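For reference, outside of the operator the classic procedure for a fully-crashed cluster is: pick the node with the highest recovered seqno, mark it safe to bootstrap, start it alone as a new cluster, then let the others rejoin via SST. A sketch continuing from the galera-inspect-0 pod above (all names are assumptions, and how to pass --wsrep-new-cluster through the operator is exactly the open question here):
# 1. On the most advanced node only: mark it safe to bootstrap.
kubectl -n mariadb-operator exec galera-inspect-0 -- \
  sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat

# 2. Start that node as a new cluster. Outside Kubernetes this is
#    `galera_new_cluster`, i.e. mariadbd --wsrep-new-cluster.
# 3. Once it is Primary, delete the remaining pods so they rejoin
#    and SST from the bootstrapped node:
kubectl -n mariadb-operator delete pod mariadb-galera-dev-1 mariadb-galera-dev-2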
Meanwhile, this piece of debugging info is recorded by the operator: