KeyDB: Keydb replica is sending an error to its master
Describe the bug
We have 2 KeyDB pods in active-replica mode. For some reason one of the pods started sending errors:
== CRITICAL == This replica is sending an error to its master: 'command did not execute' after processing the command 'rreplay'
At that moment the pod was near its maxmemory limit (2 GB in our case) and started returning OOM errors for every new write. We use the volatile-lru eviction policy, but it did not free any memory, so the pod stayed near maxmemory for around 4 hours until we noticed the issue. The most interesting part: this pod held just 4 keys, so there was nothing to clean up, while the second, healthy pod had almost 300 keys and used only around 700 MB of memory. The other pod had no errors in its logs at all. After a restart the pod went back to normal and used around 150 MB after syncing with the second pod.
# Memory
used_memory:2319649624
used_memory_human:2.16G
used_memory_rss:2144624640
used_memory_rss_human:2.00G
used_memory_peak:2323309320
used_memory_peak_human:2.16G
used_memory_peak_perc:99.84%
used_memory_overhead:10226950
used_memory_startup:9109968
used_memory_dataset:2309422674
used_memory_dataset_perc:99.95%
allocator_allocated:2320237592
allocator_active:2333794304
allocator_resident:2410905600
total_system_memory:67557654528
total_system_memory_human:62.92G
used_memory_lua:44032
used_memory_lua_human:43.00K
used_memory_scripts:200
used_memory_scripts_human:200B
number_of_cached_scripts:1
maxmemory:2147483648
maxmemory_human:2.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.01
allocator_frag_bytes:13556712
allocator_rss_ratio:1.03
allocator_rss_bytes:77111296
rss_overhead_ratio:0.89
rss_overhead_bytes:-266280960
mem_fragmentation_ratio:0.92
mem_fragmentation_bytes:-174982480
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:17186
mem_clients_normal:34372
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
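For anyone debugging a similar state, here is a minimal diagnostic sketch (assuming redis-py and a placeholder hostname for the affected pod) that pulls the same fields shown above and lists every key with its TTL; the TTLs matter because volatile-lru only considers keys that have an expiry set:

```python
# Diagnostic sketch; "keydb-0" and port 6379 are placeholders for the affected pod.
import redis

r = redis.Redis(host="keydb-0", port=6379, decode_responses=True)

mem = r.info("memory")
used = int(mem["used_memory"])
maxmem = int(mem["maxmemory"])
print(f"used_memory: {used}  maxmemory: {maxmem}")
if maxmem:
    print(f"usage: {used / maxmem:.1%}")
print(f"maxmemory_policy: {mem['maxmemory_policy']}")

# volatile-lru can only evict keys with an expiry set;
# a TTL of -1 means the key has no expiry and is not evictable.
for key in r.scan_iter(count=100):
    print(key, "ttl:", r.ttl(key))
```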
Replication config seems to be ok as well:
role:active-replica
master_global_link_status:up
Master 0:
master_host:keydb-1
master_port:6379
master_link_status:up
master_last_io_seconds_ago:6
master_sync_in_progress:0
slave_repl_offset:3301405475576
slave_priority:100
slave_read_only:0
connected_slaves:1
slave0:ip=keydb-1-ip,port=6379,state=online,offset=3927683809290,lag=0
master_replid:b8f99c843eba332f8f4da5aa6979083df56ea236
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:3927683809290
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:3927682760715
repl_backlog_histlen:1048576
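To keep an eye on this without exec-ing into the pods, a quick sketch (again redis-py, with placeholder hostnames for the two pods) that polls the same replication fields on both nodes:

```python
# Replication health check; hostnames are placeholders for the two KeyDB pods.
import redis

for host in ("keydb-0", "keydb-1"):
    r = redis.Redis(host=host, port=6379, decode_responses=True)
    repl = r.info("replication")
    print(host,
          "role:", repl.get("role"),
          "global_link:", repl.get("master_global_link_status"),
          "master_repl_offset:", repl.get("master_repl_offset"))
```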
**Log Files**
== CRITICAL == This replica is sending an error to its master: 'command did not execute' after processing the command 'rreplay'
Keydb image: eqalpha/keydb:x86_64_v6.0.8
About this issue
- State: open
- Created 4 years ago
- Reactions: 12
- Comments: 16
Same here… would love to see some movement on resolving this issue.
**Log Files**
== CRITICAL == This replica is sending an error to its master: 'command did not execute' after processing the command 'rreplay'
Keydb image: eqalpha/keydb:x86_64_v6.2.1
Based on a short investigation we deduced that the multi-master, active-replica configuration was causing it. Rebooting one of the masters triggered some sort of sync burst between the nodes. If I remember correctly, we saw it in setups with 3 or more nodes, but not in 2-node setups.
We decided we could live without the multi-master, active-replica setup and moved to a master-slave configuration for stability. For the larger database use case we reverted back to Redis.
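For reference, a rough sketch of what that fallback looks like (redis-py, with placeholder hostnames keydb-0/keydb-1): one node stays a plain master and the other follows it as an ordinary read-only replica instead of both running as active replicas.

```python
# Sketch of a plain master/replica setup; hostnames and ports are placeholders.
import redis

master = redis.Redis(host="keydb-0", port=6379, decode_responses=True)
replica = redis.Redis(host="keydb-1", port=6379, decode_responses=True)

master.slaveof()                                  # no args: promote this node to master
replica.slaveof("keydb-0", 6379)                  # follow keydb-0 as an ordinary replica
replica.config_set("replica-read-only", "yes")    # reject writes on the replica

print(replica.info("replication").get("role"))    # expect "slave" once the link is up
```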