memcached: restart: unable to recover data in some cases
Describe the bug
After running with --memory-file enabled for a few days, restarting the process gracefully (SIGUSR1) properly saves the metadata file, but the data cannot be recovered on the next startup.
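For background on the flow being exercised here: the restartable cache relies on a signal-driven graceful stop, where SIGUSR1 asks the running server to exit cleanly and write a metadata save file that the next startup uses to recover the cache from the memory file. A minimal sketch of that general pattern (illustrative names only, not memcached's actual handler):

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t stop_requested = 0;

static void on_sigusr1(int sig) {
    (void)sig;
    stop_requested = 1; /* only set a flag; the real work happens below */
}

int main(void) {
    signal(SIGUSR1, on_sigusr1);
    while (!stop_requested) {
        pause(); /* stand-in for the event loop serving requests */
    }
    /* graceful path: persist metadata so the next start can recover */
    printf("writing metadata save file before exit\n");
    return 0;
}
```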
To Reproduce
Unfortunately I don’t have concrete reproduction steps, but here is what I’ve observed: restart + recovery works without issues after the workload has been running for a few hours. After about a day, restarting the cluster reveals that roughly 10% of machines fail to recover their data. After waiting a couple more days, another restart shows roughly 20% of machines failing in the same way.
My workload is a relatively constant ~40k read QPS + ~1k write QPS to each memcached instance.
System Information
- OS/Distro: Ubuntu 18.04
- Version of OS/distro:
Linux <hostname scrub> 4.15.0-1057-aws #59-Ubuntu SMP Wed Dec 4 10:02:00 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Version of memcached: 1.6.2
- Hardware detail: AWS c5.xlarge
Detail (please include!)
stats settings
STAT maxbytes 4194304000
STAT maxconns 32768
STAT tcpport 11211
STAT udpport 0
STAT inter 0.0.0.0
STAT verbosity 1
STAT oldest 0
STAT evictions on
STAT domain_socket NULL
STAT umask 700
STAT growth_factor 1.25
STAT chunk_size 48
STAT num_threads 4
STAT num_threads_per_udp 4
STAT stat_key_prefix :
STAT detail_enabled no
STAT reqs_per_event 20
STAT cas_enabled yes
STAT tcp_backlog 1024
STAT binding_protocol auto-negotiate
STAT auth_enabled_sasl no
STAT auth_enabled_ascii no
STAT item_size_max 1048576
STAT maxconns_fast yes
STAT hashpower_init 17
STAT slab_reassign yes
STAT slab_automove 1
STAT slab_automove_ratio 0.80
STAT slab_automove_window 30
STAT slab_chunk_max 524288
STAT lru_crawler yes
STAT lru_crawler_sleep 500
STAT lru_crawler_tocrawl 0
STAT tail_repair_time 0
STAT flush_enabled yes
STAT dump_enabled yes
STAT hash_algorithm murmur3
STAT lru_maintainer_thread yes
STAT lru_segmented yes
STAT hot_lru_pct 20
STAT warm_lru_pct 40
STAT hot_max_factor 0.20
STAT warm_max_factor 2.00
STAT temp_lru no
STAT temporary_ttl 61
STAT idle_timeout 0
STAT watcher_logbuf_size 262144
STAT worker_logbuf_size 65536
STAT resp_obj_mem_limit 0
STAT read_buf_mem_limit 0
STAT track_sizes no
STAT inline_ascii_response no
STAT ext_item_size 512
STAT ext_item_age 4294967295
STAT ext_low_ttl 0
STAT ext_recache_rate 2000
STAT ext_wbuf_size 4194304
STAT ext_compact_under 0
STAT ext_drop_under 0
STAT ext_max_frag 0.80
STAT slab_automove_freeratio 0.010
STAT ext_drop_unread no
STAT ssl_enabled yes
STAT ssl_chain_cert <scrub>
STAT ssl_key <scrub>
STAT ssl_verify_mode 3
STAT ssl_keyformat 1
STAT ssl_ciphers NULL
STAT ssl_ca_cert <scrub>
STAT ssl_wbuf_size 16384
END
stats
STAT pid 26539
STAT uptime 141361
STAT time 1585062298
STAT version c26b9e3068dc55a3c189885f27807b2137f88b34
STAT libevent 2.1.8-stable
STAT pointer_size 64
STAT rusage_user 13329.531742
STAT rusage_system 17340.832429
STAT max_connections 32768
STAT curr_connections 13865
STAT total_connections 39887
STAT rejected_connections 0
STAT connection_structures 14733
STAT response_obj_bytes 9344
STAT response_obj_total 8
STAT response_obj_free 6
STAT response_obj_oom 0
STAT read_buf_bytes 65536
STAT read_buf_bytes_free 32768
STAT read_buf_oom 0
STAT reserved_fds 20
STAT cmd_get 1396037459
STAT cmd_set 394019
STAT cmd_flush 0
STAT cmd_touch 0
STAT cmd_meta 0
STAT get_hits 632485600
STAT get_misses 763551859
STAT get_expired 0
STAT get_flushed 0
STAT delete_misses 93
STAT delete_hits 41826
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT touch_hits 0
STAT touch_misses 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 68444075035
STAT bytes_written 115101114078
STAT limit_maxbytes 4194304000
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT time_in_listen_disabled_us 0
STAT threads 4
STAT conn_yields 0
STAT hash_power_level 17
STAT hash_bytes 1048576
STAT hash_is_expanding 0
STAT slab_reassign_rescues 0
STAT slab_reassign_chunk_rescues 0
STAT slab_reassign_evictions_nomem 0
STAT slab_reassign_inline_reclaim 0
STAT slab_reassign_busy_items 0
STAT slab_reassign_busy_deletes 0
STAT slab_reassign_running 0
STAT slabs_moved 0
STAT lru_crawler_running 0
STAT lru_crawler_starts 11220
STAT lru_maintainer_juggles 5101359
STAT malloc_fails 0
STAT log_worker_dropped 0
STAT log_worker_written 0
STAT log_watcher_skipped 0
STAT log_watcher_sent 0
STAT ssl_handshake_errors 10406
STAT ssl_unauthorized_connections 0
STAT ssl_authorized_connections 0
STAT ssl_fail_open_connections 38196
STAT time_since_server_cert_refresh 185
STAT bytes 35309297
STAT curr_items 168900
STAT total_items 291375
STAT slab_global_page_pool 3928
STAT expired_unfetched 0
STAT evicted_unfetched 0
STAT evicted_active 0
STAT evictions 0
STAT reclaimed 0
STAT crawler_reclaimed 0
STAT crawler_items_checked 5483101
STAT lrutail_reflocked 2480
STAT moves_to_cold 77822
STAT moves_to_warm 9101
STAT moves_within_lru 477656
STAT direct_reclaims 0
STAT lru_bumps_dropped 0
END
The log output I observe on startup is:
[restart] original memory base: [0x7f8f52000000] new base: [0x7fc05a000000]
[restart] recovery start [1585004816.776776]
[restart] recovery end [1585004816.801179]
[restart] no metadata save file, starting with a clean cache
I’m not familiar with the restartable cache code paths, but I dug into this a bit and added some of my own debug logging. It seems like restart_check (https://github.com/memcached/memcached/blob/02c6a2b62ddcb6fa4569a591d3461a156a636305/restart.c#L72) is being run multiple times on startup. It fails the second time with "No such file or directory", presumably because the metadata file was unlinked during the first run of restart_check.
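To make the suspected sequence concrete, here is a standalone sketch of that failure mode with hypothetical names (check_meta standing in for restart_check, memory_file.meta for the save file); it is not memcached's actual code:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* the first call consumes (and unlinks) the metadata file on success,
 * so a second call finds nothing and falls back to a clean cache */
static int check_meta(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f) {
        /* a second invocation lands here: "No such file or directory" */
        fprintf(stderr, "restart: %s: %s\n", path, strerror(errno));
        return -1;
    }
    /* ... parse and validate the saved metadata ... */
    fclose(f);
    unlink(path); /* consumed: re-running the check now fails */
    return 0;
}

int main(void) {
    FILE *f = fopen("memory_file.meta", "w"); /* create a dummy save file */
    if (f) fclose(f);
    check_meta("memory_file.meta"); /* succeeds and unlinks the file */
    check_meta("memory_file.meta"); /* fails with ENOENT */
    return 0;
}
```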
It’s also worth noting that the recovery failure is reliably reproducible with the same data file: if I SIGUSR1 the process many times, it fails to recover every time. If I delete the data file, start the process, and then perform a graceful restart, recovery works. So maybe there is some corruption in the data file itself?
Apologies that I don’t have clear repro steps. Please let me know if there’s any other detail that might be helpful in tracing this down. Thanks!
About this issue
- State: closed
- Created 4 years ago
- Comments: 47 (47 by maintainers)
Cool. Well this one is another embarrassingly dumb bug:
When chunked items are deleted, the slab class id for the header chunk (small) isn’t restored. So what you’ve got there is a slabbed item chunk header in class N with chunk size 152 that the restart code suddenly thinks is a chunk of size 512k; that subtracts from page_remain, goes negative, and trips the assert. I will likely also fix these asserts to be runtime hard-failures. After that point the fixer ends up being unaligned.
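For illustration, here is a standalone sketch of that accounting failure; the class sizes, page layout, and slot numbers below are invented for the example, not memcached's actual structures:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE (1024 * 1024)

/* hypothetical per-class chunk sizes: class 0 = small header chunks,
 * class 1 = the largest (512k) chunks */
static const int64_t class_size[] = { 152, 524288 };

int main(void) {
    int64_t page_remain = PAGE_SIZE;
    for (int i = 0; page_remain >= class_size[0]; i++) {
        /* pretend the chunk in slot 4000 kept a stale large-class id
         * after its chunked item was deleted */
        int cls = (i == 4000) ? 1 : 0;
        page_remain -= class_size[cls];
        /* the walk's sanity check: fires at slot 4000, because
         * 4000 * 152 + 524288 overshoots the 1MB the page holds */
        assert(page_remain >= 0);
    }
    /* never reached: the assert aborts at the mislabeled slot */
    printf("page walked cleanly, %lld bytes left\n", (long long)page_remain);
    return 0;
}
```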
I was able to reproduce the assert failure by adding two lines to the restart.t test to delete one of the chunked items it adds before the restart.
will do a proper fix in the morning as I have to adjust a few things to make it a lot more strict. Thanks again for sticking with it!
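As a rough sketch of what "a lot more strict" could look like (my guess at the shape of the hardening, not the actual patch): replace the debug-only assert, which is compiled out under NDEBUG, with a check that always runs and fails recovery cleanly instead of continuing on corrupt input.

```c
#include <stdint.h>
#include <stdio.h>

/* runtime hard-failure instead of a debug-only assert (hypothetical) */
static int check_page_remain(int64_t page_remain) {
    if (page_remain < 0) {
        fprintf(stderr, "[restart] corrupt chunk accounting "
                "(page_remain=%lld), aborting recovery\n",
                (long long)page_remain);
        return -1; /* caller falls back to a clean cache */
    }
    return 0;
}

int main(void) {
    /* the value the previous sketch hits at the mislabeled slot */
    return check_page_remain(-83712) ? 1 : 0;
}
```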
Thanks for checking. I’m always impressed with how modern deployment systems make debugging a pure joy. 😕