memcached: restart: unable to recover data in some cases

Describe the bug

After running with --memory-file enabled for a few days, a graceful restart (SIGUSR1) saves the metadata file as expected, but the data cannot be recovered on the next startup.

To Reproduce

Unfortunately I don’t have concrete reproduction steps, but what I’ve observed is that restart + recovery works without issues after the workload has been running for a few hours. After about a day, restarting the cluster reveals that roughly 10% of machines fail to recover their data. After waiting a couple more days, another restart shows roughly 20% of machines failing in the same way.

My workload is a relatively constant ~40k read QPS + ~1k write QPS to each memcached instance.

System Information

  • OS/Distro: Ubuntu 18.04
  • Version of OS/distro: Linux <hostname scrub> 4.15.0-1057-aws #59-Ubuntu SMP Wed Dec 4 10:02:00 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Version of memcached: 1.6.2
  • Hardware detail: AWS c5.xlarge

Detail (please include!)

stats settings
STAT maxbytes 4194304000
STAT maxconns 32768
STAT tcpport 11211
STAT udpport 0
STAT inter 0.0.0.0
STAT verbosity 1
STAT oldest 0
STAT evictions on
STAT domain_socket NULL
STAT umask 700
STAT growth_factor 1.25
STAT chunk_size 48
STAT num_threads 4
STAT num_threads_per_udp 4
STAT stat_key_prefix :
STAT detail_enabled no
STAT reqs_per_event 20
STAT cas_enabled yes
STAT tcp_backlog 1024
STAT binding_protocol auto-negotiate
STAT auth_enabled_sasl no
STAT auth_enabled_ascii no
STAT item_size_max 1048576
STAT maxconns_fast yes
STAT hashpower_init 17
STAT slab_reassign yes
STAT slab_automove 1
STAT slab_automove_ratio 0.80
STAT slab_automove_window 30
STAT slab_chunk_max 524288
STAT lru_crawler yes
STAT lru_crawler_sleep 500
STAT lru_crawler_tocrawl 0
STAT tail_repair_time 0
STAT flush_enabled yes
STAT dump_enabled yes
STAT hash_algorithm murmur3
STAT lru_maintainer_thread yes
STAT lru_segmented yes
STAT hot_lru_pct 20
STAT warm_lru_pct 40
STAT hot_max_factor 0.20
STAT warm_max_factor 2.00
STAT temp_lru no
STAT temporary_ttl 61
STAT idle_timeout 0
STAT watcher_logbuf_size 262144
STAT worker_logbuf_size 65536
STAT resp_obj_mem_limit 0
STAT read_buf_mem_limit 0
STAT track_sizes no
STAT inline_ascii_response no
STAT ext_item_size 512
STAT ext_item_age 4294967295
STAT ext_low_ttl 0
STAT ext_recache_rate 2000
STAT ext_wbuf_size 4194304
STAT ext_compact_under 0
STAT ext_drop_under 0
STAT ext_max_frag 0.80
STAT slab_automove_freeratio 0.010
STAT ext_drop_unread no
STAT ssl_enabled yes
STAT ssl_chain_cert <scrub>
STAT ssl_key <scrub>
STAT ssl_verify_mode 3
STAT ssl_keyformat 1
STAT ssl_ciphers NULL
STAT ssl_ca_cert <scrub>
STAT ssl_wbuf_size 16384
END
stats
STAT pid 26539
STAT uptime 141361
STAT time 1585062298
STAT version c26b9e3068dc55a3c189885f27807b2137f88b34
STAT libevent 2.1.8-stable
STAT pointer_size 64
STAT rusage_user 13329.531742
STAT rusage_system 17340.832429
STAT max_connections 32768
STAT curr_connections 13865
STAT total_connections 39887
STAT rejected_connections 0
STAT connection_structures 14733
STAT response_obj_bytes 9344
STAT response_obj_total 8
STAT response_obj_free 6
STAT response_obj_oom 0
STAT read_buf_bytes 65536
STAT read_buf_bytes_free 32768
STAT read_buf_oom 0
STAT reserved_fds 20
STAT cmd_get 1396037459
STAT cmd_set 394019
STAT cmd_flush 0
STAT cmd_touch 0
STAT cmd_meta 0
STAT get_hits 632485600
STAT get_misses 763551859
STAT get_expired 0
STAT get_flushed 0
STAT delete_misses 93
STAT delete_hits 41826
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT touch_hits 0
STAT touch_misses 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 68444075035
STAT bytes_written 115101114078
STAT limit_maxbytes 4194304000
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT time_in_listen_disabled_us 0
STAT threads 4
STAT conn_yields 0
STAT hash_power_level 17
STAT hash_bytes 1048576
STAT hash_is_expanding 0
STAT slab_reassign_rescues 0
STAT slab_reassign_chunk_rescues 0
STAT slab_reassign_evictions_nomem 0
STAT slab_reassign_inline_reclaim 0
STAT slab_reassign_busy_items 0
STAT slab_reassign_busy_deletes 0
STAT slab_reassign_running 0
STAT slabs_moved 0
STAT lru_crawler_running 0
STAT lru_crawler_starts 11220
STAT lru_maintainer_juggles 5101359
STAT malloc_fails 0
STAT log_worker_dropped 0
STAT log_worker_written 0
STAT log_watcher_skipped 0
STAT log_watcher_sent 0
STAT ssl_handshake_errors 10406
STAT ssl_unauthorized_connections 0
STAT ssl_authorized_connections 0
STAT ssl_fail_open_connections 38196
STAT time_since_server_cert_refresh 185
STAT bytes 35309297
STAT curr_items 168900
STAT total_items 291375
STAT slab_global_page_pool 3928
STAT expired_unfetched 0
STAT evicted_unfetched 0
STAT evicted_active 0
STAT evictions 0
STAT reclaimed 0
STAT crawler_reclaimed 0
STAT crawler_items_checked 5483101
STAT lrutail_reflocked 2480
STAT moves_to_cold 77822
STAT moves_to_warm 9101
STAT moves_within_lru 477656
STAT direct_reclaims 0
STAT lru_bumps_dropped 0
END

The log output I observe on startup is:

[restart] original memory base: [0x7f8f52000000] new base: [0x7fc05a000000]
[restart] recovery start [1585004816.776776]
[restart] recovery end [1585004816.801179]
[restart] no metadata save file, starting with a clean cache

I’m not familiar with the restartable cache code paths, but I dug into this a bit and added some of my own debug logging. It seems that restart_check https://github.com/memcached/memcached/blob/02c6a2b62ddcb6fa4569a591d3461a156a636305/restart.c#L72 is run multiple times on startup. The second run fails here with No such file or directory, presumably because the metadata file was unlinked during the first run of restart_check.
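For illustration only, here is a minimal sketch (not memcached’s actual code; the file path and function name are made up) of how a metadata file that gets unlinked after the first successful read would make any second invocation fail with No such file or directory and fall back to a clean cache:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* hypothetical stand-ins for the real metadata path and check function */
static const char *meta_path = "/tmp/memory_file.meta";

static int check_restart_meta(void) {
    FILE *f = fopen(meta_path, "r");
    if (f == NULL) {
        fprintf(stderr, "no metadata save file (%s), starting with a clean cache\n",
                strerror(errno));
        return -1;                 /* caller starts with an empty cache */
    }
    /* ... parse and validate the saved metadata here ... */
    fclose(f);
    unlink(meta_path);             /* file is consumed: a later call finds nothing */
    return 0;
}

int main(void) {
    FILE *f = fopen(meta_path, "w");   /* create a dummy metadata file */
    if (f != NULL) { fputs("version 1\n", f); fclose(f); }

    check_restart_meta();   /* first run: succeeds and unlinks the file */
    check_restart_meta();   /* second run: No such file or directory */
    return 0;
}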

It’s also worth noting that the recovery failure is reliably reproducible with the same data file, e.g. if I SIGUSR1 the process many times, it fails to recover every time. If I delete the data file, start the process, then perform a graceful restart, recovery works. So maybe there is some corruption in the data file itself?

Apologies that I don’t have clear repro steps. Please let me know if there’s any other detail that might be helpful in tracing this down. Thanks!

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 47 (47 by maintainers)

Most upvoted comments

Cool. Well this one is another embarrassingly dumb bug:

When chunked items are deleted, the slab class id for the header chunk (small) isn’t restored. So what you’ve got there is a slabbed item chunk header: in class N with chunk size 152, the restart code suddenly thinks it’s hit a chunk of size 512k, which subtracts from page_remain and goes negative, which trips the assert. I will likely also fix these asserts to be runtime hard failures. After that point the fixer ends up being unaligned.
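A toy model of the accounting described above (not the actual restart code; the page size, chunk sizes, and counts are chosen only for illustration), showing how a single freed header chunk that still reports the large class’s 512k size drives page_remain negative and trips the assert:

#include <assert.h>
#include <stdio.h>

int main(void) {
    long page_remain = 1024 * 1024;   /* one 1 MB slab page */

    /* chunks that correctly report the small class's 152-byte size */
    for (int i = 0; i < 6000; i++)
        page_remain -= 152;
    printf("after 6000 correctly-sized chunks: %ld bytes remain\n", page_remain);

    /* one deleted header chunk whose slab class id was never restored,
     * so it reports the 512k chunk size instead of 152 */
    page_remain -= 524288;
    printf("after the stale chunk: %ld bytes remain\n", page_remain);
    assert(page_remain >= 0);   /* goes negative, so this assert fires */
    return 0;
}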

I was able to reproduce the assert failure by adding two lines to the restart.t test to delete one of the chunked items it adds before the restart.

Will do a proper fix in the morning, as I have to adjust a few things to make it a lot more strict. Thanks again for sticking with it!

Thanks for checking. I’m always impressed with how modern deployment systems make debugging a pure joy. 😕