etcd: defrag stops the process completely for a noticeable duration

What I observed

The defrag command blocks all requests handled by the process for a while when the process holds a lot of data and is serving incoming requests. New requests and ongoing streaming requests such as Watch() can be canceled when their deadlines expire, and the cluster can become unstable.
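For reference, a minimal Go sketch of how one might trigger defragmentation member by member with the clientv3 Maintenance API, bounding each call with a timeout so a stalled member does not hang the whole loop (the endpoints and timeout values are made up, and the import path assumes a recent client release):

```go
// Minimal sketch: defragment one endpoint at a time with a per-call timeout,
// so a stalled member does not block the whole maintenance loop.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{"127.0.0.1:2379", "127.0.0.1:22379", "127.0.0.1:32379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for _, ep := range endpoints {
		// Defragment blocks requests served by this member, so bound it with a
		// timeout and defragment members sequentially to limit the blast radius.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		_, err := cli.Defragment(ctx, ep)
		cancel()
		if err != nil {
			log.Printf("defragment %s failed: %v", ep, err)
			continue
		}
		log.Printf("defragmented %s", ep)
	}
}
```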

Side effects

This “stop the world” behavior makes operating etcd clusters harder. We have the 8GB limit on the DB size and need to keep the DB size from reaching that limit. However, clusters won’t release DB space even after compaction until you run the defrag command. That means you cannot keep this important metric under control without virtually stopping every etcd process periodically.

(Btw, can we add a new metric for “actual DB usage”?)
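Until such a metric exists, a rough way to watch per-member DB size from outside is the Status RPC. Below is a small sketch assuming the clientv3 Maintenance API; newer releases also report an “in use” size (DbSizeInUse), which is closer to the “actual DB usage” asked for above:

```go
// Minimal sketch: poll the physical DB size per member via the Status RPC,
// to decide when the 8GB quota is being approached.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{"127.0.0.1:2379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		st, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			log.Printf("status %s failed: %v", ep, err)
			continue
		}
		// DbSize is the size of the backend database file allocated on disk,
		// which only shrinks after defragmentation, not after compaction.
		log.Printf("%s: db size = %d bytes", ep, st.DbSize)
	}
}
```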

CC/ @xiang90

About this issue

  • State: open
  • Created 6 years ago
  • Comments: 21 (18 by maintainers)

Most upvoted comments

@gyuho

we probably want to change the behavior to non-stop-the-world.

@cenkalti pending entries are stored in memory but have also already been written to the WAL. Pending entries must be committed entries, which means they have been persisted on a majority of the nodes in the cluster.

Copying my comment from https://github.com/kubernetes/kubernetes/issues/93280

An important optimization in etcd is the separation of log replication (i.e., Raft) and the state machine (i.e., key-value storage), where a slow apply in the state machine only affects the local key-value storage without blocking replication itself. This presents a set of challenges for the defrag operation:

  • A write request to “any” node (leader or follower) incurs a proposal message to the leader for log replication.
  • The log replication proceeds regardless of the local storage state: the committed index may increment while the applied index stalls (e.g. due to slow apply or defragmentation). The state machine still accepts proposals.
  • As a result, the leader’s applied index may fall behind a follower’s, or vice versa.
  • The divergence stops when the gap between the applied and committed index reaches 5,000: mutable requests fail with a “too many requests” error, while immutable requests may succeed (see the sketch after this list).
  • It is worse if the slow apply happens due to defrag: all immutable requests block as well.
  • It is even worse if the server with pending applies crashes: volatile state may be lost, which violates durability. Raft advances without waiting for the state machine layer to apply.
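To make the backpressure above concrete, here is a conceptual sketch, not etcd’s actual code (raftStatus, maxGap, and errTooManyRequests are hypothetical names), of how a bounded committed/applied gap turns a stalled apply into rejected writes:

```go
// Conceptual sketch: a bounded gap between the committed index and the applied
// index makes a slow apply (e.g. during defrag) surface as "too many requests"
// errors for mutable requests.
package main

import (
	"errors"
	"fmt"
)

var errTooManyRequests = errors.New("etcdserver: too many requests")

// raftStatus models the two indexes discussed above.
type raftStatus struct {
	committedIndex uint64 // advanced by log replication (quorum persistence)
	appliedIndex   uint64 // advanced by the state machine (key-value store)
}

const maxGap = 5000 // upper bound on committed - applied before rejecting writes

// checkProposal decides whether a new mutable request may be proposed.
func (s *raftStatus) checkProposal() error {
	if s.committedIndex > s.appliedIndex && s.committedIndex-s.appliedIndex > maxGap {
		return errTooManyRequests
	}
	return nil
}

func main() {
	s := &raftStatus{committedIndex: 12000, appliedIndex: 6000} // apply stalled by defrag
	if err := s.checkProposal(); err != nil {
		fmt.Println("write rejected:", err) // write rejected: etcdserver: too many requests
	}
}
```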

In the short term, we may reduce the upper bound from 5k to a lower number. Also, we can use the consistent index to apply pending entries and safely survive restarts. Right now, the consistent index is only used for snapshot comparison and preventing re-applies.
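As a rough illustration of the consistent-index idea (the types and names below are hypothetical, not etcd’s real API): persist the index of the last applied entry together with the state machine, and on restart skip WAL entries at or below it while re-applying the rest:

```go
// Hypothetical sketch of a consistent-index guard against re-applies after restart.
package main

import "fmt"

type entry struct {
	Index uint64
	Data  string
}

type stateMachine struct {
	consistentIndex uint64            // persisted atomically with the data below
	kv              map[string]string // the key-value state
}

// apply replays an entry only if it was not already applied before the crash.
func (s *stateMachine) apply(e entry) {
	if e.Index <= s.consistentIndex {
		return // already reflected in the persisted state; skip the re-apply
	}
	s.kv[fmt.Sprintf("key-%d", e.Index)] = e.Data
	s.consistentIndex = e.Index
}

func main() {
	// Simulate a restart where the backend persisted up to index 3, but the
	// WAL holds committed entries up to index 5 that were never applied.
	s := &stateMachine{consistentIndex: 3, kv: map[string]string{}}
	wal := []entry{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}, {5, "e"}}
	for _, e := range wal {
		s.apply(e)
	}
	fmt.Println(s.kv) // only entries 4 and 5 are applied after the restart
}
```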

Please correct me if I am wrong.