etcd: defrag stops the process completely for noticeable duration
What I observed
The defrag command blocks all requests handled by the process for a noticeable duration when the process holds a large amount of data and is receiving traffic.
New requests and ongoing streaming requests such as Watch()
can be canceled by their deadlines, and the cluster can become unstable.
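For illustration, here is a minimal client-side sketch (not from the report) of the effect: a request with a deadline issued against an endpoint that is being defragmented can be canceled by that deadline. The endpoint address and key are placeholders.

```go
// Minimal sketch, assuming a local member at 127.0.0.1:2379 with a large DB.
// The clientv3 import path differs across etcd releases (go.etcd.io/etcd/clientv3 in 3.4).
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Start an online defragmentation of the member.
	go func() {
		if _, err := cli.Defragment(context.Background(), "127.0.0.1:2379"); err != nil {
			fmt.Println("defrag error:", err)
		}
	}()
	time.Sleep(100 * time.Millisecond) // give the defrag a moment to start

	// A read with a 1s deadline issued while the defrag is running.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if _, err := cli.Get(ctx, "some-key"); err != nil {
		// On a large DB this typically fails with context.DeadlineExceeded,
		// because the member does not serve requests while defragmenting.
		fmt.Println("get failed during defrag:", err)
	}
}
```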
Side effects
This “stop the world” behavior makes operating etcd clusters harder. We have the 8GB limit on the DB size and need to keep the DB from reaching that limit. However, clusters won’t release DB space even after compaction until you run the defrag command. That means you cannot keep this important metric under control without virtually stopping every etcd process periodically.
(Btw, can we add a new metric for “actual DB usage”?)
CC/ @xiang90
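As a hedged, operator-side sketch (not part of the original report) of the workflow described above: the maintenance API's Status call returns DbSize, the backend file size that compaction alone does not shrink, and defrag is the step that actually reclaims it. The endpoint and the threshold below are placeholders.

```go
// Hedged sketch: poll the reported on-disk DB size and defragment once it
// approaches the quota. The endpoint and 6 GiB threshold are placeholders.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const defragThreshold = int64(6 << 30) // placeholder: 6 GiB, below the 8GB quota

func maybeDefrag(cli *clientv3.Client, endpoint string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	st, err := cli.Status(ctx, endpoint)
	if err != nil {
		return err
	}
	// DbSize is the size of the backend file; compaction alone does not shrink it.
	log.Printf("%s db size: %d bytes", endpoint, st.DbSize)
	if st.DbSize < defragThreshold {
		return nil
	}

	// This is the "stop the world" step discussed in this issue: the member
	// blocks requests while its backend is rewritten.
	_, err = cli.Defragment(ctx, endpoint)
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	if err := maybeDefrag(cli, "127.0.0.1:2379"); err != nil {
		log.Fatal(err)
	}
}
```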
@gyuho
We probably want to change this behavior so that defrag is no longer stop-the-world.
@cenkalti pending entries are stored in memory, but they have also already been written to the WAL. Pending entries must be committed entries, i.e., entries that have been persisted on a majority of the nodes in the cluster.
Copying my comment from https://github.com/kubernetes/kubernetes/issues/93280
An important optimization in etcd is the separation of log replication (i.e., Raft) and the state machine (i.e., the key-value storage), where a slow apply in the state machine only affects the local key-value storage without blocking replication itself. This presents a set of challenges for the defrag operation:
In the short term, we may reduce the upper bound from 5k to a lower number. Also, we can use the consistent index to apply pending entries so that we safely survive restarts. Right now, the consistent index is only used for snapshot comparison and for preventing re-applies.
Please correct me if I am wrong.
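A toy Go sketch of the consistent-index idea described above (illustrative only, not etcd's actual code): the index of the last applied entry is recorded together with the key-value change, so after a restart entries at or below that index are skipped instead of being re-applied.

```go
// Toy illustration (not etcd's implementation) of a consistent index:
// the apply loop persists the index of the last applied entry alongside the
// state-machine change, so committed-but-already-applied entries are skipped
// after a restart rather than being re-applied.
package main

import (
	"fmt"
	"strings"
)

type entry struct {
	Index uint64
	Data  string // "key=value" in this toy example
}

type backend struct {
	consistentIndex uint64            // in etcd this would be persisted atomically with the data
	kv              map[string]string // stand-in for the bolt-backed store
}

func (b *backend) apply(e entry) {
	if e.Index <= b.consistentIndex {
		return // already applied before the restart; avoid a re-apply
	}
	parts := strings.SplitN(e.Data, "=", 2)
	b.kv[parts[0]] = parts[1]
	b.consistentIndex = e.Index // in etcd this write is part of the same transaction
}

func main() {
	// Simulate a restart where entries up to index 2 were already applied.
	b := &backend{consistentIndex: 2, kv: map[string]string{}}
	for _, e := range []entry{{1, "a=old"}, {2, "b=old"}, {3, "c=new"}} {
		b.apply(e) // entries 1 and 2 are skipped, entry 3 is applied
	}
	fmt.Println(b.kv, b.consistentIndex) // map[c:new] 3
}
```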