pd: Regions are not distributed evenly after adding new servers
Hi all, I have a problem with data distribution in the cluster after adding new servers to the existing cluster. I have two datacenters in one city. Initially, I ran only one datacenter (id: 1) with 5 servers. Then I added a new datacenter (id: 2) with 3 servers. I configured leaders to be placed only in datacenter 1, so the servers in datacenter 2 hold no leaders. I then monitored the cluster and saw that data was rebalanced from datacenter 1 to datacenter 2. But there is a problem, as you can see in the image below:

We can clearly see that node 1.0.1.20 (a node in datacenter 1) and node 1.0.0.23 (a node in datacenter 2) have a higher region score than the other nodes, even though leaders are balanced across the nodes in the cluster.
Before I added the datacenter 2 servers to the datacenter 1 TiDB cluster, all datacenter 1 servers had the same region count and data size.
We also triggered region rebalancing manually, but it did not resolve the problem.
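(For reference, the per-store leader count, region count, and region score shown in the screenshot can also be pulled directly from PD. A minimal sketch, assuming pd-ctl and jq are installed, with a placeholder PD address:
pd-ctl -u http://<pd-address>:2379 store \
  | jq -r '.stores[] | [.store.address, .status.leader_count, .status.region_count, .status.region_score] | @tsv'
Each output line is one store's address, leader count, region count, and region score.)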
Here is my config:
» config show
{
  "replication": {
    "enable-placement-rules": "true",
    "enable-placement-rules-cache": "false",
    "isolation-level": "dc",
    "location-labels": "zone,dc,rack,host",
    "max-replicas": 5,
    "strictly-match-label": "false"
  },
  "schedule": {
    "enable-cross-table-merge": "true",
    "enable-joint-consensus": "true",
    "high-space-ratio": 0.7,
    "hot-region-cache-hits-threshold": 2,
    "hot-region-schedule-limit": 8,
    "hot-regions-reserved-days": 7,
    "hot-regions-write-interval": "10m0s",
    "leader-schedule-limit": 4,
    "leader-schedule-policy": "size",
    "low-space-ratio": 0.8,
    "max-merge-region-keys": 200000,
    "max-merge-region-size": 20,
    "max-pending-peer-count": 64,
    "max-snapshot-count": 64,
    "max-store-down-time": "30m0s",
    "merge-schedule-limit": 8,
    "patrol-region-interval": "10ms",
    "region-schedule-limit": 2048,
    "region-score-formula-version": "v2",
    "replica-schedule-limit": 64,
    "split-merge-interval": "1h0m0s",
    "tolerant-size-ratio": 20
  }
}
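(Since enable-placement-rules is "true" and leaders are pinned to datacenter 1, the active placement rules are also relevant here. If the leader constraint was configured through placement rules, they can be dumped for inspection with pd-ctl, for example (a sketch; placeholder PD address):
pd-ctl -u http://<pd-address>:2379 config placement-rules show
pd-ctl -u http://<pd-address>:2379 config placement-rules load --out=rules.json
The second command writes the current rules to rules.json so they can be reviewed or attached.)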
I think PD seems to be calculating the region score incorrectly, leading to unbalanced region scores.
I have some questions:
- What exactly is the problem here?
- Why does it happen?
- How can I balance the data in the cluster? How can I fix this?
Thank you.
About this issue
- State: closed
- Created 2 years ago
- Comments: 33 (15 by maintainers)
I think in your case, the regions are distributed evenly at the rack level, since you have set 5 replicas. From the store information, we can see that the number of stores in racks 1AL33 : 1AF38 : 1AR25 : 1AP09 : 1AP07 is 2 : 2 : 1 : 1 : 2. So the stores in racks 1AR25 and 1AP09, which are store 1 and store 263685, will have roughly double the number of regions compared to the others.
If you are using TiUP, see https://docs.pingcap.com/tidb/dev/scale-tidb-using-tiup for more details.
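To double-check this, each store's rack label and region count can be listed side by side; a sketch with pd-ctl and jq (placeholder PD address), which should show the stores in the single-store racks holding roughly twice as many regions:
pd-ctl -u http://<pd-address>:2379 store \
  | jq -r '.stores[] | [(.store.labels[]? | select(.key == "rack") | .value), .store.id, .status.region_count] | @tsv' \
  | sort
Each output line is the rack label, store ID, and region count, sorted so stores in the same rack appear together.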
You can take a look at https://docs.pingcap.com/tidb/stable/schedule-replicas-by-topology-labels
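For context, topology labels are attached on the TiKV side and the label hierarchy is declared on the PD side. A minimal sketch of both halves, with placeholder label values and PD address (other flags omitted):
# On each TiKV instance, attach its own topology labels at startup:
tikv-server --labels zone=z1,dc=d1,rack=r1,host=h1 ...
# On PD, declare the label hierarchy and, if desired, the isolation level:
pd-ctl -u http://<pd-address>:2379 config set location-labels zone,dc,rack,host
pd-ctl -u http://<pd-address>:2379 config set isolation-level dc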
PD won’t automatically set a label for TiKV, so it could be that you have such a configuration on the TiKV side. You can change the store label by using pd-ctl. See https://docs.pingcap.com/tidb/stable/pd-control#store-delete--cancel-delete--label--weight--remove-tombstone--limit--store_id---jqquery-string. For example,
store label 1 zone a dc b rack c host d --force
will overwrite the labels of store 1 to {zone: a, dc: b, rack: c, host: d}.
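For example, relabeling store 1 non-interactively and then verifying the result could look like this (a sketch; placeholder PD address, jq assumed available):
pd-ctl -u http://<pd-address>:2379 store label 1 zone a dc b rack c host d --force
pd-ctl -u http://<pd-address>:2379 store 1 | jq '.store.labels'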