karmada: Replicas may be rescheduled even if `Failover` is disabled
What happened:
In some circumstances, even if `Failover` is disabled, replicas in a member cluster still have a chance to be removed, especially when scaling up.
E.g.,
- We have a deployment with 5 desired replicas which are propagated to member cluster A.
- Cluster A is a little bit busy, so the connection between karmada-agent and the kube-apiserver of member A is lost for about 1 minute. As a result, the cluster status becomes unhealthy.
- It happens that the deployment is scaled up to 10 replicas at this time.
- Even though `Failover` is disabled, the scale-up triggers the scheduling procedure, and karmada-scheduler will remove the replicas in cluster A, which may be dangerous: it is quite possible that there is nothing wrong with the 5 replicas in member cluster A (see the sketch below).
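For illustration, here is a minimal, self-contained Go sketch (not karmada's actual code; `divideReplicas` and the `cluster` type are made up) of why the scale-up can strip the replicas from cluster A: the divide step simply redistributes the full desired count across the clusters that are currently considered healthy, ignoring what is already assigned to cluster A.

```go
// Minimal sketch, assuming a naive divide step; this is not karmada's real code.
package main

import "fmt"

type cluster struct {
	name     string
	healthy  bool
	assigned int32 // replicas currently recorded in spec.clusters
}

// divideReplicas spreads the full desired count over healthy clusters only.
// Previous assignments on unhealthy clusters are silently dropped, which is
// the behavior this issue complains about.
func divideReplicas(desired int32, clusters []cluster) map[string]int32 {
	var healthy []string
	for _, c := range clusters {
		if c.healthy {
			healthy = append(healthy, c.name)
		}
	}
	result := map[string]int32{}
	if len(healthy) == 0 {
		return result
	}
	base := desired / int32(len(healthy))
	rem := desired % int32(len(healthy))
	for i, name := range healthy {
		result[name] = base
		if int32(i) < rem {
			result[name]++
		}
	}
	return result
}

func main() {
	clusters := []cluster{
		{name: "member-a", healthy: false, assigned: 5}, // karmada-agent lost its connection
		{name: "member-b", healthy: true, assigned: 0},
	}
	// Scaling up from 5 to 10 triggers rescheduling; member-a's 5 replicas are
	// gone from the new assignment even though Failover is disabled.
	fmt.Println(divideReplicas(10, clusters)) // map[member-b:10]
}
```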
What you expected to happen: We should not delete the replicas in the unhealthy cluster.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Karmada version:
- kubectl-karmada or karmadactl version (the result of `kubectl-karmada version` or `karmadactl version`):
- Others:
Let me list the rescheduling scenarios; the following is what I think now:
Hey guys, I have some preliminary ideas about this issue.
First, let's focus on why this bug occurs.
Actually, when a cluster is unhealthy, it's dangerous to delete all replicas in this cluster, since a lot of workloads may be affected.
Therefore, I prefer not to let karmada-scheduler remove target clusters from `spec.clusters`. Learning from what kube-scheduler does: it only cares about scheduling a pod to a node, while the descheduler focuses on evicting pods from undesired nodes based on different policies. Here is my plan (a rough sketch follows the list):
- `cluster_lifecycle_controller`: This controller is inspired by `node_lifecycle_controller` in Kubernetes. It is responsible for removing previous target clusters from `spec.clusters` when a cluster runs into certain situations, e.g., the cluster is deleted or the cluster is unhealthy (only taking effect when `Failover` is enabled). This is pretty much like how `node_lifecycle_controller` evicts pods at a specified eviction rate when a node is unhealthy.
- `karmada-scheduler` when dividing replicas: The karmada-scheduler does not remove previous target clusters and only cares about adding new target clusters or adjusting replicas. All eviction should be handled by other components, such as `cluster_lifecycle_controller` or the descheduler.
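Here is a rough Go sketch of that split (names like `scheduleOnScaleUp`, `evictUnhealthy`, and the data types are hypothetical, not karmada's actual API): the scheduler preserves every previous target cluster and only distributes the extra replicas, while only the proposed `cluster_lifecycle_controller` may evict a cluster, and only when `Failover` is enabled.

```go
// Hypothetical sketch of the proposed split; none of these names come from karmada.
package main

import "fmt"

type targetCluster struct {
	Name     string
	Replicas int32
}

// scheduleOnScaleUp keeps every previous target cluster and only hands the
// extra replicas to healthy clusters. It never drops a previous target, even
// if that cluster is currently unhealthy.
func scheduleOnScaleUp(prev []targetCluster, desired int32, healthy []string) []targetCluster {
	var assigned int32
	for _, t := range prev {
		assigned += t.Replicas
	}
	extra := desired - assigned
	result := append([]targetCluster{}, prev...)
	for i := 0; extra > 0 && len(healthy) > 0; i = (i + 1) % len(healthy) {
		result = addOne(result, healthy[i]) // round-robin the extra replicas (simplified)
		extra--
	}
	return result
}

func addOne(targets []targetCluster, name string) []targetCluster {
	for i := range targets {
		if targets[i].Name == name {
			targets[i].Replicas++
			return targets
		}
	}
	return append(targets, targetCluster{Name: name, Replicas: 1})
}

// evictUnhealthy belongs to the proposed cluster_lifecycle_controller: it is
// the only place a cluster may be removed, and only when Failover is enabled.
func evictUnhealthy(targets []targetCluster, unhealthy string, failover bool) []targetCluster {
	if !failover {
		return targets // Failover disabled: keep replicas on the unhealthy cluster
	}
	kept := targets[:0]
	for _, t := range targets {
		if t.Name != unhealthy {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	prev := []targetCluster{{Name: "member-a", Replicas: 5}} // member-a is unhealthy
	targets := scheduleOnScaleUp(prev, 10, []string{"member-b"})
	fmt.Println(evictUnhealthy(targets, "member-a", false)) // member-a keeps its 5 replicas
}
```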
Actually, must we reschedule all workloads without considering the previous PropagationPolicy and the previous result? I'm not sure whether this behavior is proper, but right now it seems a little bit arbitrary. If we decide to keep this behavior, we could delete all previous results in `spec.clusters` before dividing replicas.
Sounds good. The descheduler is meant to evict pods for workloads. For now it only focuses on `unschedulable` pods; the failover replicas could be counted in as well. However, the descheduler works periodically (default interval: 2 minutes). Any scenario associated with a PropagationPolicy change should be handled by the scheduler, not by the descheduler. We could have a further discussion tomorrow.
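For comparison, a small hedged sketch (the `binding` type and `deschedulePass` are made up for illustration) of the division of labor being discussed: a descheduler-style pass runs on a timer and evicts only replicas that are unschedulable in their current cluster, while anything driven by a PropagationPolicy change stays on the scheduler's event-driven path.

```go
// Hypothetical sketch; not karmada's real descheduler.
package main

import "fmt"

type binding struct {
	Workload      string
	Cluster       string
	Unschedulable bool // e.g. the member cluster has no capacity left
}

// deschedulePass evicts only unschedulable replicas; healthy ones are untouched.
func deschedulePass(bindings []binding) []binding {
	kept := bindings[:0]
	for _, b := range bindings {
		if b.Unschedulable {
			fmt.Printf("evicting %s from %s\n", b.Workload, b.Cluster)
			continue
		}
		kept = append(kept, b)
	}
	return kept
}

func main() {
	bindings := []binding{
		{Workload: "nginx", Cluster: "member-a", Unschedulable: false},
		{Workload: "redis", Cluster: "member-b", Unschedulable: true},
	}
	// A real descheduler would run this on a timer (the 2-minute default
	// mentioned above); we run one pass here so the example terminates.
	fmt.Println("remaining:", deschedulePass(bindings))
}
```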
By the way, do you think the descheduler should be responsible for these scenarios? @Garrybest