karmada: Replicas may be rescheduled even if `Failover` is disabled

What happened:

In some circumstances, even if Failover is disabled, replicas in a member cluster still have a chance to be removed, especially when scaling up.

E.g.,

  1. We have a deployment with 5 desired replicas which are propagated to member cluster A.
  2. The cluster is a little busy, so the connection between karmada-agent and the kube-apiserver of member A is lost for about 1 minute. As a result, the cluster status becomes unhealthy.
  3. It happens that the deployment is scaled up to 10 replicas at this time.
  4. Even though Failover is disabled, the scale-up triggers the scheduling procedure, and karmada-scheduler will remove the replicas in cluster A, which may be dangerous: it is possible that there is nothing wrong with the 5 replicas already running in member cluster A.

What you expected to happen: We should not delete the replicas in an unhealthy cluster.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Karmada version:
  • kubectl-karmada or karmadactl version (the result of kubectl-karmada version or karmadactl version):
  • Others:


Most upvoted comments

Let me list the rescheduling scenarios; the following is what I think now:

  1. propagation policy changes
  2. cluster is unhealthy when Failover is enabled
  3. cluster is unhealthy when Failover is disabled
  4. cluster is deleted
  5. a new cluster that fits the PP is added
  6. the replicas of a workload change when the scheduled clusters are healthy
  7. the replicas of a workload change when some scheduled clusters are unhealthy

Hey guys, I have some preliminary ideas about this issue.

First, let’s focus on why this bug occurs.

  1. When dividing replicas, we remove the previous target clusters which are not selected by the scheduler in this round. See here.
  2. All replicas not in the selected clusters are considered as replicas that need to be rescheduled, so we trigger scale-up scheduling for a second scheduling pass (see the sketch below).
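
To make the two steps above concrete, here is a minimal, self-contained Go sketch of the behavior as I understand it; `TargetCluster` is a simplified stand-in for Karmada's `workv1alpha2.TargetCluster`, and `divideReplicas` is a made-up name for illustration, not the real function:

```go
package main

import "fmt"

// TargetCluster is a simplified stand-in for Karmada's
// workv1alpha2.TargetCluster: a cluster name plus the replicas assigned to it.
type TargetCluster struct {
	Name     string
	Replicas int32
}

// divideReplicas mimics the problematic behavior: previous target clusters
// that are NOT selected in this round are dropped, and the replicas already
// running there are treated as a shortfall that gets rescheduled onto the
// selected clusters (a scale-up style second scheduling).
func divideReplicas(previous []TargetCluster, selected []string, desired int32) []TargetCluster {
	prevReplicas := make(map[string]int32, len(previous))
	for _, tc := range previous {
		prevReplicas[tc.Name] = tc.Replicas
	}

	// Keep only clusters selected in this round; unselected previous
	// targets (e.g. a temporarily unhealthy cluster) simply disappear.
	result := make([]TargetCluster, 0, len(selected))
	var assigned int32
	for _, name := range selected {
		r := prevReplicas[name] // 0 if the cluster was not a previous target
		result = append(result, TargetCluster{Name: name, Replicas: r})
		assigned += r
	}

	// The shortfall, including replicas from the dropped clusters, is
	// rescheduled; here it just goes to the first selected cluster.
	if missing := desired - assigned; missing > 0 && len(result) > 0 {
		result[0].Replicas += missing
	}
	return result
}

func main() {
	previous := []TargetCluster{{Name: "member-a", Replicas: 5}}
	// member-a is temporarily unhealthy, so this round only selects member-b.
	fmt.Println(divideReplicas(previous, []string{"member-b"}, 10))
	// Output: [{member-b 10}] -> the 5 replicas in member-a are removed.
}
```

In this example the 5 replicas that may still be running fine in member-a are silently dropped, and all 10 desired replicas are pushed to member-b.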

Actually, when a cluster is unhealthy, it is dangerous to delete all replicas in that cluster; a lot of workloads may be affected.

Therefore, I prefer not to let karmada-scheduler remove target clusters from spec.clusters. We can learn from what kube-scheduler does: it only cares about scheduling a pod to a node, while the descheduler focuses on evicting pods from undesired nodes based on different policies. Here is my plan.

  1. Add a new controller cluster_lifecycle_controller

This controller is inspired by node_lifecycle_controller in Kubernetes. It is responsible for removing previous target clusters from spec.clusters when a cluster runs into certain situations, e.g., the cluster is deleted or the cluster is unhealthy (only taking effect when Failover is enabled). This is much like node_lifecycle_controller evicting pods at a specified eviction rate when a node is unhealthy.
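
Just to illustrate the idea, here is a rough, hypothetical sketch of the eviction decision; `ClusterState`, `shouldEvictCluster` and `removeCluster` are all made-up names, and a real controller would of course work on Cluster and ResourceBinding objects and rate-limit evictions:

```go
package main

import "fmt"

// ClusterState is a simplified view of a member cluster's status as the
// hypothetical cluster_lifecycle_controller would see it.
type ClusterState struct {
	Name    string
	Deleted bool // the Cluster object is being deleted
	Healthy bool // derived from the cluster's Ready condition
}

// shouldEvictCluster sketches the eviction decision: a deleted cluster is
// always removed from spec.clusters, while an unhealthy cluster is removed
// only when Failover is enabled.
func shouldEvictCluster(c ClusterState, failoverEnabled bool) bool {
	if c.Deleted {
		return true
	}
	return !c.Healthy && failoverEnabled
}

// removeCluster drops a cluster from a binding's target cluster names
// (standing in for ResourceBinding spec.clusters).
func removeCluster(targets []string, name string) []string {
	out := targets[:0]
	for _, t := range targets {
		if t != name {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	targets := []string{"member-a", "member-b"}
	unhealthy := ClusterState{Name: "member-a", Healthy: false}

	// With Failover disabled, the unhealthy cluster keeps its replicas.
	if shouldEvictCluster(unhealthy, false) {
		targets = removeCluster(targets, unhealthy.Name)
	}
	fmt.Println(targets) // [member-a member-b]

	// With Failover enabled, the controller evicts it.
	if shouldEvictCluster(unhealthy, true) {
		targets = removeCluster(targets, unhealthy.Name)
	}
	fmt.Println(targets) // [member-b]
}
```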

  2. Stop removing previous target clusters in karmada-scheduler when dividing replicas.

With this change, karmada-scheduler does not remove previous target clusters and only cares about adding new target clusters or adjusting replicas. All eviction should be handled by other components, like cluster_lifecycle_controller or the descheduler.
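
Again only a sketch under my assumptions, not the real scheduler code: `mergeScheduleResult` (a hypothetical name) shows how the scheduler could keep previous target clusters untouched and only add or adjust:

```go
package main

import "fmt"

// TargetCluster is again a simplified stand-in for Karmada's
// workv1alpha2.TargetCluster.
type TargetCluster struct {
	Name     string
	Replicas int32
}

// mergeScheduleResult keeps every previous target cluster as-is (even if it
// was not selected in this round, e.g. because it is temporarily unhealthy)
// and only adds new target clusters or adjusts replicas on selected ones.
func mergeScheduleResult(previous, newResult []TargetCluster) []TargetCluster {
	merged := make(map[string]int32, len(previous)+len(newResult))
	order := make([]string, 0, len(previous)+len(newResult))

	// Start from the previous assignment so nothing is evicted here.
	for _, tc := range previous {
		merged[tc.Name] = tc.Replicas
		order = append(order, tc.Name)
	}
	// Apply this round's result: adjust replicas for selected clusters and
	// add clusters that were not targets before.
	for _, tc := range newResult {
		if _, ok := merged[tc.Name]; !ok {
			order = append(order, tc.Name)
		}
		merged[tc.Name] = tc.Replicas
	}

	out := make([]TargetCluster, 0, len(order))
	for _, name := range order {
		out = append(out, TargetCluster{Name: name, Replicas: merged[name]})
	}
	return out
}

func main() {
	previous := []TargetCluster{{Name: "member-a", Replicas: 5}}
	// member-a is unhealthy, so this round only places the 5 new replicas
	// (from the scale-up to 10) on member-b.
	newResult := []TargetCluster{{Name: "member-b", Replicas: 5}}
	fmt.Println(mergeScheduleResult(previous, newResult))
	// [{member-a 5} {member-b 5}] -> member-a's replicas are left alone.
}
```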

  3. How to handle ‘propagation policy changes’?

Actually, must we reschedule all workloads without considering the previous PP and scheduling result? I am not sure whether this behavior is proper; right now it seems a bit arbitrary. If we decide to keep this behavior, we could delete all previous results from spec.clusters before dividing replicas.

By the way, do you think the descheduler should be responsible for these scenarios? @Garrybest

Sounds good. The descheduler is meant to evict pods for workloads. Right now it only focuses on unschedulable pods; the failover replicas could be counted in as well.

However, the descheduler works periodically (default interval: 2 minutes). Any scenario associated with a PP change should be handled by the scheduler, not by the descheduler. We could have a further discussion tomorrow.
