kubernetes: Pod anti-affinity on a multi-replica Deployment is not honored when a pod restarts abnormally

What happened?

My Kubernetes cluster has three nodes. I create a Deployment with three replicas and set podAntiAffinity so that, when conditions allow, the three pods are spread evenly across the nodes. The problem: when a pod terminates abnormally (or kubectl delete pod is run manually) and a replacement is started, one node often ends up with two pods while another node has none, even though all three nodes would normally satisfy the anti-affinity constraint. preferredDuringSchedulingIgnoredDuringExecution is set, so why do some nodes receive no pod? I wonder whether this is a bug or whether additional configuration is needed. My guess is that when a pod is terminated abnormally, scheduling of the replacement pod begins before the old pod has finished terminating, and the terminating pod still participates in the anti-affinity calculation. If that is the case, shouldn't the Deployment controller wait for the old pod to terminate completely before scheduling a new one? Here is the anti-affinity section from the Deployment's pod spec:

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              taskManagerId: 30caaec8-747f-4425-bc63-b6b8412b4fb4
          topologyKey: kubernetes.io/hostname
        weight: 1
  containers:
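
Note that a preferred (soft) anti-affinity rule only biases the scheduler's scoring; it is still allowed to co-locate pods. For comparison, here is a sketch of the same rule written as a hard requirement (requiredDuringSchedulingIgnoredDuringExecution, assuming the same taskManagerId label); it forbids two such pods on one node, at the cost of leaving a replacement pod Pending until a node becomes free:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      # hard rule: no two pods with this label may share a hostname
      - labelSelector:
          matchLabels:
            taskManagerId: 30caaec8-747f-4425-bc63-b6b8412b4fb4
        topologyKey: kubernetes.io/hostname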

What did you expect to happen?

After deploying a multi-replica Deployment with pod anti-affinity, the pods should remain evenly spread across the nodes even if some pods restart abnormally, as long as all nodes satisfy the scheduling constraints.
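
The expected balanced spread can also be expressed directly with a topologySpreadConstraint (available in the v1.21 cluster used here). This is only a sketch, assuming the pods carry the same taskManagerId label as above; it does not by itself change how terminating pods are counted:

spec:
  topologySpreadConstraints:
  - maxSkew: 1                         # allow at most 1 pod of imbalance between nodes
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway  # soft; use DoNotSchedule for a hard constraint
    labelSelector:
      matchLabels:
        taskManagerId: 30caaec8-747f-4425-bc63-b6b8412b4fb4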

How can we reproduce it (as minimally and precisely as possible)?

In a three-node cluster, create a Deployment with three replicas and pod anti-affinity, then manually kubectl delete one of the pods. It is very likely that one node ends up with two pods scheduled on it.
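
A minimal command-line sketch of the reproduction (the app=<your-app> label and pod name are placeholders for whatever the Deployment actually uses):

$ kubectl get pods -l app=<your-app> -o wide    # initially one pod per node
$ kubectl delete pod <one-of-the-pods>          # old pod enters Terminating
$ kubectl get pods -l app=<your-app> -o wide    # replacement frequently lands on a node that already has a replica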

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:25:06Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
NAME="Linux"
VERSION="Release 1.1.2"
ID="Linux"
VERSION_ID="Release 1.1.2"
PRETTY_NAME="Linux Release 1.1.2"
ANSI_COLOR="0;31"
$ uname -a
Linux cnode2 3.10.0-1160.31.1.hl06.el7.x86_64 #1 SMP Mon Aug 16 08:24:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, …) and versions (if applicable)

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 15 (12 by maintainers)

Most upvoted comments

kube-scheduler makes its decisions based on the pods that currently exist, and this may include pods that are still terminating.

This is by design, as pods can take an arbitrary amount of time to terminate.

If you want different behavior, there would need to be an API for it.
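
In other words, while the deleted pod is still Terminating it continues to match the anti-affinity selector, so its node looks occupied and a soft (preferred) rule lets the replacement double up elsewhere. A sketch of how to observe this (pod name is a placeholder):

$ kubectl delete pod <pod-name>     # old pod lingers in Terminating for up to terminationGracePeriodSeconds
$ kubectl get pods -o wide          # the replacement may already be scheduled while the old pod object still exists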