kubernetes: Potential bug in PodTopologySpread when nodeAffinity is specified

We know that nodeAffinity/nodeSelector is honored during the calculation of PodTopologySpread. However, when some existing pods match the incoming pod’s topologySpreadConstraints but sit on nodes excluded by that pod’s nodeAffinity, things get a bit tricky. Raising this issue to discuss whether we should treat it as a bug.
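
For the discussion below, here is a minimal sketch of how a single DoNotSchedule spread constraint can be modeled (an illustration only, with a made-up helper name, not the scheduler’s actual code): count the matching pods per topology domain, and allow a candidate node only if placing the pod in its domain keeps the skew within maxSkew. The whole question in this issue is which pods go into those counts.

    package main

    import "fmt"

    // allowedByConstraint is a toy model of a single DoNotSchedule
    // topologySpreadConstraint: placing the incoming pod in candidateDomain must
    // not raise the skew (that domain's count of matching pods minus the global
    // minimum across domains) above maxSkew.
    func allowedByConstraint(counts map[string]int, candidateDomain string, maxSkew int) bool {
    	minCount := -1
    	for _, c := range counts {
    		if minCount == -1 || c < minCount {
    			minCount = c
    		}
    	}
    	if minCount == -1 {
    		minCount = 0 // no known domains
    	}
    	// +1 accounts for the incoming pod itself, which matches its own labelSelector.
    	return counts[candidateDomain]+1-minCount <= maxSkew
    }

    func main() {
    	// 3 matching pods in zone1, 1 in zone2, maxSkew = 1.
    	counts := map[string]int{"zone1": 3, "zone2": 1}
    	fmt.Println(allowedByConstraint(counts, "zone1", 1)) // false: 3+1-1 = 3 > 1
    	fmt.Println(allowedByConstraint(counts, "zone2", 1)) // true:  1+1-1 = 1 <= 1
    }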

Here are the repro steps:

  1. Suppose you have a 3-node cluster, where worker1 and worker2 belong to zoneA and worker3 belongs to zoneB:
    kind-worker          Ready    kubernetes.io/hostname=kind-worker,zone=zoneA
    kind-worker2         Ready    kubernetes.io/hostname=kind-worker2,zone=zoneA
    kind-worker3         Ready    kubernetes.io/hostname=kind-worker3,zone=zoneB
    
  2. And create a deployment with 3 replicas. Suppose 2 land on worker2 and 1 lands on worker3.
    pause-58dffb5c6b-4kb5z   1/1     Running   kind-worker2
    pause-58dffb5c6b-d4nzr   1/1     Running   kind-worker2
    pause-58dffb5c6b-jpv9t   1/1     Running   kind-worker3
    
  3. Create a new deployment with topologySpreadConstraints as well as nodeAffinity (do not schedule onto worker2):
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: test
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: pause
      template:
        metadata:
          labels:
            app: pause
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: pause
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: NotIn
                    values: ["kind-worker2"]
          containers:
          - name: pause
            image: k8s.gcr.io/pause:3.6
    
  4. The test pod will land on worker3.
    pause-58dffb5c6b-4kb5z   1/1     Running   kind-worker2
    pause-58dffb5c6b-d4nzr   1/1     Running   kind-worker2
    pause-58dffb5c6b-jpv9t   1/1     Running   kind-worker3
    test-7f87db4875-7gqxh    1/1     Running   kind-worker3
    

This behavior concerns me: it looks like the 2 pods on worker2 are taken into account even though worker2 is excluded by nodeAffinity, so the pods in zoneA and zoneB end up distributed 2:2. However, shouldn’t we disregard matching pods that sit on nodes excluded by nodeAffinity? In that case, the test pod should land on worker1 and reach a 1:1 distribution.
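
Spelling out the arithmetic for the repro with a toy version of the DoNotSchedule check (an illustration only, not the plugin’s code), the two readings differ exactly in whether the 2 pods on worker2 are counted:

    package main

    import "fmt"

    // allowed is a toy DoNotSchedule check: placing the incoming pod in domain
    // must keep (count in domain + 1 - minimum count across domains) <= maxSkew.
    func allowed(counts map[string]int, domain string, maxSkew int) bool {
    	minCount := -1
    	for _, c := range counts {
    		if minCount == -1 || c < minCount {
    			minCount = c
    		}
    	}
    	return counts[domain]+1-minCount <= maxSkew
    }

    func main() {
    	// Observed behavior: the 2 matching pods on kind-worker2 are counted even
    	// though kind-worker2 is excluded by the incoming pod's nodeAffinity.
    	counted := map[string]int{"zoneA": 2, "zoneB": 1}
    	fmt.Println(allowed(counted, "zoneA", 1)) // false: kind-worker is rejected
    	fmt.Println(allowed(counted, "zoneB", 1)) // true:  pod lands on kind-worker3 -> 2:2

    	// Proposed reading: disregard matching pods on nodes excluded by nodeAffinity.
    	filtered := map[string]int{"zoneA": 0, "zoneB": 1}
    	fmt.Println(allowed(filtered, "zoneA", 1)) // true:  pod could land on kind-worker -> 1:1
    	fmt.Println(allowed(filtered, "zoneB", 1)) // false: kind-worker3 would be rejected
    }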

Please share your thoughts.

/sig scheduling /kind bug

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

One thing to be aware of is whether a potential fix could render some pods unschedulable and thus be a breaking change. And that’s where use cases weigh in.

I think we should consider this a bug.

node          kind-worker        kind-worker2       kind-worker3  kind-worker4
zone          zone1              zone1              zone2         zone2
existingPods  pod1               pod2               pod3, pod4    pod5
nodeStatus    resourceExhausted  resourceExhausted  normal        normal

Imagine that in this situation all of these pods are labeled with app: pause, and we want to apply a deployment.yaml like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: pause
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: NotIn
                values: ["kind-worker3"]
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.6

kind-worker3 is excluded from scheduling. But when we calculate the skew, we still count the pods on kind-worker3. Since {zone: zone1} : {zone: zone2} is 2:3, the new pod can only be scheduled to kind-worker or kind-worker2, and since those two are resource exhausted, it will stay in the Pending state.

But if we filtered out kind-worker3 (and the pods on it) when counting, the pod would be scheduled to kind-worker4 successfully. I think that is the right algorithm.
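
A rough sketch of the counting rule being proposed, using hypothetical types and names rather than the actual plugin code: when building the per-domain counts of matching pods, skip pods whose node does not pass the incoming pod’s nodeAffinity/nodeSelector, and only then apply the maxSkew check.

    package main

    import "fmt"

    // node is a hypothetical summary of one column of the table above.
    type node struct {
    	name            string
    	zone            string
    	matchesAffinity bool // does the node pass the incoming pod's nodeAffinity?
    	matchingPods    int  // existing pods on the node matching the labelSelector
    }

    // countsPerZone builds the per-zone counts of matching pods. With skipExcluded
    // set, pods on nodes that fail the incoming pod's nodeAffinity are not counted,
    // which is the behavior proposed in this comment.
    func countsPerZone(nodes []node, skipExcluded bool) map[string]int {
    	counts := map[string]int{}
    	for _, n := range nodes {
    		if skipExcluded && !n.matchesAffinity {
    			continue
    		}
    		counts[n.zone] += n.matchingPods
    	}
    	return counts
    }

    func main() {
    	nodes := []node{
    		{"kind-worker", "zone1", true, 1},
    		{"kind-worker2", "zone1", true, 1},
    		{"kind-worker3", "zone2", false, 2}, // excluded by nodeAffinity, hosts pod3 and pod4
    		{"kind-worker4", "zone2", true, 1},
    	}

    	// Current behavior: zone1:zone2 = 2:3, so with maxSkew=1 only a zone1 node
    	// keeps the skew in bounds, but both zone1 nodes are resource exhausted -> Pending.
    	fmt.Println(countsPerZone(nodes, false)) // map[zone1:2 zone2:3]

    	// Proposed behavior: zone1:zone2 = 2:1, so kind-worker4 (zone2) keeps the
    	// skew at 1+1-1 = 1 <= maxSkew and the pod can be scheduled there.
    	fmt.Println(countsPerZone(nodes, true)) // map[zone1:2 zone2:1]
    }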