kubernetes: Potential bug in PodTopologySpread when nodeAffinity is specified

We know that nodeAffinity/nodeSelector is honored during the calculation of PodTopologySpread. However, when some existing pods match the incoming pod’s topologySpreadConstraints but sit on nodes excluded by that pod’s nodeAffinity, things get a bit tricky. Raising this issue to discuss whether we should treat it as a bug.
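
For the discussion below, here is a minimal sketch of how a single DoNotSchedule spread constraint can be modeled (an illustration only, with a made-up helper name, not the scheduler’s actual code): count the matching pods per topology domain, and allow a candidate node only if placing the pod in its domain keeps the skew within maxSkew. The whole question in this issue is which pods go into those counts.

    package main

    import "fmt"

    // allowedByConstraint is a toy model of a single DoNotSchedule
    // topologySpreadConstraint: placing the incoming pod in candidateDomain must
    // not raise the skew (that domain's count of matching pods minus the global
    // minimum across domains) above maxSkew.
    func allowedByConstraint(counts map[string]int, candidateDomain string, maxSkew int) bool {
    	minCount := -1
    	for _, c := range counts {
    		if minCount == -1 || c < minCount {
    			minCount = c
    		}
    	}
    	if minCount == -1 {
    		minCount = 0 // no known domains
    	}
    	// +1 accounts for the incoming pod itself, which matches its own labelSelector.
    	return counts[candidateDomain]+1-minCount <= maxSkew
    }

    func main() {
    	// 3 matching pods in zone1, 1 in zone2, maxSkew = 1.
    	counts := map[string]int{"zone1": 3, "zone2": 1}
    	fmt.Println(allowedByConstraint(counts, "zone1", 1)) // false: 3+1-1 = 3 > 1
    	fmt.Println(allowedByConstraint(counts, "zone2", 1)) // true:  1+1-1 = 1 <= 1
    }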

Here are the repro steps:

  1. Suppose you have a 3-node cluster, where worker1 and worker2 belong to zoneA and worker3 belongs to zoneB:
    kind-worker          Ready    kubernetes.io/hostname=kind-worker,zone=zoneA
    kind-worker2         Ready    kubernetes.io/hostname=kind-worker2,zone=zoneA
    kind-worker3         Ready    kubernetes.io/hostname=kind-worker3,zone=zoneB
    
  2. And create a deployment with 3 replicas. Suppose 2 land on worker2 and 1 lands on worker3.
    pause-58dffb5c6b-4kb5z   1/1     Running   kind-worker2
    pause-58dffb5c6b-d4nzr   1/1     Running   kind-worker2
    pause-58dffb5c6b-jpv9t   1/1     Running   kind-worker3
    
  3. Create a new deployment with topologySpreadConstraints as well as nodeAffinity (do not schedule onto worker2):
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: test
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: pause
      template:
        metadata:
          labels:
            app: pause
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: pause
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: NotIn
                    values: ["kind-worker2"]
          containers:
          - name: pause
            image: k8s.gcr.io/pause:3.6
    
  4. The test pod will land on worker3.
    pause-58dffb5c6b-4kb5z   1/1     Running   kind-worker2
    pause-58dffb5c6b-d4nzr   1/1     Running   kind-worker2
    pause-58dffb5c6b-jpv9t   1/1     Running   kind-worker3
    test-7f87db4875-7gqxh    1/1     Running   kind-worker3
    

This behavior concerns me: it looks like the 2 pods on worker2 are taken into account even though worker2 is excluded by nodeAffinity, so the pods in zoneA and zoneB end up distributed 2:2. However, shouldn’t we disregard matching pods that sit on nodes excluded by nodeAffinity? In that case, the test pod should land on worker1 and reach a 1:1 distribution.
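
Spelling out the arithmetic for the repro with a toy version of the DoNotSchedule check (an illustration only, not the plugin’s code), the two readings differ exactly in whether the 2 pods on worker2 are counted:

    package main

    import "fmt"

    // allowed is a toy DoNotSchedule check: placing the incoming pod in domain
    // must keep (count in domain + 1 - minimum count across domains) <= maxSkew.
    func allowed(counts map[string]int, domain string, maxSkew int) bool {
    	minCount := -1
    	for _, c := range counts {
    		if minCount == -1 || c < minCount {
    			minCount = c
    		}
    	}
    	return counts[domain]+1-minCount <= maxSkew
    }

    func main() {
    	// Observed behavior: the 2 matching pods on kind-worker2 are counted even
    	// though kind-worker2 is excluded by the incoming pod's nodeAffinity.
    	counted := map[string]int{"zoneA": 2, "zoneB": 1}
    	fmt.Println(allowed(counted, "zoneA", 1)) // false: kind-worker is rejected
    	fmt.Println(allowed(counted, "zoneB", 1)) // true:  pod lands on kind-worker3 -> 2:2

    	// Proposed reading: disregard matching pods on nodes excluded by nodeAffinity.
    	filtered := map[string]int{"zoneA": 0, "zoneB": 1}
    	fmt.Println(allowed(filtered, "zoneA", 1)) // true:  pod could land on kind-worker -> 1:1
    	fmt.Println(allowed(filtered, "zoneB", 1)) // false: kind-worker3 would be rejected
    }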

Please share your thoughts.

/sig scheduling /kind bug

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

One thing to be aware of is whether a potential fix could render some pods unschedulable and thus be a breaking change. And that’s where use cases weigh in.

I think we should consider this a bug.

node          kind-worker        kind-worker2       kind-worker3  kind-worker4
zone          zone1              zone1              zone2         zone2
existingPods  pod1               pod2               pod3, pod4    pod5
nodeStatus    resourceExhausted  resourceExhausted  normal        normal

Imagine that in this situation all of these pods are labeled with app: pause, and we want to apply a deployment.yaml like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: pause
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: NotIn
                values: ["kind-worker3"]
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.6

kind-worker3 is excluded from scheduling. But when we calculate the skew, we still count the pods on kind-worker3. Since {zone: zone1} : {zone: zone2} is 2:3, the new pod can only be scheduled to kind-worker or kind-worker2, and since those two are resource exhausted, it will stay in the Pending state.

But if we filtered out kind-worker3 (and the pods on it) when counting, the pod would be scheduled to kind-worker4 successfully. I think that is the right algorithm.
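
A rough sketch of the counting rule being proposed, using hypothetical types and names rather than the actual plugin code: when building the per-domain counts of matching pods, skip pods whose node does not pass the incoming pod’s nodeAffinity/nodeSelector, and only then apply the maxSkew check.

    package main

    import "fmt"

    // node is a hypothetical summary of one column of the table above.
    type node struct {
    	name            string
    	zone            string
    	matchesAffinity bool // does the node pass the incoming pod's nodeAffinity?
    	matchingPods    int  // existing pods on the node matching the labelSelector
    }

    // countsPerZone builds the per-zone counts of matching pods. With skipExcluded
    // set, pods on nodes that fail the incoming pod's nodeAffinity are not counted,
    // which is the behavior proposed in this comment.
    func countsPerZone(nodes []node, skipExcluded bool) map[string]int {
    	counts := map[string]int{}
    	for _, n := range nodes {
    		if skipExcluded && !n.matchesAffinity {
    			continue
    		}
    		counts[n.zone] += n.matchingPods
    	}
    	return counts
    }

    func main() {
    	nodes := []node{
    		{"kind-worker", "zone1", true, 1},
    		{"kind-worker2", "zone1", true, 1},
    		{"kind-worker3", "zone2", false, 2}, // excluded by nodeAffinity, hosts pod3 and pod4
    		{"kind-worker4", "zone2", true, 1},
    	}

    	// Current behavior: zone1:zone2 = 2:3, so with maxSkew=1 only a zone1 node
    	// keeps the skew in bounds, but both zone1 nodes are resource exhausted -> Pending.
    	fmt.Println(countsPerZone(nodes, false)) // map[zone1:2 zone2:3]

    	// Proposed behavior: zone1:zone2 = 2:1, so kind-worker4 (zone2) keeps the
    	// skew at 1+1-1 = 1 <= maxSkew and the pod can be scheduled there.
    	fmt.Println(countsPerZone(nodes, true)) // map[zone1:2 zone2:1]
    }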