kubernetes: Potential bug in PodTopologySpread when nodeAffinity is specified
We know that nodeAffinity/nodeSelector is honored when calculating PodTopologySpread. However, when some existing pods match the incoming pod’s topologySpreadConstraints while also sitting on nodes excluded by that pod’s nodeAffinity, things get a bit tricky. Raising this issue to discuss whether we should read it as a bug.
Here are the repro steps:
- Suppose you have a 3-node cluster, worker1 and worker2 belong to zoneA, and worker3 belongs to zoneB:
```
kind-worker    Ready   kubernetes.io/hostname=kind-worker,zone=zoneA
kind-worker2   Ready   kubernetes.io/hostname=kind-worker2,zone=zoneA
kind-worker3   Ready   kubernetes.io/hostname=kind-worker3,zone=zoneB
```
- And make a deployment with 3 replicas. Suppose 2 land on worker2, and 1 lands on worker3:
```
pause-58dffb5c6b-4kb5z   1/1   Running   kind-worker2
pause-58dffb5c6b-d4nzr   1/1   Running   kind-worker2
pause-58dffb5c6b-jpv9t   1/1   Running   kind-worker3
```
- Create a new deployment with topologySpreadConstraints as well as nodeAffinity (which excludes worker2):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: pause
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: NotIn
                values: ["kind-worker2"]
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.6
```
- The test pod will land on worker3:
```
pause-58dffb5c6b-4kb5z   1/1   Running   kind-worker2
pause-58dffb5c6b-d4nzr   1/1   Running   kind-worker2
pause-58dffb5c6b-jpv9t   1/1   Running   kind-worker3
test-7f87db4875-7gqxh    1/1   Running   kind-worker3
```
This behavior concerns me: it looks like the 2 pods on worker2 are taken into consideration even though worker2 is excluded by nodeAffinity, so the zoneA:zoneB distribution ends up 2:2. However, shouldn’t we disregard matching pods that sit on nodes excluded by nodeAffinity? In that case, the test pod should land on worker1 to reach a 1:1 distribution.
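To make the two interpretations concrete, here is a minimal sketch (illustrative Python, not the actual kube-scheduler code) that counts selector-matching pods per zone for the repro above, with and without disregarding pods on nodes excluded by the incoming pod's nodeAffinity:

```python
# Node -> zone mapping and pod placements from the repro above.
node_zone = {
    "kind-worker": "zoneA",
    "kind-worker2": "zoneA",
    "kind-worker3": "zoneB",
}
# Nodes hosting each existing pod that matches the labelSelector.
existing_pods = ["kind-worker2", "kind-worker2", "kind-worker3"]
# Nodes excluded by the incoming pod's required nodeAffinity.
excluded = {"kind-worker2"}

def zone_counts(pods, ignore_excluded):
    """Count matching pods per zone, optionally skipping excluded nodes."""
    counts = {"zoneA": 0, "zoneB": 0}
    for node in pods:
        if ignore_excluded and node in excluded:
            continue
        counts[node_zone[node]] += 1
    return counts

# Current behavior: pods on the excluded node still count.
print(zone_counts(existing_pods, ignore_excluded=False))
# -> {'zoneA': 2, 'zoneB': 1}: placing on worker3 gives 2:2 (skew 0),
#    while placing on worker1 gives 3:1 (skew 2 > maxSkew), so worker3 wins.

# Questioned behavior: disregard pods on affinity-excluded nodes.
print(zone_counts(existing_pods, ignore_excluded=True))
# -> {'zoneA': 0, 'zoneB': 1}: placing on worker1 would give 1:1.
```

Under the current behavior only worker3 satisfies `maxSkew: 1`; under the filtered counting, worker1 becomes the preferred target.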
Please share your thoughts.
/sig scheduling
/kind bug
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (16 by maintainers)
One thing to be aware of is whether a potential fix could render some pods unschedulable, and thus be a breaking change. That’s where use cases weigh in.
I think we should consider it a bug.

Imagine a situation where all of these pods are labeled with `app: pause`, and we want to apply a deployment.yaml in which `kind-worker3` is excluded from scheduling. But today, when we calculate the skew, we still count the pods on `kind-worker3`. Since `{zone: zone1}`:`{zone: zone2}` is `2:3`, we will try to schedule the new pod to `kind-worker` or `kind-worker1`; but since those nodes are resource-exhausted, the pod will stay Pending. If we instead filtered out `kind-worker3`, the pod would be scheduled to `kind-worker4` successfully. I think that’s the right algorithm.
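The filtering the commenter describes could be sketched roughly like this (illustrative Python; the function name and shapes are assumptions, not the kube-scheduler API): restrict the per-domain counts to nodes the incoming pod can actually be scheduled onto, so pods on affinity-excluded nodes no longer contribute to the skew.

```python
def spread_counts(pods_by_node, node_labels, topology_key, allowed_nodes):
    """Count selector-matching pods per topology domain, restricted to
    nodes that pass the incoming pod's required nodeAffinity."""
    counts = {}
    for node, num_pods in pods_by_node.items():
        if node not in allowed_nodes:
            continue  # excluded node: its pods no longer skew the result
        domain = node_labels[node][topology_key]
        counts[domain] = counts.get(domain, 0) + num_pods
    return counts

# Usage with the 3-node repro from the issue (worker2 excluded by affinity):
pods_by_node = {"kind-worker": 0, "kind-worker2": 2, "kind-worker3": 1}
node_labels = {
    "kind-worker": {"zone": "zoneA"},
    "kind-worker2": {"zone": "zoneA"},
    "kind-worker3": {"zone": "zoneB"},
}
allowed = {"kind-worker", "kind-worker3"}
print(spread_counts(pods_by_node, node_labels, "zone", allowed))
# -> {'zoneA': 0, 'zoneB': 1}
```

Whether to apply this filter is exactly the compatibility question raised above: it changes which placements satisfy `DoNotSchedule`, so some previously schedulable pods could change behavior.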